Decision Tree Applet
This applet allows you to interactively build decision trees and to visualize them. To run it, either use the on-line version or download the jar file. The applet requires Java version 1.1.5 or later.
- On-line version
- Download
- A simple example
- Short documentation
- Decision tree method implementation
- Database format
On-line version
Choose one database:
- German credit allocation DB
- Small database containing animal descriptions
- Survival on the Titanic catastrophe
- Tennis playing example used by Quinlan to illustrate ID3
- Small Electric power systems stability database
- Iris database
Download
dtapplet.tar.gz (118574 bytes). This file contains:
- DT.jar - the jar file containing the java classes and sources (under the GNU General Public License)
- dtapplet.html - load it with appletviewer or with your browser to launch the applet
- databases - a directory containing all the example databases
A simple example
First, launch the applet with the zoo database. This database contains about a hundred animals, each described by 17 variables. The first 16 variables are physical characteristics of the animal (whether or not it gives milk, whether or not it has hair, ...). The last one is the type of the animal, one of 7 classes (mammal, fish, bird, shellfish, insect, frog or reptile).
Suppose you want to build a tree which tries to predict the type of an animal from its physical characteristics. First, select ``type'' as the goal attribute and all the remaining attributes as candidate attributes (in the top right box). Then choose a learning set; by default, two thirds of the database (67 animals) are selected at random. To start tree building, click on the ``new tree'' button. At this point, you can either build the tree completely in one step by clicking on the ``build'' button, or observe the induction step by step by repeatedly clicking on the ``split'' button (until there are no more ``open'' nodes).
To test the resulting tree, select all the cases which do not belong to the learning set by choosing the ``not in ls'' selection mode for the test set (at the bottom of the applet screen) and then click on ``test tree''. The percentage of errors appears in the text box to the right of the ``test tree'' button.
Short documentation
The screen of the applet is divided vertically into four parts.
On the top are the input parameters of the tree:
- Database URL
- a URL. Click on the Load Database button to load this database.
- Goal Attribute
- the goal attribute. You can select only one symbolic attribute.
- Candidate attributes
- This list box allows you to exclude some attributes from consideration.
- Entropy Threshold and Inf x LS size Threshold
- These parameters determine when the splitting of a node is stopped. The first one is a threshold on the class entropy in the node; the second one is a threshold on the product of the information gain of the best test and the size of the local learning set in this node (see the sketch just after this list).
- Learning Set
- There are three ways of selecting a learning set: random, first (the first n examples of the database are selected) and last (the last n examples are selected).
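To make the role of the two thresholds concrete, here is a minimal sketch of the stopping rule they define (written for illustration only; the method name and parameter names are assumptions, not the applet's actual code):

    // Hypothetical sketch of the stopping rule; names are illustrative,
    // not taken from the applet sources.
    boolean stopSplitting(double classEntropy, double bestInfoGain,
                          int localLearningSetSize,
                          double entropyThreshold, double infTimesLsThreshold) {
        // stop if the node is already (almost) pure ...
        if (classEntropy <= entropyThreshold) return true;
        // ... or if the best test is not considered statistically reliable
        return bestInfoGain * localLearningSetSize <= infTimesLsThreshold;
    }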
When all the input parameters are set, click the New tree button to make the top node appear in the center part of the screen. In this part, any node can be selected by clicking on it; some information about it then appears in the field below the canvas.
On the next part of the screen are the controls for tree building; finally, at the bottom are the test controls:
- Scan attributes
- scans all the attributes and computes the best test for each of them in the selected node. The result of the scanning is displayed in the selection box above this row of buttons.
- Split
- splits the current node according to the selected choice in the scanning selection box. If Scan attributes has not been used, the best test is automatically chosen.
- Build
- builds the tree completely.
- Simple and Detailed
- two different ways of displaying a tree. In the latter, each node is represented by a box divided into vertical color bands, each band representing the proportion of one class in this node.
- Refresh
- refreshes the tree.
- Zoom
- ...
- Test set
- There are five ways to select a test set: the same three as for the learning set, plus the learning set itself, and all the objects which do not belong to the learning set ('Not in ls').
- Test tree
- tests the tree even if the building is not finished.
- Cross validation
- assesses the error rate of the tree method by cross validation. Only the test set size argument is taken into account. The learning set is randomly partitioned into subsets of the size of the test set; then, for each subset, a tree is built with the remaining objects of the learning set and the error rate obtained on this subset is computed. The average error rate is returned (see the sketch below).
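For illustration, the cross-validation procedure can be sketched in Java as follows (this is not the applet's code; Example, Tree, buildTree and errorRate are assumed helpers, and java.util imports are assumed):

    // Hypothetical sketch of the cross-validation loop described above.
    // Example, Tree, buildTree and errorRate are assumed helpers.
    double crossValidate(List<Example> learningSet, int testSetSize) {
        List<Example> shuffled = new ArrayList<>(learningSet);
        Collections.shuffle(shuffled);                      // random partition
        double totalError = 0.0;
        int folds = 0;
        for (int start = 0; start < shuffled.size(); start += testSetSize) {
            int end = Math.min(start + testSetSize, shuffled.size());
            List<Example> heldOut = shuffled.subList(start, end);
            List<Example> rest = new ArrayList<>(shuffled.subList(0, start));
            rest.addAll(shuffled.subList(end, shuffled.size()));
            Tree tree = buildTree(rest);                    // tree built on the remaining objects
            totalError += errorRate(tree, heldOut);         // error rate on the held-out subset
            folds++;
        }
        return totalError / folds;                          // average error rate
    }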
Decision tree method implementation
A decision tree is a classification model, i.e. it represents a relationship between some numerical or symbolic inputs (the attributes) and one symbolic output. In a decision tree, each interior node is labeled with a test built upon one attribute and each leaf is labeled with a class (one value of the output). To find the output or class associated with some input values, one starts at the top node of the tree and sequentially applies the tests encountered to select the appropriate successor. Eventually a unique terminal node is reached and the class stored there is assigned to this case.
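As a minimal sketch of this traversal (illustrative names, symbolic tests only, java.util imports assumed; these are not the applet's classes):

    // Illustrative node structure and classification walk; not the applet's code.
    // Tests on numerical attributes (thresholds) are omitted for brevity.
    class Node {
        String testAttribute;            // attribute tested in an interior node
        Map<String, Node> children;      // one successor per value of the tested attribute
        String leafClass;                // class stored in a terminal node
        boolean isLeaf() { return children == null; }
    }

    String classify(Node root, Map<String, String> attributeValues) {
        Node node = root;
        while (!node.isLeaf()) {
            // apply the test of the current node and follow the matching branch
            node = node.children.get(attributeValues.get(node.testAttribute));
        }
        return node.leafClass;           // class of the terminal node reached
    }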
A decision tree is built from a set of examples of the relationship (also called objects). We start with a single node which contains all the objects. Then we recursively split the nodes, selecting at each step an attribute and a test built upon this attribute. The test is selected so as to split the subset corresponding to the node into non-overlapping sets in which the objects have, as much as possible, the same class. To achieve this, a score measure is used to assess the candidate tests; here, the score measure is the information gain. The splitting of a node is stopped in two cases: when the class entropy in this node is low enough, i.e. lower than a given threshold (if this threshold is 0, the splitting is stopped when all the objects belong to the same class), or when the information gain of the best test multiplied by the size of the object set is too low (i.e. again lower than a given threshold). In the latter case, the best test is not statistically reliable and it is better not to split the node.
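The score measure itself can be written compactly. The following is a hedged sketch of the class entropy and information gain computations for a test on a symbolic attribute (illustrative only, java.util imports assumed; this is not the applet's implementation):

    // Illustrative entropy and information gain computations; not the applet's code.
    double entropy(List<String> classes) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : classes) counts.merge(c, 1, Integer::sum);
        double h = 0.0, n = classes.size();
        for (int count : counts.values()) {
            double p = count / n;
            h -= p * Math.log(p) / Math.log(2);   // class entropy in bits
        }
        return h;
    }

    // Information gain of a symbolic test: class entropy before the split
    // minus the weighted average class entropy of the resulting subsets.
    double informationGain(List<Map<String, String>> examples,
                           String testAttribute, String goalAttribute) {
        List<String> all = new ArrayList<>();
        Map<String, List<String>> partition = new HashMap<>();
        for (Map<String, String> e : examples) {
            all.add(e.get(goalAttribute));
            partition.computeIfAbsent(e.get(testAttribute), k -> new ArrayList<>())
                     .add(e.get(goalAttribute));
        }
        double gain = entropy(all);
        for (List<String> subset : partition.values())
            gain -= subset.size() / (double) examples.size() * entropy(subset);
        return gain;
    }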
Usually, this growing phase is followed by a pruning phase using
another validation set but this part of the process is not implemented
here.
Database format
The database format is very simple. The first line is the name of the database. The second line contains the attribute names, each immediately followed by its type: numerical or symbolic. A numerical attribute can be real or integer (in any case it will be converted to a Java float value). A symbolic attribute denotes any discrete attribute; its values can be symbols, numbers or both (on the Java side, they will be converted to strings). Each subsequent line corresponds to one example and contains the values of the attributes.
If the first attribute name is object and its type is name, then the first column is the object name (otherwise the name will be computed on the fly from the line number). Comment lines start with a ';', and I think you can add as many blank spaces or lines as you want.
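As an illustration of this layout, here is a small hypothetical file (made up for this documentation, not one of the distributed databases; fields are assumed to be separated by whitespace):

    ; a hypothetical example database, for illustration only
    weather
    object name outlook symbolic temperature numerical play symbolic
    day1 sunny 30 no
    day2 rainy 18 no
    day3 overcast 21 yes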
For real examples, here are the omib database and the zoo database.
Pierre Geurts (p.geurts@ulg.ac.be)