Pierre Geurts


Decision Tree Applet

This applet allows you to build decision trees interactively and to visualize them. To run it, either use the on-line version or download the jar file. The applet requires Java version 1.1.5 or later.



On-line version

Choose one database :

Download

dtapplet.tar.gz (118574 bytes)

This file contains :

A simple example

First, launch the applet with the zoo database. This database contains about a hundred animals, each one described by 17 variables. The first 16 variables are physical characteristics of the animal (whether or not it gives milk, whether or not it has hair,...). The last one is the type (among 7) of the animal (mammal, fish, bird, shellfish, insect, frog or reptile).

Suppose you want to build a tree which tries to predict the type of an animal from its physical characteristics. First, select ``type'' as the goal attribute and all the remaining attributes as candidate attributes (in the top right box). Then choose a learning set. By default, two thirds of the database (67 animals) are selected at random. To start tree building, click on the ``new tree'' button. At this point, you can either build the tree completely in one step by clicking on the ``build'' button, or observe the induction step by step by repeatedly clicking on the ``split'' button (until there are no more ``open'' nodes).

To test the resulting tree, select all the cases which do not belong to the learning set by choosing the ``not in ls'' selection mode for the test set (at the bottom of the applet screen), then click on ``test tree''. The percentage of errors appears in the text box to the right of the ``test tree'' button.


Short documentation

The screen of the applet is divided vertically into four parts.

On the top are the input parameters of the tree :

Database url
a URL. Click on the Load Database button to load this database.
Goal Attribute
the goal attribute. You can select only one symbolic attribute.
Candidate attributes
This list box lets you exclude some attributes from tree building.
Entropy Threshold and Inf x LS size Threshold
These parameters determine when the splitting of a node is stopped. The first one is a threshold on the class entropy in the node; the second one is a threshold on the product of the information gain of the best test and the size of the local learning set in this node (see the sketch after this list).
Learning Set
There are three ways of selecting a learning set : random, first (the first n examples of the database) and last (the last n examples of the database).
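
To make the role of the two thresholds concrete, here is a minimal Java sketch of the stopping test; the names (classEntropy, bestGain, and so on) are illustrative assumptions, not the applet's actual code :

    // Hypothetical check of the two stopping criteria described above.
    boolean stopSplitting(double classEntropy, double bestGain, int localLsSize,
                          double entropyThreshold, double gainTimesSizeThreshold) {
        if (classEntropy <= entropyThreshold) {
            return true;   // the node is (almost) pure
        }
        if (bestGain * localLsSize <= gainTimesSizeThreshold) {
            return true;   // the best test is not statistically reliable
        }
        return false;
    }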

When all the input parameters are set, click the New tree button to make the top node appear in the center part of the screen. In this part, it is possible to select any node by clicking on it; some information about it then appears in the field below the canvas.

On the next part of the screen are the controls for tree building :

Scan attributes
scans all the attributes and computes the best test for each of them in the selected node. The result of the scan is displayed in the selection box above this line of buttons.
Split
splits the current node according to the choice selected in the scanning selection box. If Scan attributes has not been used, the best test is chosen automatically.
Build
builds the tree completely.
Simple and Detailed
two different ways of displaying a tree. In the latter, each node is represented by a box divided into vertical color bands, each one representing the proportion of one class in this node.
Refresh
refreshes the tree display.
Zoom
...

Finally, at the bottom of the screen are the test control buttons :
Test set
There are five ways to select a test set : the same three as for the learning set, plus the learning set itself and all the objects which do not belong to the learning set (``not in ls'').
Test tree
tests the tree, even if it is not completely built.
Cross validation
assesses the error rate of the tree by cross-validation. Only the test set size argument is taken into account. The learning set is randomly partitioned into sets of the size of the test set; then, for each small set, a tree is built with the remaining objects of the learning set and the error rate obtained on this small set is computed. The average error rate is returned (see the sketch below).
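
As an illustration, this cross-validation procedure could be sketched in Java roughly as follows; Example, Tree, buildTree and errorRate are hypothetical stand-ins for the applet's internals, and the java.util collection classes are assumed to be imported :

    // Hypothetical cross-validation loop following the description above.
    double crossValidate(List<Example> learningSet, int testSetSize) {
        List<Example> shuffled = new ArrayList<Example>(learningSet);
        Collections.shuffle(shuffled);                        // random partition
        double totalError = 0.0;
        int folds = 0;
        for (int start = 0; start + testSetSize <= shuffled.size(); start += testSetSize) {
            List<Example> fold = shuffled.subList(start, start + testSetSize);
            List<Example> rest = new ArrayList<Example>(shuffled.subList(0, start));
            rest.addAll(shuffled.subList(start + testSetSize, shuffled.size()));
            Tree tree = buildTree(rest);                      // build on the remaining objects
            totalError += errorRate(tree, fold);              // error on the held-out subset
            folds++;
        }
        return totalError / folds;                            // average error rate
    }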


Decision tree method implementation :

A decision tree is a classification model, i.e. it represents a relationship between some numerical or symbolic inputs (the attributes) and one symbolic output. In a decision tree, each interior node is labeled with a test built upon one attribute and each leaf is labeled with a class (one value of the output). To find the output or class associated with some input values, one starts at the top node of the tree and applies the tests encountered sequentially in order to select the appropriate successor. Finally a unique terminal node is reached and the class stored there is assigned to this case.
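
As an illustration, this traversal could be written in Java roughly as below; the Node fields and the purely numerical test are simplifying assumptions, not the applet's internal representation :

    // Hypothetical node structure and classification by top-down traversal.
    class Node {
        boolean isLeaf;
        String predictedClass;     // class stored in a leaf
        int attributeIndex;        // attribute tested in an interior node
        double threshold;          // numerical test: value < threshold ?
        Node left, right;          // successors for the two test outcomes

        String classify(double[] attributeValues) {
            if (isLeaf) {
                return predictedClass;
            }
            // apply the test and follow the appropriate successor
            if (attributeValues[attributeIndex] < threshold) {
                return left.classify(attributeValues);
            }
            return right.classify(attributeValues);
        }
    }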

A decision tree is built from a set of examples of the relationship (also called objects). We start with a single node which contains all the objects. Then we recursively split the nodes, selecting at each step an attribute and a test built upon this attribute. The test is selected so as to split the subset corresponding to the node into non-overlapping sets in which the objects have, as much as possible, the same class. To achieve this goal, a score measure is used to assess the tests; here, the score measure is the information gain. The splitting of a node is stopped in two cases : when the class entropy in the node is low enough (i.e. lower than a given threshold; if this threshold is 0, the splitting stops when all the objects belong to the same class), or when the information gain of the best test multiplied by the size of the object set is too low (i.e. again lower than a given threshold). In the latter case, the best test is not statistically reliable and it is better not to split the node.
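
For reference, the entropy and information gain used as the score measure could be computed as in the following Java sketch (binary splits only; the class counts are assumed to be available) :

    // Shannon entropy (in bits) of the class distribution in a set of objects.
    double entropy(int[] classCounts, int total) {
        double h = 0.0;
        for (int i = 0; i < classCounts.length; i++) {
            if (classCounts[i] > 0) {
                double p = (double) classCounts[i] / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // Information gain of a test splitting 'total' objects into two subsets.
    double informationGain(int[] parentCounts, int total,
                           int[] leftCounts, int leftTotal,
                           int[] rightCounts, int rightTotal) {
        double before = entropy(parentCounts, total);
        double after = (leftTotal * entropy(leftCounts, leftTotal)
                      + rightTotal * entropy(rightCounts, rightTotal)) / total;
        return before - after;
    }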

Usually, this growing phase is followed by a pruning phase using another validation set but this part of the process is not implemented here.


Database format :

The database format is very simple. The first line is the name of the database. The second line contains the attribute names, each immediately followed by its type : numerical or symbolic. A numerical attribute can be real or integer (in any case it will be converted to a Java float value). A symbolic attribute denotes any discrete attribute; its values can be symbols, numbers or a mixture of both (on the Java side, they will be converted to strings). Each following line corresponds to one example and contains the values of the attributes.

If the first attribute name is object and its type is name, then the first column is the object name (otherwise the name will be computed on the fly from the line number). Comment lines start with a ';' and I think you can add as many blank spaces or blank lines as you want.

For instance, here are the omib database and the zoo database.
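
As a purely made-up illustration of the layout (not one of the databases above), a file could look like this :

    ; a tiny hypothetical database in the format described above
    toy-weather
    object name outlook symbolic temperature numerical play symbolic
    day1 sunny 30 no
    day2 rainy 18 yes
    day3 overcast 22 yes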


Pierre Geurts (p.geurts@ulg.ac.be)