next up previous
Next: Parzen windows Up: Methods Previous: Support vector machines

Decision trees

We compare the performance of the SVM classifiers described above with that of four other classifiers. The first two are decision tree classifiers. Decision trees are a standard tool in data mining, and many are available in packages such as CART [Breiman et al., 1984] and C4.5 [Quinlan, 1997]. Decision trees are generally preferred over other nonparametric techniques because of the readability of their learned hypotheses and the efficiency of training and evaluation.

Decision trees are rooted, usually binary trees, with simple classifiers placed at each internal node and a classification at each leaf. In order to evaluate a particular tree T with respect to an input x, each classifier in the tree is assigned the argument x. The outputs of the simple classifiers at the nodes determine a unique path from the root to a leaf of the decision tree: at each internal node, the left edge to a child is taken if the output of the function associated with that internal node is +1, and vice versa if it is -1. This path is known as the evaluation path. The value of the function T(x) is the classification associated with the leaf reached by the evaluation path.

Decision trees are generally learned by means of a top down growth procedure, which starts from the root node and greedily chooses a split of the data that maximizes some cost function, usually a measure of the ``impurity'' of the subsamples implicitly defined by the split. After choosing a split, the subsamples are then mapped to the two children nodes. This procedure is then recursively applied to the children, and the tree is grown until some stopping criterion is met. The tree is then used as a starting point for a bottom up search, performing a pruning of the tree. This eliminates nodes that are redundant or are unable to ``pay for themselves'' in terms of the cost function.

Typically, the simple classifier at an internal node compares one of the input attributes against a threshold. This test partitions the input space with axis parallel splits. The standard algorithm of this kind is C4.5 [Quinlan, 1997]. Another strategy uses hyperplanes in general position. This is the technique adopted by systems like OC1. We use an improved version of OC1, called MOC1, which implements a bias toward large margin splits in the purity measure, theoretically motivated by Vapnik-Chervonenkis theory. MOC1 has been shown to outperform the standard OC1 [Wu et al., 1999].

We use the systems C4.5 and MOC1 in their basic version with all the default settings. Note that it would be possible to devise modifications of the same systems to account, for example, for the unequal numbers of positive and negative training examples, which might improve their performance.


next up previous
Next: Parzen windows Up: Methods Previous: Support vector machines
Michael Brown
1999-11-05