
Experimental setup

Performance is measured using three-way cross-validation. The gene expression vectors are randomly divided into three groups. Classifiers are then trained on two of the groups and tested on the third.
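The split described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names, the fixed seed, and the assumption that each group serves as the test set exactly once are ours.

```python
import random


def three_way_split(n_items, seed=0):
    """Randomly partition the indices 0..n_items-1 into three groups.

    Hypothetical helper: the seed and the round-robin assignment are
    illustrative choices, not taken from the original experiments.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into three groups of near-equal size.
    return [indices[i::3] for i in range(3)]


def cross_validation_rounds(n_items, seed=0):
    """Return three (train, test) index pairs, holding out one group per round."""
    groups = three_way_split(n_items, seed)
    rounds = []
    for held_out in range(3):
        test = groups[held_out]
        # Training set: the union of the two groups that are not held out.
        train = [i for k, group in enumerate(groups) if k != held_out for i in group]
        rounds.append((train, test))
    return rounds
```

Each gene expression vector then appears in exactly one test set, so error counts can be summed over the three rounds.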

The performance of each classifier is measured by examining how well the classifier identifies the positive and negative examples in the test set. Most of the classification methods return a rank ordering of the test set. Given this ordering and a classification threshold, each gene in the test set can be labeled in one of four ways: false positives are genes that the classifier places within the given class, but MYGD classifies as non-members; false negatives are genes that the classifier places outside the class, but MYGD classifies as members; true positives are class members according to both the classifier and MYGD, and true negatives are non-members according to both. For each method, we find the classification threshold that minimizes the cost function, $fp + 2 \cdot fn$, where fp is the number of false positives, and fn is the number of false negatives. The false negatives are weighted more heavily than the false positives because, for these data, the number of positive examples is small compared to the number of negatives. Results are reported in terms of the false positive and false negative error rates as well as the cost at the minimal classification threshold.
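The threshold search can be illustrated with a short sketch. It assumes the classifier's output has already been turned into a list of reference labels (true for class members according to MYGD) ordered from the most to the least confidently positive gene; the function name, the `fn_weight` parameter, and the tie-breaking rule are hypothetical.

```python
def best_threshold(ranked_labels, fn_weight=2):
    """Find the cut-off in a ranked list that minimizes fp + fn_weight * fn.

    ranked_labels: booleans, True if the gene is a class member according
    to the reference annotation, ordered from the gene the classifier
    ranks highest to the gene it ranks lowest.
    Returns (k, fp, fn, cost), where the top k genes are labeled positive.
    Ties are broken in favor of the smallest k (an illustrative choice).
    """
    total_pos = sum(ranked_labels)
    # k = 0: nothing is labeled positive, so every member is a false negative.
    best = (0, 0, total_pos, fn_weight * total_pos)
    fp = 0
    tp = 0
    for k, is_member in enumerate(ranked_labels, start=1):
        if is_member:
            tp += 1
        else:
            fp += 1
        fn = total_pos - tp
        cost = fp + fn_weight * fn
        if cost < best[3]:
            best = (k, fp, fn, cost)
    return best


# Toy ranking with three class members among six genes.
ranked = [True, True, False, True, False, False]
print(best_threshold(ranked))  # (4, 1, 0, 1)
```

In the toy ranking, labeling the top four genes as positive admits one false positive but recovers all three members, giving a cost of 1.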

Note that the two decision tree methods do not produce a rank ordering of test set points, making it impossible to vary the classification threshold. Therefore, for the decision tree methods we use the default threshold, rather than the one found by minimizing the cost function.


Michael Brown
1999-11-05