Our experiments show the benefits of classifying genes using support vector machines trained on DNA microarray expression data. We begin with a comparison of SVMs versus four non-SVM methods and show that SVMs provide superior performance. We then examine more closely the performance of several different SVMs and demonstrate the superiority of the radial basis function SVM. Finally, we examine in detail some of the apparent errors made by the radial basis function SVM and show that many of the apparent errors are in fact biologically reasonable classifications. Most of the results reported here can be accessed via the web at http://www.cse.ucsc.edu/research/compbio.
For the data analyzed here, support vector machines provide better
classification performance than the competing classifiers.
Tables 2 and 3 summarize the
results of a three-fold cross-validation experiment using all eight of
the classifiers described in Section 5, including
four SVM variants, Parzen windows, Fisher's linear discriminant and
two decision tree learners. The five columns labeled "Learned
threshold" summarize classification performance. In this case, the
method must produce a binary classification label for each member of
the test set. Overall performance of each method is judged using the
cost function,
.
For every class (except the last,
unlearnable class), the best-performing method using the learned
threshold is the radial basis support vector machine. Other cost
functions, with different relative weights of the false positive and
false negative rates, yield similar rankings of performance. These
results are not statistically sufficient to demonstrate unequivocally
that one method is better than the other; however, they do give some
evidence. For example, in five separate tests, the radial basis SVM
performs better than Fisher's linear discriminant. Under the null
hypothesis that the two methods are equally good, so that each is equally
likely to win any given test, the probability that the radial basis SVM
would perform better all five times is (1/2)^5 = 1/32 = 0.03125.
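The arithmetic behind this probability can be checked with a short sketch. It assumes the five tests are independent and that, under the null hypothesis, each method wins a given test with probability 1/2; the function name is illustrative and not part of the original analysis.

```python
from fractions import Fraction
from math import comb


def prob_at_least_k_wins(n_tests: int, k_wins: int) -> Fraction:
    """Probability of winning at least k_wins out of n_tests fair,
    independent head-to-head comparisons (win probability 1/2 each)."""
    favourable = sum(comb(n_tests, k) for k in range(k_wins, n_tests + 1))
    return Fraction(favourable, 2 ** n_tests)


# Five wins out of five tests, as in the comparison above.
p = prob_at_least_k_wins(5, 5)
print(p, float(p))  # 1/32 0.03125
```

With only five comparisons this is the smallest probability such a test can produce, which is why the evidence is described as suggestive rather than conclusive.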
In addition to producing binary classification labels, six of the eight methods produce a ranked list of the test set examples. This ranked list provides more information than the simple binary classification labels. For example, scanning the ranked lists allows the experimenter to easily focus on the genes that lie on the border of the given class. Ranked lists produced by the radial basis SVM for each of the five classes are available at http://www.cse.ucsc.edu/research/compbio/genex. A perfect classifier will place all positive test set examples before the negative examples in the ranked list and will correctly specify the decision boundary to lie between the positives and the negatives. An imperfect classifier, on the other hand, will either produce an incorrect ordering of the test set examples or use an inaccurate classification threshold. Thus, the performance can be improved by fixing either the ranking or the threshold. However, given an improper ranking, no classification threshold can yield perfect performance. Therefore, we focus on finding a correct ranking of the test set. The columns labeled "Optimized threshold" in Tables 2 and 3 show the best performance that could be achieved if the classifier were capable of learning the decision threshold perfectly. These results further demonstrate the superior performance of the radial basis SVM: it performs best in four of the five learnable classes. Furthermore, the performance of the scaled dot product SVMs improves so that in nearly every class, the best four classifiers are the four SVM methods.
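The idea of an optimized threshold can be illustrated with a minimal sketch: given a ranked list, try every possible cut point and keep the one with the lowest cost. The weights `fp_weight` and `fn_weight` are illustrative placeholders, since the cost formula is not reproduced in this section, and the function names and toy scores are hypothetical.

```python
def cost(fp, fn, fp_weight=1.0, fn_weight=2.0):
    """Weighted error cost. The default weights are assumptions chosen
    for illustration only."""
    return fp_weight * fp + fn_weight * fn


def best_threshold_cost(scores, labels, fp_weight=1.0, fn_weight=2.0):
    """Return (lowest cost, k), where the top k ranked examples are
    predicted positive. Labels are 1 (positive) or 0 (negative)."""
    # Rank the examples by decreasing classifier score.
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda t: -t[0])]
    n_pos = sum(ranked)
    # k = 0: everything is predicted negative, so every positive is missed.
    best = (cost(0, n_pos, fp_weight, fn_weight), 0)
    tp = 0
    for k, lab in enumerate(ranked, start=1):
        tp += lab
        fp = k - tp          # negatives ranked above the cut
        fn = n_pos - tp      # positives ranked below the cut
        c = cost(fp, fn, fp_weight, fn_weight)
        if c < best[0]:
            best = (c, k)
    return best


labels = [1, 1, 0, 0]
print(best_threshold_cost([0.9, 0.8, 0.3, 0.1], labels))  # (0.0, 2)
print(best_threshold_cost([0.9, 0.4, 0.8, 0.1], labels))  # (1.0, 3)
```

The second call shows the point made above: once a negative example is ranked above a positive one, no choice of cut point brings the cost to zero.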
As expected, the results also show the inability of these classifiers to learn to recognize the class of genes that produce helix-turn-helix (HTH) proteins. Since helix-turn-helix proteins are not expected to share similar expression profiles, we do not expect any classifier to be capable of learning to recognize this class from gene expression data. Most methods uniformly classify all test set sequences as non-HTHs. The unlearnability of this class is also apparent from a receiver operating characteristic (ROC) analysis of the classification results. Figure 4 shows two ROC curves, which plot the rate of true positives as a function of the rate of false positives as the classification threshold is varied. For a learnable class, such as the genes participating in the tricarboxylic-acid pathway, the positive class members cluster close together with respect to the classification threshold. For the HTHs, by contrast, the classification threshold must be varied widely in order to classify all class members as positives. Since the positive class members are essentially random with respect to the classification threshold, the ROC curve shows clearly that this gene class is unlearnable and hence unlikely to be co-regulated.
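As a rough illustration of how such a curve is traced, the sketch below sweeps the threshold down a ranked list and records the false positive and true positive rates at each step. It assumes all scores are distinct and uses made-up scores; the area under the curve is added only as a convenient summary and is not a statistic reported in the text.

```python
def roc_points(scores, labels):
    """Return the ROC curve as a list of (false positive rate,
    true positive rate) pairs. Labels are 1 (positive) or 0 (negative).
    Assumes distinct scores; ties are not treated specially."""
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda t: -t[0])]
    n_pos = sum(ranked)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lowering the threshold past each example adds it to the predicted positives.
    for lab in ranked:
        if lab:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) pairs sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Positives clustered at the top of the ranking (a learnable class).
print(area_under_curve(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])))  # 1.0
# Positives interleaved with negatives.
print(area_under_curve(roc_points([0.9, 0.4, 0.8, 0.1], [1, 1, 0, 0])))  # 0.75
```

When positives are scattered at random through the ranking, the curve stays near the diagonal and the area approaches 0.5.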
In addition to demonstrating the superior performance of SVMs relative to non-SVM methods, the results in Tables 2 and 3 indicate that the radial basis SVM performs better than SVMs that use a scaled dot product kernel. In order to verify this difference in performance, we repeated the three-fold cross-validation experiment four more times, using four different random splits of the data. Table 4 summarizes the cost for each SVM on each of the five random splits. The total cost in all five experiments is reported in the final column of the table. The radial basis SVM performs better than the scaled dot product SVMs for all classes except the histones, for which all four methods perform identically. Again, this is not conclusive evidence that the radial basis SVM is superior to the other methods, but it is suggestive.
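The repeated experiment can be sketched as follows. This is only an outline under stated assumptions: the split is a plain random shuffle (the original experiments may have balanced the classes across folds), and `three_fold_splits`, the seeds, and the gene identifiers are all hypothetical.

```python
import random


def three_fold_splits(items, seed):
    """Yield (train, test) pairs for one three-fold cross-validation run.
    Each item lands in the test set of exactly one fold."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Deal the shuffled items into three folds of nearly equal size.
    folds = [shuffled[i::3] for i in range(3)]
    for i in range(3):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


genes = [f"gene{i}" for i in range(10)]  # placeholder identifiers
# Five experiments, each with its own random split of the data.
for seed in range(5):
    sizes = [(len(train), len(test)) for train, test in three_fold_splits(genes, seed)]
    print(seed, sizes)
```

Summing a classifier's cost over the three test folds gives its cost for one experiment, and summing over the five seeds gives a total of the kind reported in the final column of Table 4.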
Besides providing improved support for the claim that the radial basis SVM outperforms the scaled dot product SVMs, repeating the three-fold cross-validation experiment also provides insight into the consistency with which the SVM makes mistakes. A classification error may occur because the MYGD classification actually contains an error; on the other hand, some classification errors may arise simply because the gene is a borderline case, and may or may not appear as an error, depending on how the data is randomly split into thirds. Table 5 summarizes the number of errors that occur consistently throughout the five different experiments. The second column lists the number of genes that a radial basis SVM misclassifies only once in the five experiments. The right-most column lists the number of genes that are consistently misclassified in all five experiments. These latter genes are of much more interest, since their misclassification cannot be attributed to an unlucky split of the data.
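The tally described above can be sketched in a few lines. The input format (one set of misclassified gene names per experiment) and the gene names are assumptions made for illustration; they are not the actual errors summarized in Table 5.

```python
from collections import Counter


def error_consistency(errors_per_experiment):
    """Map k to the number of genes misclassified in exactly k experiments.
    errors_per_experiment: one set of misclassified gene names per experiment."""
    per_gene = Counter(g for errors in errors_per_experiment for g in errors)
    return Counter(per_gene.values())


def consistently_misclassified(errors_per_experiment):
    """Genes misclassified in every experiment, in sorted order."""
    return sorted(set.intersection(*map(set, errors_per_experiment)))


# Hypothetical errors from five experiments.
runs = [{"geneA", "geneB"}, {"geneA"}, {"geneA", "geneC"}, {"geneA"}, {"geneA"}]
print(dict(error_consistency(runs)))       # {5: 1, 1: 2}
print(consistently_misclassified(runs))    # ['geneA']
```

Genes that appear only under the key 1 are the likely borderline cases, while those misclassified in all five runs are the ones worth examining individually.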
Table 6 lists the 25 genes referred to in the final column of Table 5. These are genes for which the radial basis support vector machine consistently disagrees with the MYGD classification. Many of these disagreements reflect the different perspective provided by the expression data concerning the relationships between genes. The microarray expression data represents the genetic response of the cell to various environmental perturbations, and the SVM classifies genes based on how similar their expression pattern is to that of genes of known function. The MYGD definitions of functional classes have been arrived at through biochemical experiments that classify gene products by what they do, not how they are regulated. These different perspectives sometimes lead to different functional classifications. For example, in MYGD the members of a complex are defined as what copurifies with the complex, whereas in the expression data a complex is defined by what genes need to be transcribed for proper functioning of the complex. Differences of this kind lead to disagreements between the SVM and MYGD in the form of false positives. Disagreements between the SVM and MYGD in the form of false negatives occur for a number of reasons. First, genes that are classified in MYGD primarily by structure (e.g., protein kinases) may not be similarly classified by the SVM. Second, genes that are regulated at the translational level or protein level, rather than at the transcriptional level measured by the microarray experiments, cannot be correctly classified by expression data alone. Third, genes for which the microarray data is corrupt cannot be correctly classified. Disagreements represent the cases where the different perspectives of the SVM and MYGD lead to different functional classifications and illustrate the new information that expression data brings to biology.