Next: False positives Up: Support Vector Machine Classification Previous: Fisher's linear discriminant

Results and discussion

Our experiments show the benefits of classifying genes using support vector machines trained on DNA microarray expression data. We begin with a comparison of SVMs versus four non-SVM methods and show that SVMs provide superior performance. We then examine more closely the performance of several different SVMs and demonstrate the superiority of the radial basis function SVM. Finally, we examine in detail some of the apparent errors made by the radial basis function SVM and show that many of the apparent errors are in fact biologically reasonable classifications. Most of the results reported here can be accessed via the web at http://www.cse.ucsc.edu/research/compbio.


 
Table 2: Comparison of error rates for various classification methods. Classes are as described in Table 1. The methods are the radial basis function SVM, the SVMs using the scaled dot product kernel raised to the first, second and third power, Parzen windows, Fisher's linear discriminant (FLD), and the two decision tree learners, C4.5 and MOC1. The next five columns are the numbers of false positives (FP), false negatives (FN), true positives (TP) and true negatives (TN), summed over three cross-validation splits, followed by the cost, which is the number of false positives plus twice the number of false negatives. These five columns appear twice, first using the threshold learned from the training set, and then using the threshold that minimizes the cost on the test set. Threshold optimization is not possible for the decision tree methods, since they do not produce ranked results.
                                       Learned threshold          Optimized threshold
Class               Method              FP  FN   TP    TN  Cost    FP  FN   TP    TN  Cost
Tricarboxylic acid  Radial SVM           8   8    9  2442    24     4   7   10  2446    18
                    Dot-product-1 SVM   11   9    8  2439    29     3   6   11  2447    15
                    Dot-product-2 SVM    5  10    7  2445    25     5   6   11  2446    17
                    Dot-product-3 SVM    4  12    5  2446    28     4   6   11  2446    16
                    Parzen               4  12    5  2446    28     0  12    5  2450    24
                    FLD                  9  10    7  2441    29     7   8    9  2443    23
                    C4.5                 7  17    0  2443    41     -   -    -     -     -
                    MOC1                 3  16    1  2446    35     -   -    -     -     -
Respiration         Radial SVM           9   6   24  2428    21     8   4   26  2429    16
                    Dot-product-1 SVM   21  10   20  2416    41     6   9   21  2431    24
                    Dot-product-2 SVM    7  14   16  2430    35     7   6   24  2430    19
                    Dot-product-3 SVM    3  15   15  2434    33     7   6   24  2430    19
                    Parzen              22  10   20  2415    42     7  12   18  2430    31
                    FLD                 10  10   20  2427    30    14   4   26  2423    22
                    C4.5                18  17   13  2419    52     -   -    -     -     -
                    MOC1                12  26    4  2425    64     -   -    -     -     -
Ribosome            Radial SVM           9   4  117  2337    17     6   1  120  2340     8
                    Dot-product-1 SVM   13   6  115  2333    25    11   1  120  2335    13
                    Dot-product-2 SVM    7  10  111  2339    27     9   1  120  2337    11
                    Dot-product-3 SVM    3  18  103  2343    39     7   1  120  2339     9
                    Parzen               6   8  113  2340    22     5   8  113  2341    21
                    FLD                 15   5  116  2331    25     8   3  118  2338    14
                    C4.5                31  21  100  2315    73     -   -    -     -     -
                    MOC1                26  26   95  2320    78     -   -    -     -     -


 
Table 3: Comparison of error rates for various classification methods (continued). See caption for Table 2.
                                       Learned threshold          Optimized threshold
Class               Method              FP  FN   TP    TN  Cost    FP  FN   TP    TN  Cost
Proteasome          Radial SVM           3   7   28  2429    17     4   5   30  2428    14
                    Dot-product-1 SVM   14  11   24  2418    36     2   7   28  2430    16
                    Dot-product-2 SVM    4  13   22  2428    30     4   6   29  2428    16
                    Dot-product-3 SVM    3  18   17  2429    39     2   7   28  2430    16
                    Parzen              21   5   30  2411    31     3   9   26  2429    21
                    FLD                  7  12   23  2425    31    12   7   28  2420    26
                    C4.5                17  10   25  2415    37     -   -    -     -     -
                    MOC1                10  17   18  2422    44     -   -    -     -     -
Histone             Radial SVM           0   2    9  2456     4     0   2    9  2456     4
                    Dot-product-1 SVM    0   4    7  2456     8     0   2    9  2456     4
                    Dot-product-2 SVM    0   5    6  2456    10     0   2    9  2456     4
                    Dot-product-3 SVM    0   8    3  2456    16     0   2    9  2456     4
                    Parzen               2   3    8  2454     8     1   3    8  2455     7
                    FLD                  0   3    8  2456     6     2   1   10  2454     4
                    C4.5                 2   2    9  2454     6     -   -    -     -     -
                    MOC1                 2   5    6  2454    12     -   -    -     -     -
Helix-turn-helix    Radial SVM           1  16    0  2450    33     0  16    0  2451    32
                    Dot-product-1 SVM   20  16    0  2431    52     0  16    0  2451    32
                    Dot-product-2 SVM    4  16    0  2447    36     0  16    0  2451    32
                    Dot-product-3 SVM    1  16    0  2450    33     0  16    0  2451    32
                    Parzen              14  16    0  2437    46     0  16    0  2451    32
                    FLD                 14  16    0  2437    46     0  16    0  2451    32
                    C4.5                 2  16    0  2449    34     -   -    -     -     -
                    MOC1                 6  16    0  2445    38     -   -    -     -     -

For the data analyzed here, support vector machines provide better classification performance than the competing classifiers. Tables 2 and 3 summarize the results of a three-fold cross-validation experiment using all eight of the classifiers described in Section 5: four SVM variants, Parzen windows, Fisher's linear discriminant and two decision tree learners. The five columns labeled "Learned threshold" summarize classification performance when the method must produce a binary classification label for each member of the test set. Overall performance of each method is judged using the cost function $fp + (2 \cdot fn)$, where $fp$ and $fn$ are the numbers of false positives and false negatives. For every class except the last, unlearnable one, the best-performing method using the learned threshold is the radial basis support vector machine. Other cost functions, with different relative weights of the false positive and false negative rates, yield similar rankings of performance. These results are not statistically sufficient to demonstrate unequivocally that one method is better than another; however, they do give some evidence. For example, in five separate tests (one per learnable class), the radial basis SVM performs better than Fisher's linear discriminant. Under the null hypothesis that the two methods are equally good, the probability that the radial basis SVM would be the better of the two all five times is $(1/2)^5 = 1/32 = 0.03125$.
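The cost function and the null-hypothesis probability above are simple enough to check by hand, and the following minimal Python sketch does so. The false positive and false negative counts are copied from the "Learned threshold" columns of Tables 2 and 3 for the five learnable classes; the function names `cost` and `prob_wins_all` are illustrative and not part of the original experiments.

```python
def cost(fp, fn, fn_weight=2):
    """Cost used in Tables 2 and 3: false positives plus twice the false negatives."""
    return fp + fn_weight * fn


def prob_wins_all(n_tests, p_win=0.5):
    """Probability that one of two equally good methods wins every one of n_tests."""
    return p_win ** n_tests


# (FP, FN) at the learned threshold for the five learnable classes, in table order:
# tricarboxylic acid, respiration, ribosome, proteasome, histone.
radial = [(8, 8), (9, 6), (9, 4), (3, 7), (0, 2)]
fld = [(9, 10), (10, 10), (15, 5), (7, 12), (0, 3)]

radial_costs = [cost(fp, fn) for fp, fn in radial]  # [24, 21, 17, 17, 4]
fld_costs = [cost(fp, fn) for fp, fn in fld]        # [29, 30, 25, 31, 6]

# Number of classes in which the radial basis SVM has the lower cost.
wins = sum(r < f for r, f in zip(radial_costs, fld_costs))

print(wins, prob_wins_all(5))  # 5 0.03125
```

Changing `fn_weight` gives a quick way to see how the comparison behaves under other relative weightings of false positives and false negatives.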

In addition to producing binary classification labels, six of the eight methods produce a ranked list of the test set examples. This ranked list provides more information than the simple binary classification labels. For example, scanning the ranked lists allows the experimenter to easily focus on the genes that lie on the border of the given class. Ranked lists produced by the radial basis SVM for each of the five classes are available at http://www.cse.ucsc.edu/research/compbio/genex. A perfect classifier will place all positive test set examples before the negative examples in the ranked list and will correctly specify the decision boundary to lie between the positives and the negatives. An imperfect classifier, on the other hand, will either produce an incorrect ordering of the test set examples or use an inaccurate classification threshold. Thus, the performance can be improved by fixing either the ranking or the threshold. However, given an improper ranking, no classification threshold can yield perfect performance. Therefore, we focus on finding a correct ranking of the test set. The columns labeled "Optimized threshold" in Tables 2 and 3 show the best performance that could be achieved if the classifier were capable of learning the decision threshold perfectly. These results further demonstrate the superior performance of the radial basis SVM: it performs best in four of the five learnable classes. Furthermore, the performance of the scaled dot product SVMs improves so that in nearly every class, the best four classifiers are the four SVM methods.
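The "Optimized threshold" idea can be illustrated with a short sketch: given a ranked list, scan every possible cut point and keep the one with the lowest cost. This is a minimal illustration under stated assumptions, not the procedure used to produce the tables. The scores and labels below are made up, the function name `best_threshold_cost` is hypothetical, and ties between scores are ignored.

```python
def best_threshold_cost(scores, labels, fn_weight=2):
    """Return (min_cost, k): the lowest cost over all cut points of the ranked
    list, where the top k examples are called positive.

    scores: classifier outputs, higher means more likely to be in the class.
    labels: 1 for a true class member, 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fp = 0
    fn = sum(labels)                 # with k = 0, every positive is missed
    best = (fp + fn_weight * fn, 0)
    for k, (_, label) in enumerate(ranked, start=1):
        if label:
            fn -= 1                  # one more positive recovered
        else:
            fp += 1                  # one more negative wrongly included
        current = fp + fn_weight * fn
        if current < best[0]:
            best = (current, k)
    return best


# Hypothetical imperfect ranking: one negative outranks one positive.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(best_threshold_cost(scores, labels))  # (1, 4)
```

With a perfect ranking the minimum cost is zero. Here the misordered pair leaves a cost of 1 at the best cut, which is the sense in which no threshold can repair an improper ranking.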


  
Figure 4: Receiver operating characteristic curves for a learnable and non-learnable class. Each curve plots the rate of true positives as a function of the rate of false positives for varying classification thresholds. Both curves were generated by training a radial basis SVM on two-thirds of the data and testing on the remaining one-third.
[Figure 4 image not reproduced: two ROC panels, including (b) Helix-turn-helix.]

As expected, the results also show the inability of these classifiers to learn to recognize the class of genes that produce helix-turn-helix (HTH) proteins. Since helix-turn-helix proteins are not expected to share similar expression profiles, we do not expect any classifier to be capable of learning to recognize this class from gene expression data. Most methods uniformly classify all test set sequences as non-HTHs. The unlearnability of this class is also apparent from a receiver operating characteristic (ROC) analysis of the classification results. Figure 4 shows two ROC curves, which plot the rate of true positives as a function of the rate of false positives as the classification threshold is varied. For a learnable class, such as the genes participating in the tricarboxylic-acid pathway, the false positive sequences cluster close together with respect to the classification threshold. For the HTHs, by contrast, the classification threshold must be varied widely in order to classify all class members as positives. Since the positive class members are essentially random with respect to the classification threshold, the ROC curve shows clearly that this gene class is unlearnable and hence unlikely to be co-regulated.
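A ROC curve of the kind described above can be traced directly from a ranked list by lowering the threshold one example at a time. The sketch below is illustrative only: the scores and labels are invented, the function names are hypothetical, and it assumes at least one positive and one negative example with no tied scores.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one for each
    position of the threshold as it sweeps down the ranked list."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def fpr_at_full_recall(points):
    """False positive rate at the first threshold that captures every positive."""
    return next(fpr for fpr, tpr in points if tpr == 1.0)


# Hypothetical learnable class: positives sit near the top of the ranking.
learnable = roc_points([0.9, 0.8, 0.7, 0.4, 0.3, 0.1], [1, 1, 0, 1, 0, 0])
# Hypothetical unlearnable class: positives are scattered through the ranking.
scattered = roc_points([0.9, 0.8, 0.7, 0.4, 0.3, 0.1], [0, 1, 0, 0, 1, 0])

print(fpr_at_full_recall(learnable), fpr_at_full_recall(scattered))
```

The second value is larger than the first, which mirrors the observation that the threshold must be varied widely before all members of an unlearnable class are classified as positives.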


 
Table 4: Comparison of SVM performance using various kernels. For each of the MYGD classifications, SVMs were trained using four different kernel functions on five different random three-fold splits of the data, training on two-thirds and testing on the remaining third. The first column contains the class, as described in Table 1. The second column contains the kernel function, as described in Table 2. The next five columns contain the threshold-optimized cost (i.e., the number of false positives plus twice the number of false negatives) for each of the five random three-fold splits. The final column is the total cost across all five splits.
Class               Kernel          Cost for each split  Total
Tricarboxylic acid  Radial           18  21  15  22  21     97
                    Dot-product-1    15  22  18  23  22    100
                    Dot-product-2    16  22  17  22  22     99
                    Dot-product-3    16  22  17  23  22    100
Respiration         Radial           16  18  23  20  16     93
                    Dot-product-1    24  24  29  27  23    127
                    Dot-product-2    19  19  26  24  23    111
                    Dot-product-3    19  19  26  22  21    107
Ribosome            Radial            8  12  15  11  13     59
                    Dot-product-1    13  18  14  16  16     77
                    Dot-product-2    11  16  14  16  15     72
                    Dot-product-3     9  15  11  15  15     65
Proteasome          Radial           14  10   9  11  11     55
                    Dot-product-1    16  12  12  17  19     76
                    Dot-product-2    16  13  15  17  17     78
                    Dot-product-3    16  13  16  16  17     79
Histone             Radial            4   4   4   4   4     20
                    Dot-product-1     4   4   4   4   4     20
                    Dot-product-2     4   4   4   4   4     20
                    Dot-product-3     4   4   4   4   4     20

In addition to demonstrating the superior performance of SVMs relative to non-SVM methods, the results in Tables 2 and 3 indicate that the radial basis SVM performs better than SVMs that use a scaled dot product kernel. In order to verify this difference in performance, we repeated the three-fold cross-validation experiment four more times, using four different random splits of the data. Table 4 summarizes the cost for each SVM on each of the five random splits. The total cost in all five experiments is reported in the final column of the table. The radial basis SVM performs better than the scaled dot product SVMs for all classes except the histones, for which all four methods perform identically. Again, this is not conclusive evidence that the radial basis SVM is superior to the other methods, but it is suggestive.
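Repeating the experiment on several random three-fold splits can be sketched as follows. This is an assumption-laden illustration rather than the authors' procedure: the text does not say how the splits were drawn (for example, whether they were stratified by class), so the sketch simply shuffles the gene list with a different seed for each repetition. The gene count of 2467 is the sum of the four confusion-matrix counts in any row of Table 2; the integer gene identifiers and the function name are hypothetical.

```python
import random


def three_fold_splits(items, seed):
    """Yield (train, test) pairs for one random three-fold split.

    Each item lands in exactly one test fold; the other two folds form the
    corresponding training set (two-thirds train, one-third test).
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    folds = [shuffled[i::3] for i in range(3)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


genes = list(range(2467))  # placeholder identifiers, one per gene

# Five repetitions, each with its own random split of the data.
for seed in range(5):
    for train, test in three_fold_splits(genes, seed):
        # A classifier would be trained on `train` and scored on `test` here.
        assert len(train) + len(test) == len(genes)
```

Fixing the seed per repetition keeps each split reproducible, so that the same genes can be compared across kernels within a split.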


 
Table 5: Consistency of errors across five different random splits of the data. For each of the MYGD classifications listed in the first column, radial basis SVMs were trained on five different random three-fold splits of the data, training on two-thirds and testing on the remaining third. An entry in column n of the table represents the total number of genes misclassified with respect to the MYGD classification in n of the five random splits. Thus, for example, eight genes were mislabeled in all splits by the SVMs trained on genes from the tricarboxylic-acid pathway.
                              Number of splits
Class                          1   2   3   4   5
Tricarboxylic-acid pathway     7   2   2   1   8
Respiration chain complexes    9   1   2   4   6
Cytoplasmic ribosomes          5   2   3   2   4
Proteasome                     6   0   1   0   5
Histones                       0   0   0   0   2

Besides providing improved support for the claim that the radial basis SVM outperforms the scaled dot product SVMs, repeating the three-fold cross-validation experiment also provides insight into the consistency with which the SVM makes mistakes. A classification error may occur because the MYGD classification actually contains an error; on the other hand, some classification errors may arise simply because the gene is a borderline case, and may or may not appear as an error, depending on how the data is randomly split into thirds. Table 5 summarizes the number of errors that occur consistently throughout the five different experiments. The second column lists the number of genes that a radial basis SVM misclassifies only once in the five experiments. The right-most column lists the number of genes that are consistently misclassified in all five experiments. These latter genes are of much more interest, since their misclassification cannot be attributed to an unlucky split of the data.
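The tally in Table 5 amounts to counting, for each gene, the number of splits in which it was misclassified, and then counting how many genes fall at each value. The sketch below shows that bookkeeping on made-up data. The per-split error sets and the gene labels `a`, `b`, `c` are hypothetical, since the per-split errors are not listed here; only the final check uses numbers taken from Table 5.

```python
from collections import Counter


def consistency_histogram(errors_per_split):
    """Return a list whose entry n-1 is the number of genes misclassified
    in exactly n of the splits (n = 1 .. number of splits)."""
    per_gene = Counter(
        gene for errors in errors_per_split for gene in set(errors)
    )
    hist = Counter(per_gene.values())
    return [hist.get(n, 0) for n in range(1, len(errors_per_split) + 1)]


# Hypothetical misclassified genes in each of five random splits.
errors_per_split = [{"a", "b"}, {"a"}, {"a", "c"}, {"a", "b"}, {"a"}]
print(consistency_histogram(errors_per_split))  # [1, 1, 0, 0, 1]

# Right-most column of Table 5: genes misclassified in all five splits.
always_wrong = [8, 6, 4, 5, 2]
print(sum(always_wrong))  # 25
```

In the toy data, gene `a` is wrong in every split and so lands in the last column, while `c` is wrong once and is the kind of borderline case that an unlucky split could explain.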


 
Table 6: Consistently misclassified genes. The table lists all 25 genes that are consistently misclassified by SVMs trained using the MYGD classifications listed in Table 1. Two types of errors are included: a false positive (FP) occurs when the SVM includes the gene in the given class but the MYGD classification does not; a false negative (FN) occurs when the SVM does not include the gene in the given class but the MYGD classification does.
Family  Gene     Locus   Error  Description
TCA     YPR001W  CIT3    FN     mitochondrial citrate synthase
        YOR142W  LSC1    FN     $\alpha$ subunit of succinyl-CoA ligase
        YNR001C  CIT1    FN     mitochondrial citrate synthase
        YLR174W  IDP2    FN     isocitrate dehydrogenase
        YIL125W  KGD1    FN     $\alpha$-ketoglutarate dehydrogenase
        YDR148C  KGD2    FN     component of $\alpha$-ketoglutarate dehydrogenase complex in mitochondria
        YDL066W  IDP1    FN     mitochondrial form of isocitrate dehydrogenase
        YBL015W  ACH1    FP     acetyl CoA hydrolase
Resp    YPR191W  QCR2    FN     ubiquinol cytochrome-c reductase core protein 2
        YPL271W  ATP15   FN     ATP synthase epsilon subunit
        YPL262W  FUM1    FP     fumarase
        YML120C  NDI1    FP     mitochondrial NADH ubiquinone 6 oxidoreductase
        YKL085W  MDH1    FP     mitochondrial malate dehydrogenase
        YDL067C  COX9    FN     subunit VIIa of cytochrome c oxidase
Ribo    YPL037C  EGD1    FP     $\beta$ subunit of the nascent-polypeptide-associated complex (NAC)
        YLR406C  RPL31B  FN     ribosomal protein L31B (L34B) (YL28)
        YLR075W  RPL10   FP     ribosomal protein L10
        YAL003W  EFB1    FP     translation elongation factor EF-1$\beta$
Prot    YHR027C  RPN1    FN     subunit of 26S proteasome (PA700 subunit)
        YGR270W  YTA7    FN     member of CDC48/PAS1/SEC18 family of ATPases
        YGR048W  UFD1    FP     ubiquitin fusion degradation protein
        YDR069C  DOA4    FN     ubiquitin isopeptidase
        YDL020C  RPN4    FN     involved in ubiquitin degradation pathway
Hist    YOL012C  HTA3    FN     histone-related protein
        YKL049C  CSE4    FN     required for proper kinetochore function


  
Figure 5: Similarity between the average expression profiles of the tricarboxylic-acid pathway and respiration chain complexes. Each series represents the average log expression ratio for all genes in the given family plotted as a function of DNA microarray experiment. Ticks along the X-axis represent the beginnings of experimental series, as described in Figure 1.

Table 6 lists the 25 genes referred to in the final column of Table 5. These are genes for which the radial basis support vector machine consistently disagrees with the MYGD classification. Many of these disagreements reflect the different perspective provided by the expression data concerning the relationships between genes. The microarray expression data represents the genetic response of the cell to various environmental perturbations, and the SVM classifies genes based on how similar their expression pattern is to that of genes of known function. The MYGD definitions of functional classes, by contrast, have been arrived at through biochemical experiments that classify gene products by what they do, not by how they are regulated. These different perspectives sometimes lead to different functional classifications. For example, in MYGD the members of a complex are defined by what copurifies with the complex, whereas in the expression data a complex is defined by which genes need to be transcribed for proper functioning of the complex. Differences of this kind lead to disagreements between the SVM and MYGD in the form of false positives. Disagreements in the form of false negatives occur for a number of reasons. First, genes that are classified in MYGD primarily by structure (e.g., protein kinases) may not be similarly classified by the SVM. Second, genes that are regulated at the translational level or protein level, rather than at the transcriptional level measured by the microarray experiments, cannot be correctly classified by expression data alone. Third, genes for which the microarray data is corrupt cannot be correctly classified. Taken together, the disagreements represent the cases where the different perspectives of the SVM and MYGD lead to different functional classifications, and they illustrate the new information that expression data brings to biology.



 
Michael Brown
1999-11-05