
False negatives


  
Figure 7: Expression profiles of two false negative genes for the proteasome class. Each figure shows the expression profile for a single gene, along with standard deviation bars for the proteasome class. Ticks along the X-axis represent the beginnings of experimental series, as described in Figure 1.
[Figure 7: two panels, showing the expression profiles of YDL020C and YDR069C]

Some of the false negatives produced by the support vector machine occur when a protein that was assigned to a functional class in MYGD on the basis of structural similarity has a special function that demands a different regulation strategy. For example, YKL049C is classified as a histone protein by MYGD based on its 61% amino acid similarity with histone protein H3. YKL049C is thought to act as part of the centromere [Stoler et al., 1995], and although it is related to histones, the expression data show that it is not coregulated with other histone genes. Therefore, the SVM does not assign YKL049C to the histone class. A similar situation arises in the proteasome class. Both YDL020C and YDR069C are physically associated with the proteasome [Fujimuro et al., 1998, Papa et al., 1999]. However, these proteins are not intrinsic subunits of the proteasome, but are loosely associated auxiliary factors [Glickman et al., 1998, Papa et al., 1999]. The SVM does not classify them as belonging to the proteasome because they are regulated differently from the rest of the proteasome during sporulation, as shown in Figure 7.

One limitation inherent in the use of gene expression data to identify genes that function together is that some genes are regulated primarily at the translational and protein levels. For example, six of the seven cases in which the SVM was unable to assign members of the TCA class are genes encoding citrate synthase, isocitrate dehydrogenase, or $\alpha$-ketoglutarate dehydrogenase. The enzymatic activities of these proteins are known to be regulated allosterically by ADP/ATP, succinyl-CoA, and NAD+/NADPH [Garrett and Grisham, 1995, pp. 619-622]. Because these enzymes are regulated primarily by means that do not involve changes in mRNA level, the SVM cannot classify them correctly from expression data alone.

Other discrepancies appear to be caused by corrupt data. For example, the SVM classifies YLR075W as a cytoplasmic ribosomal protein, but MYGD does not. YLR075W is in fact a ribosomal protein [Wool et al., 1995, Dick et al., 1997]. The similarity between the YLR075W expression profile and the profile of the cytoplasmic ribosomal proteins is evident in Figure 6(b). This discrepancy is an oversight in MYGD, which has since been corrected [Mannhaupt, 1999]. Other errors occur in the expression data itself. Occasionally, the microarrays contain bad probes or are damaged, and some locations in the gene expression matrix are marked as containing corrupt data. Three of the genes listed in Table 6 (YDL067C, YOR142W and YHR027C) are marked as such. In addition, although the SVM correctly assigns YDL075W to the ribosomal protein class, YLR406C, essentially a duplicate copy of YDL075W, is not assigned to that class. The microarrays are not sensitive enough to differentiate between two such similar genes; therefore, it is likely that the YLR406C data are also questionable. The profile for this gene is shown in Figure 6(d).

No immediate explanation is available for the discrepancies involving the remaining six genes. These genes include one false positive TCA (YBL015W), two false negative respiration chain complexes (YPR191W and YPL271W), a false negative proteasome (YGR270W), and a false negative histone (YOL012C). Further experiments would be required to determine whether these misclassifications are artifacts or are clues to the genuine biological role of these proteins.


 
Table 7: The magnitude of the training set weights predicts outliers. The average weight of each gene was computed across five three-fold cross-validation tests of the radial basis SVM trained on the cytoplasmic ribosomes, and the genes were ranked accordingly. The table shows the ten negative examples with the largest weights, their average weights, and the total number of times (out of five) that each gene was misclassified.

Gene      Weight   Errors
YLR075W   2.093    5
YOR276W   1.016    4
YNL209W   0.977    4
YAL003W   0.930    5
YPL037C   0.833    5
YKR059W   0.815    2
YML106W   0.791    1
YDR385W   0.771    2
YPR187W   0.767    1
YJL138C   0.757    3
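
A summary of the kind shown in Table 7 could be assembled along the following lines. This is only a sketch under stated assumptions: the function name `summarize_runs`, the input layout (one dictionary per cross-validation test, mapping each gene to its weight in that test and a flag saying whether it was misclassified), and the placeholder gene names and numbers are all hypothetical and are not taken from the experiments reported here.

```python
from collections import defaultdict


def summarize_runs(runs):
    """Average each gene's weight over runs and count its errors.

    `runs` is a list of dicts mapping gene -> (weight, misclassified).
    This layout is an assumption made for illustration only.
    Returns (gene, average weight, error count) rows, largest weight first.
    """
    weights = defaultdict(list)
    errors = defaultdict(int)
    for run in runs:
        for gene, (weight, wrong) in run.items():
            weights[gene].append(weight)
            errors[gene] += int(wrong)
    rows = [(gene, sum(w) / len(w), errors[gene]) for gene, w in weights.items()]
    # Rank genes by average weight, as in Table 7.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up values for two hypothetical cross-validation tests.
runs = [
    {"gene_A": (2.0, True), "gene_B": (0.5, False), "gene_C": (0.0, False)},
    {"gene_A": (3.0, True), "gene_B": (1.5, True), "gene_C": (0.5, False)},
]
table = summarize_runs(runs)
for gene, weight, n_errors in table:
    print(f"{gene:8s} {weight:6.3f} {n_errors}")
```

Restricting the rows to negative training examples and keeping the first ten would give a listing with the same shape as Table 7.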

The misclassified genes described in Table 6 were found by classifying the data using trained SVMs and identifying errors. However, many of these outlier genes could have been identified during the training phase. Genes that are misclassified in the training set are likely to be outliers with respect to their labeled class. Consequently, these genes will violate the soft margin of the SVM and will hence receive large weights ($\alpha_i$ in the formulation of Section 5.4). Table 7 shows the ten largest average weights for negative training set examples from the cytoplasmic ribosome class. As expected, these examples are the ones most often misclassified by the trained SVMs. The information in Table 7 could have been used to perform data cleaning, automatically removing inaccurate classifications from the training set [Guyon et al., 1996]. Such a procedure would have removed from the training data the mislabeled gene YLR075W.
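
The link between mislabeled examples and large weights can be illustrated on toy data. The sketch below is not the formulation of Section 5.4: it assumes a simplified, bias-free dual with a 2-norm soft margin (a constant `lam` added to the kernel diagonal), trains it by plain coordinate ascent, and uses one-dimensional points in place of expression vectors. All names, data values, and parameters are hypothetical.

```python
import math


def rbf(x, z, gamma=1.0):
    """Radial basis kernel on scalars (a stand-in for expression vectors)."""
    return math.exp(-gamma * (x - z) ** 2)


def train_weights(xs, ys, lam=1.0, sweeps=200):
    """Coordinate ascent on a bias-free SVM dual with a 2-norm soft margin.

    The soft margin is imposed by adding `lam` to the kernel diagonal
    (an assumed simplification). Returns one weight (alpha) per example.
    """
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alpha = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Signed margin y_i * f(x_i) under the current weights.
            margin = ys[i] * sum(alpha[j] * ys[j] * K[i][j] for j in range(n))
            # Exact one-dimensional maximizer of the dual, clipped at zero.
            alpha[i] = max(0.0, alpha[i] + (1.0 - margin) / K[i][i])
    return alpha


# Toy data: four positives near 0, three negatives near 3, and one point
# labeled negative that sits inside the positive cluster.
names = ["pos_a", "pos_b", "pos_c", "pos_d",
         "neg_a", "neg_b", "neg_c", "mislabeled"]
xs = [-0.2, 0.0, 0.2, 0.4, 3.0, 3.2, 3.4, 0.1]
ys = [1, 1, 1, 1, -1, -1, -1, -1]

alpha = train_weights(xs, ys)
ranked_negatives = sorted(
    ((name, a) for name, a, y in zip(names, alpha, ys) if y < 0),
    key=lambda pair: pair[1], reverse=True)
for name, a in ranked_negatives:
    print(f"{name:12s} {a:.3f}")
```

On this toy data the point labeled negative inside the positive cluster receives by far the largest weight among the negative examples, which is the pattern a data-cleaning step would look for.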


Michael Brown
1999-11-05