Some of the false negatives produced by the support vector machine occur when a protein that was assigned to a functional class in MYGD based on structural similarity has a special function that demands a different regulation strategy. For example, YKL049C is classified as a histone protein by MYGD based on its 61% amino acid similarity with histone protein H3. YKL049C is thought to act as part of the centromere [Stoler et al., 1995], and while it is related to histones, the expression data show that it is not coregulated with other histone genes. Therefore, the SVM does not assign YKL049C to the histone class. A similar situation arises in the proteasome class. Both YDL020C and YDR069C are physically associated with the proteasome [Fujimuro et al., 1998, Papa et al., 1999]. However, these proteins are not intrinsic subunits of the proteasome, but are loosely associated auxiliary factors [Glickman et al., 1998, Papa et al., 1999]. The SVM does not classify them as belonging to the proteasome because they are regulated differently from the rest of the proteasome during sporulation, as shown in Figure 7.
One limitation inherent in the use of gene expression data to identify
genes that function together is that some genes are regulated
primarily at the translational and protein levels. For example, six
of the seven cases in which the SVM was unable to assign members of
the TCA class are genes encoding citrate synthase, isocitrate
dehydrogenase, or α-ketoglutarate dehydrogenase. The enzymatic
activities of these proteins are known to be regulated allosterically
by ADP/ATP, succinyl-CoA, and NAD+/NADPH
[Garrett and Grisham, 1995, pp. 619-622]. These enzymes are
regulated primarily by means that do not involve changes in mRNA
level. Thus, the SVM will not be able to correctly classify them by
expression data alone.
Other discrepancies appear to be caused by corrupt data. For example, the SVM classifies YLR075W as a cytoplasmic ribosomal protein, but MYGD does not. YLR075W is in fact a ribosomal protein [Wool et al., 1995, Dick et al., 1997]. The similarity between the YLR075W expression profile and the profile of the cytoplasmic ribosomal proteins is evident in Figure 6(b). This discrepancy is an oversight in MYGD, which has since been corrected [Mannhaupt, 1999]. Other errors occur in the expression data itself. Occasionally, the microarrays contain bad probes or are damaged, and some locations in the gene expression matrix are marked as containing corrupt data. Three of the genes listed in Table 6 (YDL067C, YOR142W and YHR027C) are marked as such. In addition, although the SVM correctly assigns YDL075W to the ribosomal protein class, YLR406C, essentially a duplicate copy of YDL075W, is not assigned to that class. The microarrays are not sensitive enough to differentiate between two such similar genes; therefore, it is likely that the YLR406C data are also questionable. The profile for this gene is shown in Figure 6(d).
No immediate explanation is available for the discrepancies involving the remaining six genes. These genes include one false positive TCA (YBL015W), two false negative respiration chain complexes (YPR191W and YPL271W), a false negative proteasome (YGR270W), and a false negative histone (YOL012C). Further experiments would be required to determine whether these misclassifications are artifacts or are clues to the genuine biological role of these proteins.
The misclassified genes described in Table 6 were
found by classifying the data using trained SVMs and identifying
errors. However, many of these outlier genes could have been
identified during the training phase. Genes that are misclassified in
the training set are likely to be outliers with respect to their
labeled class. Consequently, these genes will violate the soft margin
of the SVM and will hence receive large weights (in the
formulation of Section 5.4).
Table 7 shows the ten largest average weights for
negative training set examples from the cytoplasmic ribosome class.
As expected, these examples are the ones most often misclassified by
the trained SVMs. The information in Table 7 could
have been used to perform data cleaning, automatically removing
inaccurate classifications from the training set
[Guyon et al., 1996]. Such a procedure would have removed from the
training data the mislabeled gene YLR075W.
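
The weight-based screening described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to produce Table 7: the function name `rank_suspect_negatives`, the dictionary-based input layout, and the toy gene names and weights are all hypothetical. It assumes each trained SVM yields one learned weight per training example, averages those weights over the runs in which a gene appeared, and lists the negative examples with the largest averages as candidates for relabeling.

```python
from collections import defaultdict


def rank_suspect_negatives(runs, labels, top_k=10):
    """Rank negative training examples by mean learned weight.

    runs   -- list of dicts, one per trained SVM, mapping gene -> weight
              (hypothetical layout; a gene may be absent from a run,
              for example when it was held out)
    labels -- dict mapping gene -> +1 (class member) or -1 (non-member)
    top_k  -- number of highest-weight negatives to return
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for weights in runs:
        for gene, weight in weights.items():
            # Only negatively labeled examples are screened here.
            if labels.get(gene) == -1:
                totals[gene] += weight
                counts[gene] += 1
    # Average over the runs in which each gene actually appeared.
    means = {gene: totals[gene] / counts[gene] for gene in totals}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Toy data with made-up gene names and weights, for illustration only.
toy_runs = [
    {"g1": 0.0, "g2": 2.0, "g3": 0.5},
    {"g1": 0.2, "g2": 1.0, "g3": 0.5, "p1": 3.0},
]
toy_labels = {"g1": -1, "g2": -1, "g3": -1, "p1": 1}
print(rank_suspect_negatives(toy_runs, toy_labels, top_k=2))
# → [('g2', 1.5), ('g3', 0.5)]
```

Genes at the top of such a ranking would still need to be checked by hand before removal, since a large weight can reflect a genuine boundary case as well as a labeling error.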