We have demonstrated that support vector machines can accurately classify genes into functional categories based upon expression data from DNA microarray hybridization experiments. Among the techniques that we examined, the SVM that uses a radial basis kernel function provides the best performance--better than Parzen windows, Fisher's linear discriminant, two decision tree classifiers, and three other SVMs that use a scaled dot product kernel. These results were generated in a supervised fashion, as opposed to the unsupervised clustering algorithms that have been previously considered [Eisen et al., 1998,Tamayo et al., 1999]. The supervised learning framework allows a researcher to start with a set of interesting genes and ask two questions: What other genes are related to my set, and Does my set contain genes that do not belong? Furthermore, the support vectors identified by the SVM effectively define the boundary of the training set of genes. This ability to focus on the few informative genes out of the vast landscape of uninformative genes is fundamental to making scientific insight.
The experiments reported here were performed using only expression data for genes that already have functional annotation. The expression data for the remaining S. cerevisiae genes are not currently available for all experimental conditions. If the data were available, the SVMs produced here would undoubtedly identify among the unannotated genes additional members of the five MYGD classes.
One significant benefit offered by SVMs is scalability. The number of support vectors selected by the SVM learning algorithm is usually small, even for very large training sets, and the resulting SVM is consequently an efficient classifier. In this work, training a radial basis SVM using two-thirds of the data set (1645 examples) takes an average of 89.5 CPU seconds on a DEC Alpha 4100 workstation running at 466 MHz. The resulting machine contains only 216 support vectors on average. Thus, classifying a new gene requires comparisons with only approximately 13.1% of the training set.
Scalability is essential because the amount of available gene expression data will soon increase dramatically. We will have larger training sets that include more genes and more detailed expression profiles. When hundreds, and soon thousands, of mRNA expressions measurements under different conditions become available for each gene, each measurement will still, by itself, give only partial and inconclusive information about any given functional classification of the gene. However, all these different mRNA measurements taken together may often provide enough information to classify the gene with very high confidence. This is much like the process whereby many observations of the same underlying signal plus independent noise can, via the central limit theorem, be reduced to one highly reliable observation of the underlying signal.
In addition to large quantities of mRNA expression data, SVMs are capable of using data about genes from other sources. Our current work uses only DNA microarray expression data, but similar SVMs could be constructed using other gene features, such as the presence of transcription factor binding sites in the promoter region or sequence features of the translated protein, e.g. as in [Jaakkola et al., 1999]. We have begun working with SVMs that classify using training vectors concatenated from multiple sources using the methods from [Jaakkola et al., 1999,Jaakkola and Haussler, 1998].
We have described some of the issues involved in selecting an appropriate kernel function. Although the simple radial basis kernel function provides excellent performance, a better SVM could be constructed that incorporates prior knowledge about the classification domain. One avenue for such research involves kernels that explicitly account for dependencies among elements in the expression profiles. Such dependencies would arise, for example, in a data set constructed from a series of microarray experiments with sufficiently small time steps between samples. In this context, we have experimented with a modified version of the dot product kernel that interposes an inverted covariance matrix betwen the two expression vectors. This kernel yields performance that is intermediate between the dot product kernel and the radial basis kernel; however, computing and inverting the covariance matrix is computationally expensive.
Any supervised learning algorithm requires a gold standard classification that will function as the teacher signal. Here, we have used MYGD classifications as our gold standard. The MYGD classifications are undoubtedly incomplete and may be biased, but they are the best classifications available, given the currently limited knowledge about functional classes and how those classes are reflected in microarray expression data. We have shown that SVMs can learn to recognize functional classes even from a noisy teacher signal provided by the MYGD classifications. Furthermore, the magnitudes of the optimized weights, as well as the discriminant values of the training set, provide accurate indicators of outliers in the training set that are likely to have been misclassified by the teacher signal. The ability to identify outliers suggests a bootstrapping approach, in which an initially noisy gold standard classification is refined by the SVM [Guyon et al., 1996]. This bootstrapping method could be applied to classifications learned via unsupervised methods, such as those of [Eisen et al., 1998] and [Tamayo et al., 1999].
Similarity and distance metrics play a fundamental role in both supervised and unsupervised methods for the analysis of mRNA expression data and for other pattern recognition problems. For example, Eisen et al. cluster mRNA expression vectors using hierarchical agglomerative clustering with a Pearson correlation coefficient similarity metric. Tamayo et al. cluster gene expression data using self-organizing maps [Tamayo et al., 1999], which rely on a distance metric.
Any similarity metric defined by kernel function
,
or equivalently, any positive definite function, can be converted into
a distance function
via the equation
.
The resulting distance function will always correspond
to a true distance function, either in Euclidean N-dimensional
feature space for some N, or in infinite dimensional space
[Berg et al., 1984]. Thus any kernel can be used in an unsupervised
pattern recognition method as well as in a supervised pattern
recognition method like SVMs, so long as that method relies only on
distance (or similarity) calculations, and not on explicit
construction of the feature space (see, e.g.,
[Scholkopf et al., 1999]). This may be a fruitful area for further
research.
Finally, we note that a number of researchers have analyzed DNA microarray gene expression data with the goal of reconstructing complete regulatory pathways within the cell. The work we have presented makes no attempt to infer pathways, nor is it obvious how SVM methods could be applied to this much more complex task. However, the functional classification of genes is a prerequisite to reconstructing complete pathways, and so this work does contribute toward this goal. Using only the limited features that were available to us, we have barely scratched the surface of this problem, but the potential is significant.
Acknowledgments
The authors would like to thank Tommi Jaakkola for assistance in the development of the SVM software. Michael P. S. Brown is supported by a PMMB Burroughs Wellcome Predoctoral Fellowship. William N. Grundy is supported by a Sloan/DOE Fellowship in Computational Molecular Biology. David Haussler is supported by DOE grant DE-FG03-95ER62112 and NSF grant DBI-9808007 to David Haussler and NIH grant CA 77813 to David Haussler and Manuel Ares.