next up previous
Next: Methods Up: Support Vector Machine Classification Previous: Support vector machines

SVMs for gene expression data

The kernel function acts as a similarity metric between examples in the training set. A simple form of similarity metric is the dot product between two vectors. Previous work [Eisen et al., 1998] has employed a normalized dot product as a similarity metric. Let Xibe the logarithm of the gene expression ratio for gene X in experimental condition i as defined in Section 2. Let the normalized feature vector, ${\bf\bar{X}}$ be defined as

\begin{displaymath}\bar{X}_i = \frac{X_i} {\sqrt{\sum_{j=1}^{m} X_j^2}},
\end{displaymath} (1)

where m is the number of elements in each expression vector. The similarity between two gene expression vectors, ${\bf X}$ and ${\bf
Y}$, for the normalized dot product is defined to be ${\bf\bar{X}}
\cdot {\bf\bar{Y}}$, the dot product on the normalized feature vectors. Eisen et al. use this measure of similiarity to perform hierarchical clustering of genes. We use essentially this same similarity metric as the kernel function in a support vector machine.

Intuitively, one drawback to support vector machine classification is that the classifier is by definition based upon a planar division of the feature space. One can easily imagine a space in which a more complex separating surface more successfully divides family members from non-family members. Through the use of an appropriate kernel function, an SVM can be constructed that produces a separating hyperplane in the feature space that corresponds to a polynomial surface in the input space. This is accomplished by raising the dot product kernel to a positive integer power. Squaring the kernel yields a convex surface in the input space. Raising the kernel to higher powers yields polynomial surfaces of higher degrees. The kernel of degree d is defined by $\left({\bf\bar{X}} \cdot {\bf\bar{Y}}
+1 \right)^d$. In the feature space of this kernel, for any gene Xthere are features for all d-fold interactions between mRNA measurements, represented by terms of the form $\bar{X}_{i_1}
\bar{X}_{i_2} \ldots \bar{X}_{i_d}$, where $1 \leq i_1, \ldots, i_d
\leq 79$. We experiment here with these kernels for degrees d = 1, 2and 3, respectively, denoted below as Dot-product-1, Dot-product-2 and Dot-product-3, resp. The degree one kernel is essentially the normalized dot product kernel, and we also refer to it this way.

In a space in which the positive members of a class form one or more clusters, an accurate classifier might place a Gaussian around each cluster, thereby separating the clusters from the remaining space of non-class members. This effect can be accomplished by placing a small Gaussian over each support vector in the training set. If the width of the Gaussians is chosen well, then the sum of the support vector Gaussians will yield an accurate classifier. This is the technique used by most radial basis function classifiers [Schölkopf et al., 1997]. The formula for the Gaussian, or radial basis function, SVM kernel is

 \begin{displaymath}K({\bf X,Y}) = exp\left(\frac{-\vert\vert{{\bf\bar X}} - {{\bf\bar Y}}\vert\vert^2}{2
\sigma^2}\right),
\end{displaymath} (2)

where $\sigma$ is the width of the Gaussian. In our experiments, $\sigma$ is set equal to the median of the Euclidean distances from each positive training set member to the nearest negative member [Jaakkola et al., 1999].

The functional classes of genes examined here contain very few members relative to the total number of genes in the data set. This imbalance in the number of positive and negative training examples will cause difficulties for any classification method. For SVMs, the benefit gained by including a few class members on the correct side of the hyperplane may be exceeded by the cost associated with that hyperplane due to incorrectly labeled or inaccurately measured negative examples that also appear on the positive side of the hyperplane. In such a situation, when the magnitude of the noise in the negative examples outweighs the total number of positive examples, the optimal hyperplane located by the SVM will be uninformative, classifying all members of the training set as negative examples.

We combat this problem by modifying the matrix of kernel values computed during SVM optimization, as mentioned previously in Section 3. Let ${\bf K}$ be the matrix defined by the kernel function K on the training set; i.e., ${\bf K}_{ij} = K({\bf
X_i},{\bf Y_i})$. By adding to the diagonal of the kernel matrix a constant whose magnitude depends upon the class of the training example, one can control the fraction of misclassified points in the two classes. This technique ensures that the positive points are not regarded as noisy labels. For positive examples, the diagonal element is given by

 \begin{displaymath}{\bf K}[x,x] = K(x,x) + \lambda \frac{n^+}{N},
\end{displaymath} (3)

where n+ is the number of positive training examples, N is the total number of training examples, and $\lambda$ is a scale factor. A similar formula is used for the negative examples, with n+ replaced by n-. In the experiments reported here, the scale factor $\lambda$ is set to 1. When the number of positive examples is small, this technique has the effect of forcing the positive examples to be relatively far from the hyperplane, whereas the negative examples can be closer. In this way, the SVM avoids the uninformative solution of classifying all positive examples as errors. This is discussed in more detail in Appendix A.


next up previous
Next: Methods Up: Support Vector Machine Classification Previous: Support vector machines
Michael Brown
1999-11-05