Each data point produced by a DNA microarray hybridization experiment represents the ratio of expression levels of a particular gene under two different experimental conditions. An experiment starts with microarray construction, in which several thousand DNA samples are fixed to a glass slide, each at a known position in the array. Each sequence corresponds to a single gene within the organism under investigation. Messenger RNA samples are then collected from a population of cells subjected to various experimental conditions. These samples are converted to cDNA via reverse transcription and are labeled with one of two different fluorescent dyes in the process. A single experiment consists of hybridizing the microarray with two differently labeled cDNA samples collected at different times. Generally, one of the samples is from the reference or background state of the cell, while the other sample represents a special condition set up by the experimenter, for example, heat shock. The level of expression of a particular gene is roughly proportional to the amount of cDNA that hybridizes with the DNA affixed to the slide. By using laser scanning technology to measure the ratio of the two dyes present at the position of each DNA sequence on the slide, the relative levels of gene expression for any pair of conditions can be measured [Lashkari et al., 1997, DeRisi et al., 1997]. The result, from an experiment with n DNA samples on a single chip, is a series of n expression-level ratios. Typically, the numerator of each ratio is the expression level of the gene in the condition of interest to the experimenter, while the denominator is the expression level of the gene in the reference state of the cell.
The data from a series of m such experiments may be represented as a gene expression matrix, in which each of the n rows consists of an m-element expression vector for a single gene. In our data set, the number of experiments m is 79 and the number of genes n is 2467. Following Eisen et al., we do not work directly with the ratio as discussed above but rather with its logarithm [Eisen et al., 1998]. We define X_i to be the logarithm of the ratio of gene X's expression level in experiment i to X's expression level in the reference state. This log ratio is positive if the gene is induced (turned up) with respect to the background and negative if it is repressed (turned down).
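As a minimal sketch of the definition above, the following Python fragment builds a small gene expression matrix of log ratios from paired intensity readings. The function name `log_ratio_matrix`, the toy intensity values, and the choice of base-2 logarithms are illustrative assumptions; the text does not state the base of the logarithm or how the raw intensities are stored.

```python
import math


def log_ratio_matrix(condition, reference):
    """Return an n x m matrix of log expression ratios.

    condition[g][i] and reference[g][i] are hypothetical positive
    intensity readings for gene g in experiment i, measured in the
    condition of interest and in the reference state respectively.
    Base-2 logarithms are an assumption made for this sketch.
    """
    matrix = []
    for cond_row, ref_row in zip(condition, reference):
        if len(cond_row) != len(ref_row):
            raise ValueError("each experiment needs both a condition and a reference value")
        # Numerator: condition of interest; denominator: reference state.
        matrix.append([math.log2(c / r) for c, r in zip(cond_row, ref_row)])
    return matrix


# Toy data: n = 2 genes, m = 3 experiments (made-up numbers).
condition = [[400.0, 100.0, 200.0],
             [50.0, 800.0, 100.0]]
reference = [[100.0, 100.0, 400.0],
             [200.0, 100.0, 100.0]]

X = log_ratio_matrix(condition, reference)
for row in X:
    print(row)
```

Each row of `X` is the expression vector of one gene. Entries are positive where the gene is induced relative to the reference, negative where it is repressed, and zero where the two expression levels are equal.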
The goal of our SVM classifier is to determine accurately the functional class of a given gene based only upon its expression vector. Visual inspection of the raw data indicates that such classification should be possible. Figure 1 shows the expression vectors for 121 yeast genes that participate in the cytoplasmic ribosome. The similarities among the expression vectors are clear.
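The similarity that is judged visually here can also be expressed as a number. The sketch below computes the Pearson correlation between two expression vectors. This is an illustrative assumption about one reasonable similarity measure, not the SVM method described in this paper, and the function name `pearson` and the three toy vectors are made up for the example.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length expression vectors."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("vectors must have the same length of at least 2")
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    numerator = sum(a * b for a, b in zip(du, dv))
    denominator = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    if denominator == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return numerator / denominator


# Hypothetical log-ratio vectors over four experiments.
gene_a = [1.0, -0.5, 2.0, 0.0]
gene_b = [0.9, -0.4, 1.8, 0.1]    # closely tracks gene_a
gene_c = [-1.0, 0.5, -2.0, 0.0]   # mirror image of gene_a

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
print(r_ab, r_ac)
```

Genes whose expression vectors rise and fall together give a value near 1, while genes with opposite patterns give a value near -1.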
It should be noted that although the mRNA expression vectors in Figure 1 are plotted left to right as functions, this is only a visual convenience. The total mRNA expression data for a gene is not a single time series but rather a concatenation of different, independent mRNA expression measurements, some of which happen to be clustered in time. Our focus here is on how to analyze large mRNA data sets such as this, which combine information from many unrelated microarray experiments. For this reason, we do not explore Fourier transform or other time-series-oriented feature extraction methods here, e.g., as in [Spellman et al., 1998b], although further preprocessing of the mRNA measurements to remove bad data and reduce the noise in the short time series they contain might have been helpful and will be considered in future work.