
   
DNA microarray data

Each data point produced by a DNA microarray hybridization experiment represents the ratio of expression levels of a particular gene under two different experimental conditions. An experiment starts with microarray construction, in which several thousand DNA samples are fixed to a glass slide, each at a known position in the array. Each sequence corresponds to a single gene within the organism under investigation. Messenger RNA samples are then collected from a population of cells subjected to various experimental conditions. These samples are converted to cDNA via reverse transcription and are labeled with one of two different fluorescent dyes in the process. A single experiment consists of hybridizing the microarray with two differently labeled cDNA samples collected at different times. Generally, one of the samples is from the reference or background state of the cell, while the other sample represents a special condition set up by the experimenter, for example, heat shock. The level of expression of a particular gene is roughly proportional to the amount of cDNA that hybridizes with the DNA affixed to the slide. By measuring the ratio of each of the two dyes present at the position of each DNA sequence on the slide using laser scanning technology, the relative levels of gene expression for any pair of conditions can be measured [Lashkari et al., 1997, DeRisi et al., 1997]. The result, from an experiment with n DNA samples on a single chip, is a series of n expression-level ratios. Typically, the numerator of each ratio is the expression level of the gene in the condition of interest to the experimenter, while the denominator is the expression level of the gene in the reference state of the cell.

The data from a series of m such experiments may be represented as a gene expression matrix, in which each of the n rows consists of an m-element expression vector for a single gene. In our experiments the number of experiments m is 79 and the number of genes n is 2467. Following Eisen et al., we do not work directly with the ratio as discussed above but rather with its logarithm [Eisen et al., 1998]. We define $X_i$ to be the logarithm of the ratio of gene X's expression level in experiment i to X's expression level in the reference state. This log ratio is positive if the gene is induced (turned up) with respect to the background and negative if it is repressed (turned down).
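As a minimal sketch of the log-ratio transformation described above, the following Python fragment converts a small table of expression ratios into a gene expression matrix of log ratios. The function name `log_ratio_matrix`, the toy values, and the choice of a base-2 logarithm are assumptions made for illustration; the text states only that the logarithm of the ratio is used.

```python
import math

def log_ratio_matrix(ratios):
    """Convert an n-by-m table of expression ratios into log ratios.

    ratios[g][i] is the expression level of gene g in experiment i divided
    by its expression level in the reference state. Base 2 is an assumption
    for this sketch; the source does not specify the base of the logarithm.
    """
    return [[math.log2(r) for r in row] for row in ratios]

# Hypothetical toy data: 3 genes (rows) by 4 experiments (columns).
ratios = [
    [2.0, 4.0, 1.0, 0.5],    # induced, induced, unchanged, repressed
    [0.25, 0.5, 1.0, 1.0],
    [1.0, 8.0, 0.125, 2.0],
]

X = log_ratio_matrix(ratios)
for row in X:
    print([round(x, 3) for x in row])
```

With this convention a ratio above 1 (induced) maps to a positive entry, a ratio below 1 (repressed) maps to a negative entry, and an unchanged gene maps to zero, so induction and repression of the same magnitude are treated symmetrically.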


  
Figure 1: Expression profiles of the cytoplasmic ribosomal proteins. Figure (a) shows the expression profiles from the data in [Eisen et al., 1998] of 121 cytoplasmic ribosomal proteins, as classified by MYGD. The logarithm of the expression ratio is plotted as a function of DNA microarray experiment. Ticks along the X-axis represent the beginnings of experimental series. They are, from left to right, cell division cycle after synchronization with $\alpha$ factor arrest (alpha), cell division cycle after synchronization by centrifugal elutriation (elu), cell division cycle measured using a temperature sensitive cdc15 mutant (cdc), sporulation (spo), heat shock (he), reducing shock (re), cold shock (co), and diauxic shift (di). Sporulation is the generation of a yeast spore by meiosis. Diauxic shift is the shift from anaerobic (fermentation) to aerobic (respiration) metabolism. The medium starts rich in glucose, and yeast cells ferment, producing ethanol. When the glucose is used up, they switch to ethanol as a source for carbon. Heat, cold, and reducing shock are various ways to stress the yeast cell. Figure (b) shows the average, plus or minus one standard deviation, of the data in Figure (a).
[Figure panels: (a) ribo.ps; (b) ribo.ave.ps]

The goal of our SVM classifier is to determine accurately the functional class of a given gene based only upon its expression vector ${\bf X}$. Visual inspection of the raw data indicates that such classification should be possible. Figure 1 shows the expression vectors for 121 yeast genes that participate in the cytoplasmic ribosome. The similarity among the expression vectors is clear.
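The per-experiment summary that the figure caption describes for panel (b), the average plus or minus one standard deviation over the genes in a class, can be sketched as follows. The function name `class_profile` and the toy vectors are hypothetical, and the use of the sample standard deviation is an assumption, since the source does not say which estimator was used.

```python
import statistics

def class_profile(vectors):
    """Return per-experiment means and standard deviations.

    vectors is a list of equal-length expression vectors (log ratios),
    one per gene in the functional class. statistics.stdev computes the
    sample standard deviation, which is an assumption for this sketch.
    """
    columns = list(zip(*vectors))  # one tuple of values per experiment
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]
    return means, stds

# Hypothetical toy class: 3 genes measured in 3 experiments.
vectors = [
    [1.0, -0.5, 0.0],
    [1.2, -0.7, 0.1],
    [0.8, -0.3, -0.1],
]

means, stds = class_profile(vectors)
for i, (mu, sd) in enumerate(zip(means, stds)):
    print(f"experiment {i}: {mu:.2f} +/- {sd:.2f}")
```

A narrow band around the mean, relative to the spread of the mean across experiments, is a rough numerical counterpart of the visual similarity noted above.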

It should be noted that although the mRNA expression vectors in Figure 1 are plotted left-to-right as functions, this is only a visual convenience. The total mRNA expression data for a gene is not a single time series, but rather a concatenation of different, independent mRNA expression measurements, some of which happen to be clustered in time. Our focus here is on how to analyze large mRNA data sets such as this, which combine information from many unrelated microarray experiments. For this reason we do not explore Fourier transform or other time-series-oriented feature extraction methods here, e.g., as in [Spellman et al., 1998b], although further preprocessing of the mRNA measurements to remove bad data and reduce the noise in the short time series included in them might be helpful and will be considered in future work.


Michael Brown
1999-11-05