next up previous
Next: Biosequence analysis methods Up: Massively Parallel Biosequence Analysis Previous: Massively Parallel Biosequence Analysis

Introduction

The goal of the Human Genome Project is to decode and understand human genetic information [4, 10, 29]. Biosequence databases currently contain on the order of tex2html_wrap_inline402 characters in tex2html_wrap_inline404 full and fragmentary biosequences; massive parallelism is required to fully analyze this vast, rapidly growing store of information.

Biologically, the DNA (deoxyribonucleic acid) located in a cell encodes the structure of an organism in thousands or millions of nucleotides -- three billion in the case of Homo sapiens. RNA (ribonucleic acid) molecules are generally shorter, containing up to a few thousand nucleotides. A nucleic acid sequence may encode one or more amino acid sequences, each sequence being a protein. Each set of three consecutive nucleotides specifies an amino acid to be added to the protein. The proteins are generated from the DNA with a complex chemical decryption algorithm involving the RNA. Proteins tend to be about 1000 amino acids long.

There are many tasks that the computational biologist would like to perform on biosequences. A few that are relevant to this discussion include:

Sequence comparison
Determine how similar, by some biologically-significant comparison function, two sequences are. This can suggest sequences for further study or analysis. A basic dynamic programming sequence comparison method, described below, requires tex2html_wrap_inline406 time, where N is the length of the sequences.
Sequence classification
Given a sequence determined in the lab, decide its type, for example whether or not it is a globin such as hemoglobin.
Sequence alignment
Find the best alignment between two sequences. That is, which parts of one sequence correspond to which parts of the other. If several important regions of the first have been experimentally determined, sequence alignment will ideally locate those regions in other sequences. Sequence alignment information can be extracted from most sequence comparison algorithms. Regions of the sequences that vary little between sequences or organisms are referred to as ``conserved'' regions.
Multiple sequence alignment
Find the best alignment among a group of sequences [5]. This differs considerably from pairwise sequence alignment, and can require tex2html_wrap_inline410 time for M sequences, if straight-forward dynamic programming is used. Thus, heuristics using pairwise alignments are more commonly used. Multiple alignments can be instrumental for creating informed guesses about sequence function.
Subsequence variations
The above problems can also be applied to finding and aligning the most similar length- tex2html_wrap_inline414 subsequences of two or more sequences, or finding an optimal value for tex2html_wrap_inline414 under some criterion.

An ultimate goal of biosequence analysis (with shades of El Dorado) is to determine a molecules' 3-dimensional structure from its 1-dimensional sequence of nucleotides or amino acids. A somewhat more practical task is to determine the secondary, or 2-dimensional, structure of a biosequence. While secondary structure is a simpler energy minimization problem than tertiary, the complexity of the energy functions and length of the molecules have steered research toward neural network pattern recognition of sequence segments that form particular structures. Current results, however, have only achieved around 60% accuracy in prediction [7]. Multiple sequence alignments can form a starting point for 2-D and 3-D structure prediction when crystallographic data is available for one or more of the aligned sequences.




next up previous
Next: Biosequence analysis methods Up: Massively Parallel Biosequence Analysis Previous: Massively Parallel Biosequence Analysis

Rey Rivera
Tue Jul 30 14:16:55 PDT 1996