Next: Biosequence analysis methods
Up: Massively Parallel Biosequence Analysis
Previous: Massively Parallel Biosequence Analysis
The goal of the Human Genome Project is to decode and understand human
genetic information [4, 10, 29]. Biosequence
databases currently contain on the order of
characters in
full and fragmentary biosequences; massive parallelism is
required to fully analyze this vast, rapidly growing store of
information.
Biologically, the DNA (deoxyribonucleic acid) located in a cell
encodes the structure of an organism in thousands or millions of
nucleotides -- three billion in the case of Homo sapiens.
RNA (ribonucleic acid) molecules are generally shorter, containing up
to a few thousand nucleotides. A nucleic acid sequence may encode one
or more amino acid sequences, each sequence being a protein. Each set
of three consecutive nucleotides specifies an amino acid to be added
to the protein. The proteins are generated from the DNA with a
complex chemical decryption algorithm involving the RNA. Proteins
tend to be about 1000 amino acids long.
There are many tasks that the computational biologist would like to
perform on biosequences. A few that are relevant to this discussion
include:
- Sequence comparison
- Determine how similar, by some
biologically-significant comparison function, two sequences are. This
can suggest sequences for further study or analysis. A basic dynamic
programming sequence comparison method, described below, requires
time, where N is the length of the sequences.
- Sequence classification
- Given a sequence determined in the
lab, decide its type, for example whether or not it is a globin such as
hemoglobin.
- Sequence alignment
- Find the best alignment between two
sequences. That is, which parts of one sequence correspond to which
parts of the other. If several important regions of the first have
been experimentally determined, sequence alignment will ideally
locate those regions in other sequences. Sequence alignment
information can be extracted from most sequence comparison
algorithms. Regions of the sequences that vary little between
sequences or organisms are referred to as ``conserved'' regions.
- Multiple sequence alignment
- Find the best alignment among a
group of sequences [5]. This differs considerably from pairwise
sequence alignment, and can require
time for M sequences,
if straight-forward dynamic programming is used. Thus,
heuristics using pairwise alignments are more commonly used.
Multiple alignments can be instrumental
for creating informed guesses about sequence function.
- Subsequence variations
- The above problems can also be applied
to finding and aligning the most similar length-
subsequences of
two or more sequences, or finding an optimal value for
under
some criterion.
An ultimate goal of biosequence analysis (with shades of El Dorado) is
to determine a molecules' 3-dimensional structure from its
1-dimensional sequence of nucleotides or amino acids. A somewhat more
practical task is to determine the secondary, or 2-dimensional, structure
of a biosequence. While secondary structure is a simpler energy
minimization problem than tertiary, the complexity of the
energy functions and length of the molecules have steered research
toward neural network pattern recognition of sequence segments that
form particular structures. Current results, however, have only
achieved around 60% accuracy in prediction [7]. Multiple
sequence alignments can form a starting point for 2-D and 3-D
structure prediction when crystallographic data is available for one
or more of the aligned sequences.
Next: Biosequence analysis methods
Up: Massively Parallel Biosequence Analysis
Previous: Massively Parallel Biosequence Analysis
Rey Rivera
Tue Jul 30 14:16:55 PDT 1996