Perhaps the most commonly desired analysis is the determination of the
similarity between two biosequences.
Computational biologists have several metrics for
comparison which involve different costs between nucleotides as well
as the possible use of affine costs or gap penalties to indicate an
added penalty for commencing a series of deletions or insertions in
the matching [3, 5, 20, 26, 28].
Exact sequence comparison
(with or without gaps) is an
dynamic programming algorithm:
on a sequential
machine, time proportional to the square of the sequence length is
required to solve the problem.
Distance calculation is governed by a simple recurrence.
The cost of transforming a string a into another string b,
assuming for convenience that both are of length n, is the solution
of the recurrence:
where
is the null character, and
is the cost
of not matching
to any character in b.
Edit distance, the number of insertions or deletions required to
change one sequence to another, can be calculated by setting
, and
when the
two characters match, and 2 otherwise.
The recurrence can be efficiently mapped to a parallel processor in
several ways, perhaps the most obvious being to assign the i-th
column of the dynamic programming matrix, as well as
, to
processor PE
. Comparison with gap penalties involves three
recurrences of a similar form.
An efficient and widely used statistical heuristic for sequence analysis has been developed by Altschul and colleagues in implemented in the BLAST program [1]. The governing (simple) recurrence is:
The
values are a table that reflects the evolutionary cost in
mutating one amino acid to another (for example, variations on Dayhoff's
matrix [9]). This method reduces the cell calculation
from 3 minimizations and additions (9 with gaps) to a single
addition. Statistical methods are then used to calculate, based on
the
diagonal point-mutation scores, a good alignment between
sequences. The methods can include a window parameter, roughly
corresponding to fixing a maximum gap length.