next up previous
Next: Hardware for biosequence analysis Up: Introduction Previous: Introduction

Biosequence analysis methods

Perhaps the most commonly desired analysis is the determination of the similarity between two biosequences. Computational biologists have several metrics for comparison which involve different costs between nucleotides as well as the possible use of affine costs or gap penalties to indicate an added penalty for commencing a series of deletions or insertions in the matching [3, 5, 20, 26, 28]. Exact sequence comparison (with or without gaps) is an tex2html_wrap_inline418 dynamic programming algorithm: on a sequential machine, time proportional to the square of the sequence length is required to solve the problem.gif

Distance calculation is governed by a simple recurrence. The cost of transforming a string a into another string b, assuming for convenience that both are of length n, is the solution tex2html_wrap_inline430 of the recurrence:

  eqnarray52

where tex2html_wrap_inline432 is the null character, and tex2html_wrap_inline434 is the cost of not matching tex2html_wrap_inline436 to any character in b. Edit distance, the number of insertions or deletions required to change one sequence to another, can be calculated by setting tex2html_wrap_inline440 , and tex2html_wrap_inline442 when the two characters match, and 2 otherwise.

The recurrence can be efficiently mapped to a parallel processor in several ways, perhaps the most obvious being to assign the i-th column of the dynamic programming matrix, as well as tex2html_wrap_inline436 , to processor PE tex2html_wrap_inline448 . Comparison with gap penalties involves three recurrences of a similar form.

An efficient and widely used statistical heuristic for sequence analysis has been developed by Altschul and colleagues in implemented in the BLAST program [1]. The governing (simple) recurrence is:

eqnarray68

The tex2html_wrap_inline450 values are a table that reflects the evolutionary cost in mutating one amino acid to another (for example, variations on Dayhoff's matrix [9]). This method reduces the cell calculation from 3 minimizations and additions (9 with gaps) to a single addition. Statistical methods are then used to calculate, based on the tex2html_wrap_inline452 diagonal point-mutation scores, a good alignment between sequences. The methods can include a window parameter, roughly corresponding to fixing a maximum gap length.


next up previous
Next: Hardware for biosequence analysis Up: Introduction Previous: Introduction

Rey Rivera
Tue Jul 30 14:16:55 PDT 1996