next up previous
Next: Massively Parallel Implementation Up: Massively Parallel Biosequence Analysis Previous: Hardware for biosequence analysis

Hidden Markov model sequence analysis

An evolving and exciting means of biosequence analysis developed at UCSC uses a hidden Markov model (HMM) to generate position-dependent cost tables for each amino acid. Given a set of biologically-similar sequences (such as the globins), the statistical model can be iteratively trained to become a close-as-possible match to all the sequences. Thus, the HMM is a probabilistic consensus sequence that contains a probability distribution over all amino acids at each node.

The protein HMM is displayed in Figure 1 (the reader is referred to the tutorial by Rabiner for an introduction to HMMs [25]). The squares are match states, corresponding to the matching of the model state to a character from the input sequence. The diamonds are insertion states, corresponding to characters in the sequence that are not matched. The circles are delete states, corresponding to skipping over a model node when aligning the sequence to the model. The connecting arrows are transition probabilities between nodes. The zeroth state is a special begin state, and the last is a special end state (real models are much longer than the one shown in the figure). In calculating the distance between the sequence and the model, one calculates the probability that the HMM could generate the given sequence. In aligning a sequence to a model, the most probable generation of that sequence is assumed.

For information on training and updating the HMM, the reader is referred to the UCSC technical report and the related, condensed conference paper [13, 14]. Suffice to say, the vast majority of training time is spent calculating the probability of entering each state with each character of a training sequence. This is solved two-step forward-backward dynamic programming algorithm implemented on a massively parallel array processor.



Rey Rivera
Tue Jul 30 14:16:55 PDT 1996