next up previous
Next: Evaluation Up: Massively Parallel Biosequence Analysis Previous: Hidden Markov model sequence

Massively Parallel Implementation

At their simplest level, the Maspar MP-1 and MP-2 computers are array processors with an 8-nearest-neighbor mesh connection and wraparound at the edges [24]. Each PE has 40 32-bit registers, 64 kbytes of locally addressable memory, and a bit-parallel ALU (the MP-1 has 32 4-bit PEs per chip, while the MP-2 has 32 32-bit PEs per chip). Each group of 16 PEs shares a connection to the global router which, as can be expected, is much slower than mesh communication. There is a single array control unit (ACU), a full-fledged processor responsible for broadcasting instructions to the array. Local memory access with a broadcast address is approximately 10 times slower than register access, while memory access with a locally generated address is 25 times slower.

The HMM evaluation algorithm is a 2-phase recursion on three variables, similar to affine sequence comparison. Throughout the algorithm, tex2html_wrap_inline470 probabilities are used to reduce multiplication to addition, and a per-PE table of 7600 integers speeds the addition of probabilities.

In training a model, the model is first compared to every sequence in the training set. This is done by loading the model into the MP-1, one model position per PE, as many times as possible, using recursive doubling.gif Next, a sequence is downloaded to the first processing element of each model via the DMA channel between the host and the PE array, and then the algorithm is initialized. For the forward step, the sequences are stepped through the model (using nearest-neighbor connections) to evaluate and store the forward values, and the sequence is collected again at the end PE. Next, the sequence is sent back through the array, and the probability of being in each model state given the entire sequence is calculated.

To update the model, the average probability distribution over all training sequences for each model node is required. These are generated in the array in parallel.

   figure159
Figure 2: MPL code for forward computation.

Throughout this operation, the transition probabilities and as much other data as possible are kept in the 40 local PE registers. The main loop of the forward calculation is shown in Figure 2.

A typical run of the training algorithm requires 20--60 reestimation cycles, up to a few billion cell update operations, and the process of finding a good model can require ten, one hundred, or more experiments. The training algorithm has a large number of real-valued parameters (heuristic parameters that encourage the model to use match states over insertions or deletions and govern the truncation and growing of the model, default amino acid distributions, default transition probabilities, confidence in the default distributions, and so on), and work is still underway to determine the best settings. Additionally, the annealing schedule is currently quite primitive, and results are sensitive to the random seed. It was obviously impossible to experiment extensively with the parameters on workstations requiring one day of CPU time to perform a single run; the massively parallel implementation has enabled considerable new research.

   table164
Table 2: HMM training on various machines.


next up previous
Next: Evaluation Up: Massively Parallel Biosequence Analysis Previous: Hidden Markov model sequence

Rey Rivera
Tue Jul 30 14:16:55 PDT 1996