At their simplest level, the Maspar MP-1 and MP-2 computers are array processors with an 8-nearest-neighbor mesh connection and wraparound at the edges [24]. Each PE has 40 32-bit registers, 64 kbytes of locally addressable memory, and a bit-parallel ALU (the MP-1 has 32 4-bit PEs per chip, while the MP-2 has 32 32-bit PEs per chip). Each group of 16 PEs shares a connection to the global router which, as can be expected, is much slower than mesh communication. There is a single array control unit (ACU), a full-fledged processor responsible for broadcasting instructions to the array. Local memory access with a broadcast address is approximately 10 times slower than register access, while memory access with a locally generated address is 25 times slower.
The HMM evaluation algorithm is a 2-phase recursion on three
variables, similar to affine sequence comparison. Throughout the
algorithm,
probabilities are used to reduce multiplication to
addition, and a per-PE table of 7600 integers speeds the addition of
probabilities.
In training a model, the model is first compared to every sequence
in the training set. This is done by loading the model into the MP-1,
one model position per PE, as many times as possible, using recursive
doubling.
Next, a
sequence is downloaded to the first processing element of each model
via the DMA channel between the host and the PE array, and then the
algorithm is initialized. For the forward step, the
sequences are stepped through the model (using nearest-neighbor
connections) to evaluate and store the forward values, and the
sequence is collected again at the end PE. Next, the sequence is
sent back through the array, and the probability of being
in each model state given the entire sequence is calculated.
To update the model, the average probability distribution over all training sequences for each model node is required. These are generated in the array in parallel.
Figure 2: MPL code for forward computation.
Throughout this operation, the transition probabilities and as much other data as possible are kept in the 40 local PE registers. The main loop of the forward calculation is shown in Figure 2.
A typical run of the training algorithm requires 20--60 reestimation cycles, up to a few billion cell update operations, and the process of finding a good model can require ten, one hundred, or more experiments. The training algorithm has a large number of real-valued parameters (heuristic parameters that encourage the model to use match states over insertions or deletions and govern the truncation and growing of the model, default amino acid distributions, default transition probabilities, confidence in the default distributions, and so on), and work is still underway to determine the best settings. Additionally, the annealing schedule is currently quite primitive, and results are sensitive to the random seed. It was obviously impossible to experiment extensively with the parameters on workstations requiring one day of CPU time to perform a single run; the massively parallel implementation has enabled considerable new research.
Table 2: HMM training on various machines.