To evaluate the effectiveness of the massively parallel implementation (and to gain familiarity with parts of the system outside the dynamic programming inner loop), the author has been studying elongation factors, a biologically interesting class of proteins [11]. One particularly important aid has been the availability of a structural alignment of three members of the class, derived from X-ray crystallographic data. This alignment provides a ``sanity check'' on the trained HMMs.
Table 3: Elongation factor training set statistics.
Another interesting feature of the elongation factors is the large variation in sequence length, shown in Table 3. (The two shortest sequences, of 12 and 79 amino acids, were fragments and were eliminated from the training set.) The conserved region of the sequences is several hundred amino acids long, and models of 400 to 500 positions worked best. Unlike earlier experiments at UCSC, these sequences were not clipped to the region of importance. Instead, free insertion modules (FIMs), which allow low-penalty insertions, were prepended and appended to the HMM before training, so that the HMM models a subsequence of each protein rather than the whole sequence. The HMM rapidly converged to an alignment of the conserved region.
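The effect of flanking a model with FIMs can be illustrated with a deliberately simplified sketch. It assumes a toy model with match positions only (no internal insert or delete states) and assumes that the flanking modules absorb residues at zero log-odds cost; under those assumptions, the best path reduces to the best-scoring window of the sequence. The names `best_window`, `toy_model`, and `UNSEEN_SCORE` are hypothetical and are not part of the system described here.

```python
import math

# Assumed penalty for a residue that a match position does not favor.
UNSEEN_SCORE = -1.0


def best_window(seq, match_scores):
    """Return (score, start) of the best placement of the model on seq.

    match_scores holds one dict per model position, mapping a residue to
    its log-odds score. Residues outside the chosen window are treated as
    absorbed by the flanking free insertion modules at zero cost.
    """
    n = len(match_scores)
    best_score, best_start = -math.inf, None
    for start in range(len(seq) - n + 1):
        score = sum(
            match_scores[j].get(seq[start + j], UNSEEN_SCORE) for j in range(n)
        )
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


# Hypothetical three-position model that favors the motif G-T-P.
toy_model = [{"G": 2.0}, {"T": 2.0}, {"P": 2.0}]
score, start = best_window("AAAGTPAAA", toy_model)
print(score, start)  # prints "6.0 3"
```

Without the free flanking states, the three residues before and after the motif would each incur an insertion penalty, which is why unclipped sequences would otherwise need to be trimmed to the conserved region first.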
Figure 3: Alignment of four proteins with elongation factors.
About one hundred experiments were performed, each with several random seeds and each requiring around 5 minutes of 16K MP-1 CPU time (equivalent to about 5 hours on a Sparc-2 workstation). The most important parameter for generating a good HMM was the model length (the best alignment had a model length of 401), though other parameters were also varied. Interesting models were first identified by distance statistics between the training set and the trained HMM (average, maximum, and sample deviation). The three structurally aligned sequences (SELB, Ef-Tu, and FIEC2 [11]) were then aligned to the model, and the results were compared to the true alignment (automation of this process is currently under development). This weeded out many models, eventually leaving a model that produces quite good multiple alignments between elongation factor regions and can also be used to identify elongation factors in the protein sequence databases. The multiple alignment of the three test sequences and one additional sequence is shown in Figure 3. Note that the third test sequence has 394 amino acids in an insertion state before its elongation factor region begins. The first four alignment rows correspond to, and are virtually identical to, the structural alignment reported by Forchhammer, Leinfelder, and Böck [11].
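The first screening step, summarizing the distances between the training set and each trained HMM, might be sketched as follows. This is a minimal illustration only: it assumes each model yields one distance per training sequence, it interprets "sample deviation" as the sample standard deviation, and it ranks by average distance, which is an assumed criterion. The function name, the model names, and the distance values are all hypothetical.

```python
import statistics


def distance_summary(distances):
    """Summarize per-sequence distances between a training set and one HMM."""
    return {
        "average": statistics.mean(distances),
        "maximum": max(distances),
        # statistics.stdev computes the sample (n - 1) standard deviation.
        "sample_deviation": statistics.stdev(distances),
    }


# Hypothetical distances for two candidate models (not real results).
candidates = {
    "model_a": [1.0, 2.0, 3.0],
    "model_b": [2.0, 4.0, 9.0],
}

summaries = {name: distance_summary(d) for name, d in candidates.items()}

# Rank candidates by average distance, smallest first (assumed criterion).
ranked = sorted(summaries, key=lambda name: summaries[name]["average"])
print(ranked)
```

Models that pass a screen of this kind would then go on to the comparison against the structural alignment.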