This paper presents results of blind predictions submitted to the CASP3 protein structure prediction experiment. We made predictions using the SAM-T98 method, an iterative hidden Markov model based method for constructing protein family profiles. The method is purely sequence based--using no structural information--and yet was able to predict structures as well as all but five of the structure-based methods in CASP3.
Kevin Karplus, Christian Barrett, Melissa Cline, Mark Diekhans, Leslie Grate, Richard Hughey
17 May 1999
HMMs combine the best aspects of weight matrices and local sequence alignment methods, and can be used to assign probabilities to proteins in database search [6]. Our HMM fold-recognition method differs from protein threading methods [10,19,14,15] in that pairwise interactions are not modeled or used. Instead, we employ Bayesian methods [3,2,17] to incorporate prior information in the form of Dirichlet mixture densities [20] over position-specific amino acid distributions. The components of the mixture reflect different patterns of sequence conservation and can be combined with data from aligned homologs to form data-dependent estimates of amino-acid probabilities.
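The way a Dirichlet mixture prior is combined with observed counts can be sketched as follows. This is a minimal illustration of the standard posterior-mean estimate for one alignment column, not the SAM implementation: the three-letter alphabet, the two mixture components, and all function names are assumptions chosen for brevity.

```python
import math

def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

def posterior_mean(counts, mixture):
    """Posterior mean letter probabilities for one alignment column.

    counts  : observed (possibly weighted) counts, one per letter
    mixture : list of (q_j, alpha_j) pairs, where q_j is the mixture
              weight and alpha_j the Dirichlet parameters of component j
    """
    # Log of q_j * P(counts | component j). The multinomial coefficient
    # is identical for every component, so it cancels on normalization.
    log_post = []
    for q, alpha in mixture:
        summed = [a + n for a, n in zip(alpha, counts)]
        log_post.append(math.log(q) + log_beta(summed) - log_beta(alpha))

    # Normalize in a numerically stable way to get P(component j | counts).
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    z = sum(unnorm)
    resp = [u / z for u in unnorm]

    # Mix the per-component posterior means.
    total = sum(counts)
    probs = [0.0] * len(counts)
    for r, (_, alpha) in zip(resp, mixture):
        denom = total + sum(alpha)
        for i, (a, n) in enumerate(zip(alpha, counts)):
            probs[i] += r * (a + n) / denom
    return probs

# Hypothetical two-component mixture over a three-letter alphabet:
# one component favors the first letter, the other is uniform.
toy_mixture = [(0.5, [10.0, 0.1, 0.1]), (0.5, [1.0, 1.0, 1.0])]
no_data = posterior_mean([0, 0, 0], toy_mixture)
some_data = posterior_mean([0, 5, 0], toy_mixture)
```

With no observed counts the estimate is simply the prior mean of the mixture; as counts accumulate, the components that explain the data best receive more weight and the estimate moves toward the observed frequencies.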
In the CASP3 experiments, we used the recently developed SAM-T98 remote homology detection method to compare the CASP3 targets against a database of proteins whose structures are known (Section 2). We discuss how successful this method was in finding similar structures for the targets in Section 3, and discuss the lessons learned in Section 4.
A prediction server using the SAM-T98 method discussed here is available on the World-Wide Web, as are documentation and licensing information for the SAM hidden Markov model software suite [9].
During each round of iteration, the score threshold in step 2 is made less stringent in order to capture less similar sequences that are still, we hope, homologs. The final multiple alignment, called the SAM-T98 alignment, is used to construct the HMM used for database search and alignment.
For CASP3, we first built a SAM-T98 HMM for every sequence in a representative set of structure templates from PDB [4] and for every target sequence. To find possible templates for a target sequence, we scored all of PDB with the target HMM, scored the target sequence with every template HMM, and summed the two scores. The structures corresponding to the best summed scores were then investigated manually. For most targets, we submitted only one structure as our prediction--usually the best-scoring one. If we had a high-scoring PDB sequence that was not in our template library, we sometimes augmented the template library with an HMM built from this PDB sequence, in order to be able to compare summed scores. We ended up with about 2100 HMMs in our template library.
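The two-way scoring described above can be sketched as follows. The function name, the template identifiers, and the cost values are hypothetical and serve only to illustrate the bookkeeping; costs are treated as "lower is better", and a template with only one of the two scores cannot be ranked by the summed score.

```python
def rank_templates(target_hmm_costs, template_hmm_costs):
    """Rank templates by the sum of the two directional costs.

    target_hmm_costs   : {template_id: cost of the template sequence
                          scored with the target HMM}
    template_hmm_costs : {template_id: cost of the target sequence
                          scored with that template's HMM}
    Lower (more negative) cost is better.
    """
    # Only templates scored in both directions have a summed cost.
    shared = target_hmm_costs.keys() & template_hmm_costs.keys()
    summed = {t: target_hmm_costs[t] + template_hmm_costs[t] for t in shared}
    return sorted(summed.items(), key=lambda item: item[1])

# Hypothetical costs, for illustration only.
target_hmm_costs = {"tmplA": -12.0, "tmplB": -3.5, "tmplC": -20.0}
template_hmm_costs = {"tmplA": -8.0, "tmplB": -1.0}
ranking = rank_templates(target_hmm_costs, template_hmm_costs)
```

In this toy example "tmplC" scores well with the target HMM but has no template HMM, so it is left out of the ranking; this is the situation in which we sometimes added a new HMM to the template library.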
Since we predicted on all of the targets for CASP3, we have divided them into three categories to simplify their evaluation. These categories are based on the difficulty of finding the correct structure. Those targets that had very similar sequences of known structure have been placed in the easy targets category, while those that had only more distantly related known structures are members of the moderately difficult targets category. Those targets that had little or no similarity to known structures are in the very difficult targets category. Table 1 shows the results for the first two categories. Except for T0085, the multi-heme cytochrome, a submitted cost less than -9 was a successful prediction, though scores as strong as -27.37 would have been incorrect, had we not filtered out those predictions by hand.
Our submitted alignments for these targets were generally the automatically produced alignments, sometimes subject to minor hand editing. Figure 1 shows our predicted alignment of T0074 to the template structure 2scpA. It shows that our alignment was quite accurate, apart from the first region which is shifted by two residues. The figure is also an example of one of the few cases where hand-editing improved on the automatic alignment. In this case, the automatic alignment had shifted the 21 residues PWAVKPEDKAYKYDAIFDSLS of T0074 7 residues toward the N-terminal region of 2scpA, while the hand alignment shifted them 4 residues toward the C-terminus.
Overall, our alignments for the easy targets were usually among the best alignments submitted to CASP3, even though we used no structure information in generating them.
Our 3D prediction for target T0076 was quite poor. We aligned T0076 to 1almC (a theoretical model) because our top hit, 2mysC, had a probable mistracing. We thought that 1almC corrected this mistracing, but since it did not, our 3D prediction was poor even though the sequence alignment was accurate. We would have done better to use the second-highest-scoring template (1wdcC), which has an accurate 3D structure.
Because of the low similarity between the targets and templates, even the ``correct'' predictions had alignments that were accurate only for portions of the target sequence. We used local alignment to find the folds, but global alignment to provide the submitted alignment. The global alignments generally aligned more residue pairs than the correct structural alignments, but if we had submitted the local alignments, we would have missed many of the residue pairs that were correctly predicted. Because RMS deviation is very sensitive to over-prediction, our RMS scores for the entire alignment look poor, even though we often have a well-predicted core alignment. Determining which parts of an alignment are worth predicting and which should be removed remains a difficult problem for us.
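The sensitivity of RMS deviation to over-prediction is easy to see with a small calculation. The sketch below assumes the structures are already superposed and uses invented per-residue deviations (in Angstroms); it is not derived from any CASP3 target.

```python
import math

def rms(deviations):
    """Root-mean-square of per-residue deviations after superposition."""
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))

# Invented numbers: 80 well-predicted core residues and
# 20 over-predicted residues that are far from their true positions.
core = [1.0] * 80
over_predicted = [15.0] * 20

rms_core = rms(core)
rms_all = rms(core + over_predicted)
```

Because the deviations are squared before averaging, the 20 badly placed residues dominate the total: the RMS over all 100 residues is several times larger than the RMS of the core alone, even though 80% of the alignment is good.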
Our T0085 prediction (2cthA) was an incorrect multi-heme cytochrome, despite the high score. Matching three heme-binding sites provided a strong similarity signal, even though the overall fold was different. The correct multi-heme cytochrome was in our top 6 hits (out of about 2000 templates).
For T0053, we were misled by our post-scoring sequence analysis. We considered the correct template 1ak1 for T0053 (our 10th highest-scoring template), because it scored well in one of our template libraries and was also a chelatase. We correctly rejected our top hit (1djxB), because it did not cluster the known metal-binding residues in T0053 well. We chose 1fvkA (our 8th highest-scoring template), because it clustered the residues well. Unfortunately, we did not analyze the clustering on 1ak1. The target HMM (which gave the erroneous high score to 1djxB) was poor because the SAM-T98 method found only three short matches to the target in the non-redundant protein database (other than the target itself), so the HMM had to generalize from very little data. In such cases, it may be advantageous to put more weight on the template HMM scores, but we did not attempt this.
For T0071, we were again misled by our post-scoring sequence analysis. We looked at, but rejected, some correct templates (1euu and 1dlhA) for the first domain because of low scores and unconvincing alignments. We wanted to find an SH3 domain, because the C-terminus of EPS15 binds to T0071 and is known to bind to an SH3 domain [18,1]. This hint from the literature was used to decide between a small number of folds, all of which had fairly weak scores with our method.
We were surprised that 2mcm turned out to be a correct prediction for T0046, because the similarity to immunoglobulins was weak and most immunoglobulins are quite similar to each other. Our alignment turned out to be terrible, as can be seen in Figure 2.
The known active-site residues for T0081 clustered well when the target sequence was aligned to 3chy. This was our rationale for choosing 3chy as our prediction. It turns out that 3chy has a similar alpha-beta-alpha structure, but threaded in a different order than T0081. If we do a circular permutation of the chain, we can get a much better superposition of the structures--unfortunately, our method did not predict this circular permutation, but instead produced an incorrect alignment. Even constructing an HMM for the chimeric sequence 3chy followed by 3chy does not allow our methods to find the permuted alignment. We would have done better to predict 1rvv1, which scored better than 3chy, and had the correct threading order. We had rejected 1rvv1, because our alignments for it did not cluster the aspartic acid residues, which we had expected.
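The idea behind the chimeric "sequence followed by itself" construction is that every circular permutation of a chain appears as a contiguous stretch of the doubled chain. The sketch below shows this for exact string matches only; the function name and the toy sequences are assumptions, and the real problem involves remote similarity, for which exact matching is of no use.

```python
def rotation_offset(seq, candidate):
    """Return the offset at which `candidate` starts in the doubled
    sequence if it is a circular permutation of `seq`, else None."""
    if len(seq) != len(candidate):
        return None
    # Every rotation of seq is a substring of seq + seq.
    index = (seq + seq).find(candidate)
    return index if index != -1 else None

# Toy sequences, for illustration only.
offset = rotation_offset("ABCDEFG", "DEFGABC")
not_a_rotation = rotation_offset("ABCDEFG", "DEFGACB")
```

With distantly related sequences, the doubled model has to be matched by alignment rather than by substring search, and in our case the similarity was too weak for this to recover the permuted alignment.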
During the early part of CASP3, we predicted ``new fold'' for targets that produced only weak scores to template structures. For this reason we predicted that T0043 would be a new fold, even though our top hit turned out to be the correct fold.
There were a number of targets for which the correct structural template was in our list of top 10-20 hits, but we were not able to pick it out. These targets, with the rank (out of approximately 2000) of the correct hit in parentheses, are T0043(1), T0053(16), T0054(9), T0059(16), T0063(10), T0067(16), T0071(6), T0085(6). We hope that small improvements to the method, as well as the increase in the number of homologs in the databases, will allow the method to discriminate better in the future.
The low similarity between targets and structures in this category reduced alignment quality considerably compared to the alignments for the easier targets in Section 3.1. We almost always hand-edited our automatically generated alignments for these targets. In general, though, hand alignment did not provide much improvement, and we would have done about as well with considerably less effort had we submitted our automatic alignments. For example, we show one of the better hand alignments in Figure 3, but the automatic alignment it was based on did not include the incorrect alignment at the C-terminus.
Many of the fold-recognition methods do considerably better on multi-domain proteins when given the domain boundaries, but when we tested our method after CASP3 on the true domains, we gained no benefit from having that extra information [12]. We suspect that our use of local alignment and sum-of-all-paths scoring makes our method rather insensitive to the inclusion of extra domains, so there is little gain from excising them.
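The difference between sum-of-all-paths scoring and best-single-path scoring can be sketched on a toy HMM. The two-state model, its probabilities, and the function name below are invented for illustration and are unrelated to the SAM-T98 model architecture; SAM's actual scores also involve a null model, which is omitted here.

```python
def forward_and_viterbi(obs, start, trans, emit):
    """Return (probability summed over all state paths,
               probability of the single best state path)."""
    states = list(start)
    # Initialization: both recursions start from the same values.
    fwd = {s: start[s] * emit[s][obs[0]] for s in states}
    vit = dict(fwd)
    for sym in obs[1:]:
        # Forward: sum over predecessor states.
        fwd = {t: emit[t][sym] * sum(fwd[s] * trans[s][t] for s in states)
               for t in states}
        # Viterbi: keep only the best predecessor state.
        vit = {t: emit[t][sym] * max(vit[s] * trans[s][t] for s in states)
               for t in states}
    return sum(fwd.values()), max(vit.values())

# Hypothetical two-state model over a two-letter alphabet.
start = {"M": 0.5, "I": 0.5}
trans = {"M": {"M": 0.9, "I": 0.1}, "I": {"M": 0.2, "I": 0.8}}
emit = {"M": {"A": 0.8, "G": 0.2}, "I": {"A": 0.5, "G": 0.5}}

total, best = forward_and_viterbi("AAGA", start, trans, emit)
```

The summed probability is never smaller than the best-path probability, because it pools the contributions of all alternative alignments rather than committing to one of them.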
Perhaps the biggest lesson learned is that we do not know enough about proteins to adjust SAM-T98 alignments manually. We would have been better off trusting the programs even when they seemed wrong. Protein experts with more knowledge of the proteins would most likely be able to adjust the alignments better than we could. We do still get value from human interaction, but mainly in including functional information or information about known binding sites, rather than in adjusting alignments.
The costs provided by SAM-T98 are a strong, but not perfect, indicator of the correctness of the predictions. According to a calibration of our method on known structures [11], many of the targets had such weak similarities to their templates that we had little confidence in those predictions.
We believe SAM-T98 has taken sequence-only methods about as far as they will go. For many of the ``moderately difficult'' targets we did not select the correct structure even though it was in the top 20 hits, so even a small amount of additional information should improve the method significantly. We are currently investigating a few ways to include structure information: building our template library HMMs from structural multiple alignments (rather than single sequences), using information from the structure of the template to trim alignments, and using sequence-structure compatibility measures to evaluate alignments.