Next: Conclusion Up: Results and Discussion Previous: Regularization

Case study: Modeling the SH2 domain with FIMs

 

In this section we demonstrate the use of FIMs for modeling domains. We stress that this is mostly for illustrative purposes, so we do not explore the biological implications of the model or the alignment in any depth. We use the SH2 domain, which is found in a variety of proteins involved in signal transduction, where it mediates protein-protein interactions (for a review, see [Kuriyan & Cowburn, 1993]). The domain is about 100 amino acids long.

Initially, a file of 78 SH2-containing proteins was created by searching SWISS-PROT (release 30) for the keyword `SH2 domain'; see Table 2. Fifty models of length 100 were trained in batches of 10, with FIMs at both ends and without surgery. For each batch, the model with the best overall score was examined. Of these 5 models, the best-scoring one happened to cover the SH2 domain almost entirely: it started about 20 amino acids before the domain and ended about 20 amino acids early, compared to the alignment in [Kuriyan & Cowburn, 1993]. Some of the other models also covered part of the domain, whereas others had picked up a different signal, probably the kinase catalytic domain present in some of the proteins in the file. It is quite remarkable that the model can find the domain completely unsupervised, and this might not always be the case.
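The batch-wise selection described above can be sketched as follows. This is only an illustration: the function name `best_per_batch` and the `higher_is_better` flag are hypothetical, and whether a larger or smaller value counts as the "best overall score" depends on how the training program reports scores, so the direction is left as a parameter.

```python
from typing import List


def best_per_batch(scores: List[float], batch_size: int,
                   higher_is_better: bool = True) -> List[int]:
    """Return the index of the best-scoring model in each consecutive batch.

    `scores[i]` is the overall score of the i-th trained model (assumed to
    be available already; training itself is not shown here).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    pick = max if higher_is_better else min
    winners = []
    for start in range(0, len(scores), batch_size):
        # Indices of the models belonging to this batch.
        indices = range(start, min(start + batch_size, len(scores)))
        winners.append(pick(indices, key=lambda i: scores[i]))
    return winners
```

With 50 scores and `batch_size=10` this returns 5 indices, one candidate model per batch, which can then be inspected individually.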

Using this first model, the entire SWISS-PROT database was searched, and all sequences with a Z-score above 7 were examined (disregarding sequences with many `X' characters); see Table 2. All the sequences in the training set were among these high-scoring ones, except a fragment of length 17, which was then deleted from the data set. Of the high-scoring sequences that were not already in the training set, 10 had Z-scores of 12 or more, and after checking the alignment we consider it certain that they contain SH2. The last sequence, with a score of 7.1, we also believe contains SH2. All these high-scoring sequences were then included in the data set. The old and new sequences in the training set are listed in Table 2, and all sequences with Z-scores larger than 4 are shown in Table 3.
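As a rough sketch of the filtering step, the snippet below selects database hits from precomputed Z-scores. The function name `select_hits` and the `max_x_fraction` parameter are hypothetical: the text only says that sequences with "many" `X' characters were disregarded, so the 10% limit used here is an arbitrary placeholder, not the authors' criterion.

```python
from typing import Dict, List


def select_hits(z_scores: Dict[str, float], sequences: Dict[str, str],
                z_cutoff: float = 7.0,
                max_x_fraction: float = 0.1) -> List[str]:
    """Return names of sequences scoring above `z_cutoff`, best first.

    Sequences whose fraction of 'X' residues exceeds `max_x_fraction`
    (an assumed threshold) are skipped.
    """
    hits = []
    for name, z in z_scores.items():
        seq = sequences[name]
        # Treat an empty sequence as entirely unknown.
        x_fraction = seq.upper().count("X") / len(seq) if seq else 1.0
        if z > z_cutoff and x_fraction <= max_x_fraction:
            hits.append(name)
    return sorted(hits, key=lambda n: z_scores[n], reverse=True)
```

Hits that are not already in the training set would then be checked by hand against the alignment before being added, as was done here.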

Table 2: All the sequences with a Z-score greater than 7 using the first model. The scores are shown in the columns labeled `Z 1'. All sequences except the ones marked by `***' were part of the training set initially extracted from SWISS-PROT. The ones marked by `***' were added to the training set for the second model. The columns labeled `Z 2' show the Z-scores for the new model. The first training set also contained the fragment KLCK_RAT of length 17. It had a Z-score of 0.18 with the first model and was then removed from the training set.

Table 3: All the sequences that have a Z-score of more than 4 but are not shown in Table 2. The column labeled `Z 1' contains the scores from the search with the first model, and the one labeled `Z 2' the scores with the second model.

The model was then modified by deleting the first 20 modules and inserting 30 `blank' modules at the end, so that it could better fit the domain. Starting from this modified model, 20 new models were trained on the new set of 88 protein sequences, and the best one was selected. This was the final model for the SH2 domain; see Figure 8. Using this model, a new search was performed. The new model picked up all of the 88 sequences in the training set, and the smallest score was significantly higher than that of the first model (13.8 compared to 7.1); see Table 2. It also found one new sequence (KFLK_RAT), with a Z-score of 8.4; this is a fragment that SWISS-PROT describes as containing part of the SH2 domain. All other sequences in SWISS-PROT had Z-scores below 5.5, and in the highest-scoring of these (Table 3) we did not see any signs of the SH2 domain.
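The shift of the model along the sequence can be pictured with a minimal sketch in which a model is simply a list of modules. The function `reshape_model` and the `make_blank` factory are hypothetical names used for illustration only; they are not part of SAM, and what a `blank' module contains is left open here.

```python
from typing import Callable, List, Optional


def reshape_model(modules: List[object], drop_front: int = 20,
                  add_back: int = 30,
                  make_blank: Callable[[], Optional[object]] = lambda: None
                  ) -> List[object]:
    """Drop modules from the front and append blank ones at the end.

    `make_blank` is called once per new module so that blank modules do
    not share state with each other.
    """
    if not 0 <= drop_front <= len(modules):
        raise ValueError("drop_front must be between 0 and the model length")
    if add_back < 0:
        raise ValueError("add_back must be non-negative")
    kept = list(modules[drop_front:])
    return kept + [make_blank() for _ in range(add_back)]
```

Applied to a model of length 100 with the default arguments, this gives a model of length 100 - 20 + 30 = 110 whose first module is the old module 21.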



Figure 8: The second (and final) model of the SH2 domain. The initial section is shown at a larger magnification below the complete model. The unshaded modules correspond to secondary structure elements as given in [Kuriyan & Cowburn, 1993]. These elements are βA, αB, βB, βC, βD, βE, βF, αB, and αG. The figure (except the shading) was produced with the program drawmodel in SAM. It is almost impossible to see the actual amino acid distributions at this scale, but the most important things to notice are that the distributions are quite peaked and that the delete states are rarely used in the conserved secondary structure elements, both of which indicate a good model.


The SH2 domain often occurs several times in the same protein, which is not modeled properly by a model of this kind (see Section 3.6). During training, all domains in a given protein contribute to the model (because all paths are taken into account), but during alignment only the occurrence that best matches the model is found. All occurrences can be found either by finding suboptimal paths or by masking domains already found; the latter is currently being added to SAM.
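The masking idea can be sketched as a simple loop: find the best match, record it, overwrite it with a masking character, and search again until nothing scores above a cutoff. This is an assumption-laden illustration and not SAM's implementation. The names `find_all_occurrences` and `toy_best_match` are hypothetical, and the toy scorer (identity count against an arbitrary short motif) merely stands in for aligning a sequence to the HMM.

```python
from typing import Callable, List, Optional, Tuple

Match = Tuple[float, int, int]  # (score, start, end), with end exclusive


def find_all_occurrences(seq: str,
                         best_match: Callable[[str], Optional[Match]],
                         cutoff: float, mask_char: str = "X",
                         max_hits: int = 10) -> List[Match]:
    """Repeatedly take the best match and mask it out of the sequence."""
    found: List[Match] = []
    masked = seq
    for _ in range(max_hits):
        hit = best_match(masked)
        if hit is None or hit[0] < cutoff:
            break
        _, start, end = hit
        found.append(hit)
        # Overwrite the matched region so it cannot be found again.
        masked = masked[:start] + mask_char * (end - start) + masked[end:]
    return sorted(found, key=lambda h: h[1])


def toy_best_match(seq: str, motif: str = "WYFGK") -> Optional[Match]:
    """Toy stand-in for model alignment: best ungapped window by identity."""
    best: Optional[Match] = None
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        score = float(sum(a == b for a, b in zip(window, motif)))
        if best is None or score > best[0]:
            best = (score, start, start + len(motif))
    return best
```

For example, `find_all_occurrences("AAWYFGKAAAAWYFGKAA", toy_best_match, cutoff=5)` reports both copies of the toy motif, whereas a single call to `toy_best_match` reports only the first.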



Rey Rivera
Thu Aug 29 15:28:54 PDT 1996