In this section we demonstrate the use of FIMs for modeling domains. We stress that this is mostly for illustrative purposes, so we will not go deep into any biological implications of the model or alignment. We use the SH2 domain, which is found in a variety of proteins involved in signal transduction, where it mediates protein-protein interactions. For a review see [Kuriyan & Cowburn, 1993]. The domain has a length of about 100 amino acids.
Initially a file was created with 78 SH2-containing proteins by searching SWISS-PROT (release 30) for the keyword `SH2 domain', see Table 2. Fifty models of length 100 were trained in batches of 10, with FIMs at both ends and without surgery. For each batch the model with the best overall score was examined. Of these 5 models, the best scoring one happened to cover the SH2 domain almost entirely. It started about 20 amino acids prior to the domain and ended about 20 amino acids early compared to the alignment in [Kuriyan & Cowburn, 1993]. Some of the other models also covered part of the domain, whereas others had picked up a different signal. This signal was probably the kinase catalytic domain of some of the proteins in the file. It is quite remarkable that the model can find the domain completely unsupervised, although this might not always be the case.
Using this first model, a search was made of the entire SWISS-PROT database, and all sequences with a Z-score greater than 7 were examined (not taking sequences with many `X' characters into account), see Table 2. All the sequences in the training set were among the high-scoring ones, except a fragment of length 17, which was then deleted from the data set. Of the high-scoring sequences that were not in the training set, 10 had Z-scores of 12 or more, and after checking the alignment we consider it certain that they contain SH2. The remaining one, with a score of 7.1, we also believe contains SH2. All these high-scoring sequences were now included in the data set. The old and new sequences in the training set are listed in Table 2, and all other sequences with Z-scores larger than 4 are shown in Table 3.
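The selection step described above can be sketched as follows. This is only an illustration, assuming the Z-scores have already been computed by the search: the function name `select_hits`, the input layout, and the cutoff `max_x_fraction` for "many `X' characters" are hypothetical, since the text does not state how many `X' characters were considered too many.

```python
def select_hits(hits, training_ids, z_cutoff=7.0, max_x_fraction=0.1):
    """Split database hits into known training sequences and new candidates.

    hits: iterable of (seq_id, sequence, z_score) tuples, with Z-scores
          assumed to come from a previous database search.
    training_ids: set of identifiers already in the training set.
    max_x_fraction: hypothetical cutoff for "many X characters".
    """
    known, new = [], []
    for seq_id, sequence, z_score in hits:
        if not sequence or z_score <= z_cutoff:
            continue  # empty sequence or score below the cutoff
        if sequence.count("X") / len(sequence) > max_x_fraction:
            continue  # too many unknown residues to trust the score
        target = known if seq_id in training_ids else new
        target.append((seq_id, z_score))
    return known, new


# Toy input with made-up identifiers and scores (not data from the paper).
toy_hits = [
    ("SEQ_A", "ACDE", 12.0),
    ("SEQ_B", "AXXX", 9.0),
    ("SEQ_C", "ACDE", 5.0),
    ("SEQ_D", "ACDF", 7.1),
]
known, new = select_hits(toy_hits, training_ids={"SEQ_A"})
```

Sequences returned in `new` would be the candidates to inspect by alignment before adding them to the training set.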
Table 2:
All the sequences with a Z-score greater than 7 using the first model.
The score is shown in the columns labeled `Z 1'.
All except the ones marked by `***' were part of the training set
initially extracted from SWISS-PROT. The ones marked by `***' were
subsequently included in the training set.
The columns labeled `Z 2' show the Z-scores for the new model.
The first training set also contained the fragment KLCK_RAT of length 17.
It had a Z-score of 0.18 with the first model and was therefore
removed from the training set.
Table 3:
All the sequences that have a Z-score of more than 4 but are not shown in
Table 2. The column labeled `Z 1' contains the
scores from the search with the first model, and the one labeled `Z 2' the
scores with the second model.
Now the model was modified by deleting the first 20 modules and inserting 30 `blank' modules at the end, so that it could better fit the domain. Starting from this modified model, 20 new models were trained on the new set of 88 protein sequences, and the best one was selected. This was the final model for the SH2 domain, see Figure 8. Using this model a new search was performed. The new model picked up all of the 88 sequences in the training set, and the smallest score was significantly higher than that of the first model (13.8 compared to 7.1), see Table 2. It also found a new sequence (KFLK_RAT) with a Z-score of 8.4; this is a fragment that SWISS-PROT describes as containing part of the SH2 domain. All other sequences in SWISS-PROT had Z-scores below 5.5, and in the highest scoring ones (Table 3) we did not see any signs of the SH2 domain.
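The shift of the model along the sequence can be pictured as a simple operation on the list of modules. The sketch below is not how SAM represents a model; `blank_module` and `shift_model` are hypothetical names, and we assume for illustration that a `blank' module starts out with a uniform distribution over the 20 amino acids.

```python
def blank_module(alphabet_size=20):
    # Assumption: a `blank' module has a uniform match-state distribution.
    return {"match_emissions": [1.0 / alphabet_size] * alphabet_size}


def shift_model(modules, n_delete_front=20, n_append_blank=30):
    """Drop modules from the start and append blank ones at the end."""
    if n_delete_front > len(modules):
        raise ValueError("cannot delete more modules than the model has")
    kept = modules[n_delete_front:]
    return kept + [blank_module() for _ in range(n_append_blank)]


# A stand-in for the first model of length 100.
first_model = [blank_module() for _ in range(100)]
modified_model = shift_model(first_model)
print(len(modified_model))  # 100 - 20 + 30 = 110
```

Under this reading the modified model has 110 modules, and the new modules are then shaped by retraining.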
The SH2 domain often occurs several times in the same protein, which is not modelled properly by such a model (see Section 3.6). When training, all domains in a given protein contribute to the model (because all paths are taken into account), but when aligning, only the occurrence that matches the model best will be found. There are ways to find all occurrences, by finding suboptimal paths or by masking domains already found; the latter is currently being added to SAM.
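The masking idea can be illustrated with a toy example. The sketch below is not the SAM implementation: instead of aligning to an HMM it uses a hypothetical stand-in scorer, `best_window`, which counts identities to a short motif in an ungapped window. The point is only the loop that finds the best occurrence, masks it with `X', and searches again. We assume the motif itself contains no `X'.

```python
def best_window(sequence, motif):
    """Toy scorer: best ungapped window by number of identical residues.

    Returns (start, end, score); score is -1 if the sequence is too short.
    """
    best = (0, 0, -1)
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        score = sum(a == b for a, b in zip(window, motif))
        if score > best[2]:
            best = (start, start + len(motif), score)
    return best


def find_all_occurrences(sequence, motif, min_score):
    """Repeatedly find the best occurrence and mask it with 'X'."""
    if min_score < 1:
        raise ValueError("min_score must be at least 1 so the loop ends")
    found = []
    masked = list(sequence)
    while True:
        start, end, score = best_window("".join(masked), motif)
        if score < min_score:
            break
        found.append((start, end, score))
        masked[start:end] = "X" * (end - start)  # hide what was found
    return found


# Made-up sequence containing the made-up motif twice.
occurrences = find_all_occurrences("AAWYRGKAAAWYRGKA", "WYRGK", min_score=5)
print(occurrences)  # [(2, 7, 5), (10, 15, 5)]
```

With a real model, the scorer would be replaced by the alignment of the masked sequence to the model, and the stopping rule by a score cutoff appropriate for that model.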