HMM Basic SAM Exercise

For this exercise, we will repeat the basic operations of the SAM described in the quick overview of the SAM documentation with a different set of sequences. If you have your own collection of sequences, protein, RNA, or DNA, you should try this with them. Ideally, the sequences will be shorter than 200, and there will be 10-50 sequences, as well as a relatively small database you would like to search for.

The parts of this exercise include building a profile HMM from a set of sequences, using that HMM to create a multiple alignment of a larger group of sequences, and then using the same HMM to search a database.

Part 1 In this example, we will pick sequences from a collection of 50 globins, that you should save in your workspace. Select a number of sequences to start with. Ideally, check with your neighbors and make sure each is using a different sized set of sequences, such as 1, 5, 10, 25, and all 50. Save your selected group of sequences in a file, such as mydata.seq. You may also want save all the sequences you did not use in another file so that we can score known positive examples that were not used to train the HMM (note, however, that a harder test would be to ensure that no close homologs of your training sequences were left in your test set).

Use the buildmodel program to create a model from your set of sequences (of course, for protein sequences, SAM-T99 will do far better; try it with a single one of your sequences in the next experiment):

buildmodel test -train mydata.seq 
SAM trains 3 models in parallel and picks the best. Take a look at the resulting `test.mod' file that includes various comments about the training, and then the paramaters of the linear HMM. There are 49 parameters per model node, or perhaps 6000 that SAM is estimating from your training data. This would be impossible without extensive prior knowledge built into SAM and its regularizers.

Part 2 HMMs can be used to make multiple alignments by aligning sequences individually to the HMM. Align the sequences to the model:

align2model test -modelfile test.mod -db globins50.seq
Here, the full set of positive sequences forms your database. For this command, also try two other alignment options: `-sw 2', for local alignment, and `-sw 2 -adpstyle 5'. We have found that `posterior decoded multiple alignment' is one of the best alignment methods. Unfortunately, it is quite time consuming, as it involves both an all-paths dynamic programming and a single-best-path dynamic programming, and I have not yet implemented this as a multi-level reduced-space algorithm.

The a2m format is not particularly readable. It uses upper case for match columns, hyphens for deletes, and lowercase for unaligned (insertion) parts of the sequences. The `prettyalign' program can be used to make things a little nicer, as in

prettyalign test.a2m > test.pretty
Note that `prettyalign' is a non-standard SAM programs that has a different commandline format.

Part 3 Next, use the hmmscore program to search both your small database of positive examples. Those who began with only a single sequence may see some of the positive examples getting poor Z-scores. Iterative HMM training (as in SAM-T99) can produce better models from a single sequence than we can with a (potentially) poorly chosen set of sequences.

hmmscore pos -modelfile test.mod -db globins50.seq -sw 2
This will create `pos.dist', a list of your sequence IDs, their simple null model score, their reverse null model score (more about that when we discuss SAM-T99), and an estimated E-value based on the reverse null model score. How do the training sequences score? How do the other sequences in the database score?

Part 4 Now, let's try it with a larger database. I have created a 5000-sequence subset of SWISSPORT, sw5000.seq (2MB) or sw5000.seq.gz (1MB) with the following SAM command:

sundance:~/.html -170-> randseq sw5000.seq -nseq 5000 -db $prot/data/swissprot/sprot.fas
SAM: randseq v3.3.1 (December 20, 2001) compiled 12/21/01_15:35:19
 Database Files:  /projects/compbio/data/swissprot/sprot.fas
Picking 5000 of 115106 sequences.
Database: N=115106 Length Min=   3 Ave= 367 Max=6669 SD=334
Subset:   N=  5000 Length Min=   8 Ave= 356 Max=5430 SD=326
sundance:~/.html -171->
Download the file (or use the locally-installed version) and use the unix egrep command to see how many globins are in this subset:
egrep GLOBIN sw5000.seq
You could even `pipe' this through the word-count program, `wc' to see how many lines there are:
egrep globin sw5000.seq | wc
Of course, using annotations to do sequence searches is always a dubious process; let's see how many globin-like sequences your HMM turns up:
hmmscore all -modelfile test.mod -db sw5000.seq -sw 2
Compare these results to your keyword search, and also to those of your neighbors using different numbers of sequences.

Part 5 Use the `makelogo' program to generate a postscript version of the model.

makelogo test -modelfile test.mod
gs test.eps
(Your postscript previewer might be called `gv' or `ghostview'.)
Find the conserved columns of your alignment. SAM Version 3.3, when released, will include a sequence logos program that will produce diagrams similar to those of the SAM-T99 tutorial.

Part 6 If you were working with a set of sequences containing a common domain, you might want to perform hmmscore with the `-selectmdalign 8' option. This will iteratively look for multiple occurrences of your model within the sequences, creating a `.mstat' file that has the found domain scores and a `.mult' file that includes an a2m format multiple alignment of the found regions with the sequence IDs modified to indicate the location of the match.

After performing these basic SAM operations, we will use the SAM-T99 method to build exceedingly sensitive HMMs from a single sequence or a seed alignment in the second exercise