HMM SAM-T2K Exercise

From UCSC CS243 (Bioinformatics) Homework 2, D. Haussler

Note: This assignment has been modified from one based on SAM-T98. Please work in groups to ensure that the WWW searches go quickly. You can perform the steps discussed below with other sequences, such as the two mentioned at the end, one of which has precomputed results.

Let's pretend that we were participating in the CASP experiment in protein structure prediction. This is a kind of contest held every other year in which "target" protein sequences are proposed for which there was no known structure, and experts worldwide were asked to use their programs to predict the structure for these target sequences. Later, the structures are solved by x-ray crystallography or other experimental means, and everyone meets to evaluate how well the computational prediction methods did. The results for the best predictions appear in a special issue of Proteins.

In the fold recognition category of CASP, the participants are asked to find very, very remote homologies. They are judged not so much on the detailed accuracy of the three-dimensional coordinates that they predict (nobody using currently known methods would be judged as very successful if that were the case), but rather on whether they predict the overall fold of the protein correctly or not. Participants can do this by searching the PDB database of protein sequences whose structure is known, and predicting which if any sequences in PDB will have structure similar to that of the target sequence, sometimes without constructing an explicitly three-dimensional model for the structure of the target sequence. We will see how this is done later in the course. Of course, these methods are also very useful for practicing molecular biologists who want to get a rough idea of what a new protein they have discovered might look like or what its function might be.

Go to the CASP2 homepage. Find the link to the page describing the target sequences that you are supposed to predict structure for. Go there. Click on Target 31. That is the one we will try. You get a page giving information about t31. This is what the CASP contestants were given as information about t31. Cut and paste the sequence into a file, adding a FASTA sequence ID label to the start of your file and call this file "t31.seq".

Go to the BLAST database search page, paste the t31 sequence in, and do a search exactly as you did in hw1, part 2, except (1) this time set the database you search to "pdb" instead of "nr" and (2) do not restrict the search to human proteins. (You can use basic BLAST this time, but feel free to try any of the more advanced versions of BLAST as well. PSI-BLAST is generally the best, and is the most popular.) You should definitely find 1AGJ. It gets a great E-value. Look at the alignment and you see that this is t31 itself. Since the time of the contest, the structure of t31 has been solved an it has been deposited in the PDB database. You should get one or two other "hits". Look at their E-values, and their alignments.

Q1 and Q2: List these other "hits" and their E-values, and for each one say if you think it is a real homology or a chance match.

Enough of BLAST. Let's try hidden Markov models (HMMs). From the UCSC HMM Applications page, the "SAM-T02 Alignment, HMM, Database Query, and secondary structure prediction" link This will perform both a model library (template) search and build a SAM-T2K model of your query for a database (target) search, and then average the results. Type in your email address, paste in the t31 sequence, and submit your query. It will take some time to get your answer back by email (say an hour or so, depending on machine load.) This script is taking the t31 sequence and scoring it against a collection of hidden Markov models, one model for each sequence in a select, representative subset of the PDB database.

While you are waiting, why not try using the SCOP Superfamily server, which is based on SAM-T99.

The target search is more complicated: it builds an HMM from your sequence, and then scores every sequence in PDB against this one HMM. Actually, the process, called SAM-T2K, is a bit more involved than that. First it finds some close relatives of your query sequence in the nr database using a variant of BLAST called WU-BLAST, then it uses these to build an HMM. Then it uses this HMM to search the database again, and finds some more sequences. It repeats this process a third time, creating a final HMM. It uses this final HMM to score all sequences in the database you specified on the web page (in this case PDB), and reports the good hits. It learns more about the family of proteins you are searching for in each of these three iterations. PSI-BLAST works in a similar fashion, but is not as sensitive as SAM-T2K for very remote homologies.

Look at the file you get back by email from the SAM server. For every hit that scored an (estimated) Evalue above 1.0, it has listed the sequence name and the score for this hit in the output file. We'll say more about these scores in a later assignment. They are different from BLAST scores. It also gives the alignments for the top ten hits, although not in as nice a format as BLAST. (Anyone interesting in writing a routine that puts alignments in BLAST output format? It would be welcome.) Note that one of the hits is to the protein 5ptp. The alignment is:

; 1 t0031;
; 2 5ptp
   1 evsaeeikkheekwnkyygvnafnlpkelFSKVDEKDRQKYPYNTIGNVfvkGQTSATGVLIGKN
   2 .............................IVGGYTCGANTVPYQVSLNS...GYHFCGGSLINSQ


   1 TVLTNRHIAkfanGDPSKVSFRPSINTDDNGNTETPYGEYEVKEILQEPFGAGVDLALIRLKpdq
   2 WVVSAAHCY....KSGIQVRLGEDNINVVEGNEQFISASKSIVHPSYNSNTLNNDIMLIKLK...


   1 NGVSLGDKISPAKIGTSNdLKDGDKLELIGYPFDHKVNQM-----HRSEIELTTLSRGLRYYGFT
   2 SAASLNSRVASISLPTSC.ASAGTQCLISGWGNTKSSGTSYPDVLKCLKAPILSDSSCKSAYPGQ


   1 V-----------------PGNSGSGIFNsNGELVGIHSSKVsHLDREHQINYGVGIGNYVKRIIN
   2 ITSNMFCAGYLEGGKDSCQGDXGGPVVC.SGKLQGIVSWGS.GCAQKNKPGVYTKVCNYVSWIKQ


   1 E-KNE
   2 TIASN
Q3: Are there parts of this alignment that look convincing, and if so, which parts? Do you think these sequences are remote homologs or is this just a chance match?

Before spending too much time on the previous question, you might want to check out the structural relationship between these two proteins. To do this, go the FSSP database of protein structure classification. Type in 1AGJ and hit "return". FSSP uses a special, nonredundant subset of PDB that contains only "representative" protein chains, no two of which are identical or almost identical. (FSSP's representative set is different from the representative set used in the UCSC PDB HMM library.) This page shows that 1agjA (chain A of the PDB entry 1AGJ) is chosen by FSSP as a representative sequence for 4 PDB sequences: itself, 1agjB (the other chain in the PDB entry 1ADJ, which is identical to 1agjA), 1exfA and 1qtfA (also extremely similar). Now click on 1agj-A. You get a list of representative PDB sequences that are structurally related to 1AGJ. For each PDB structure you see its 4 letter PDB identifier (and possibly another letter to identify a particular chain), a "Z" score indicating how similar its structure is to that of 1AGJ (a high Z score, say Z > 7, indicates close structural relationship), the length of the part of both sequences that can be aligned ("LALI"), the total length of the second sequence ("LSEQ2"), the percentage of amino acids in one that are the same as the corresponding amino acids in the other sequence ("%IDE"), and the common protein name for the second sequence.

Notice that 5ptp is listed with a Z score of 19.2. A high Z score like this, and an alignment covering almost the entire protein, indicates that the structures of 1agjA and 5ptp are very similar. These structure-structure comparisons are performed by the DALI program, written by Liisa Holm. This program is used in the CASP contest, along with other programs and human intuition, to judge whose answer is right.

Q4: Which of the hits from SAM-T2K does DALI say are structurally related to 1agjA?

To answer this, you will have to query DALI for each of these sequences, because many of them might not be chosen as representatives by DALI, but they may be similar to representatives that are on the list of representatives that are structurally related to 1agj. All representatives with a structural match to 1agjA with a Z score of > 7 can be considered to be definitely structurally similar. Q5 (optional) ((1) The FSSP server also gives an alignment for each structurally related protein, showing parts of the protein that match parts of the target. See how closely these alignments agree with the HMM alignment. You might try using the "compare two alignments" function on the HMM tools web page. Be aware that it is very hard to get the alignment right between two remote homologs from just the sequences themselves, not knowing the structure. You can also use DALI directly to get a structural alignment of 1try and 1agj-A. Look at the structures and see if this alignment makes sense. Also, the two different HMM methods give different alignments. Which is better? (2) Try the CATH and SCOP protein structure web pages (find them under the "more protein folding links" on the resources page for the class.) (3) If you work in a molecular biology lab, try SAM-T98 on a protein that is of interest to the research in your lab. This could be the start of a good class project. )

Q6: You may also want to take a look at the Casp 4 target t125. On this sequence, SAM-T2K finds the homologue 3lynA which is missed by blast. Unfortunately for us, this homology was mentioned in the submitter's notes.

The precomputed t125 results of the SAM-T99 search are available at the fully automated version of CASP4, called CAFASP2.

One of the quickest sequences to try against the server is 2crd, which you could also check as discussed above. When you receive the SAM-T2K alignment from the server, try doing a search locally with it. Take the alignment and use the w0.5 script to build and HMM and then use hmmscore to search a local database using that HMM.

Running SAMT99 (SAMT2K is not yet distributed) locally

With a local version of SAM-T99, the basic command for using this HMM method is (assuming you have a copy of 2crd in the file 2crd.seq)

target99 -seed 2crd.seq -out 2crd-t99

This will produce a SAM-T99 multiple alignment of sequences similar to 2crd in the file 2crd.a2m. The next steps that the WWW server takes is to create an HMM from this multiple alignment, using

w0.5 2crd-t99.a2m 2crd-t99.mod
This script performs sequence weighting on the alignment and constructs a SAM HMM. Next, you can use the resulting model to search a database such as PDB protein sequences
hmmscore 2crd-t99 -modelfile 2crd-t99.mod -db pdbprotein -selectmdalign 4
to create a `2crd-t99.dist' file scores of all sequences, and a multiple domain alignment, `2crd-t99.mult' with scores `2crd-t99.mstat', of sequences scoring E-values better than the default 0.01.

If you have not yet taking the several weeks of computer time required to create a full model library, you can perform a SAM-T99 model library search of Murzin et al's SCOP database, an excellent manual hierarchical classification of 820 PDB structures into folds, superfamilies, families, and the like. This search engine has just been put together by Jong Park at MRC, and will be quicker than a SAM-T99 search because of the smaller database and less overloaded server (and possibly more recent machines)

You can also use the basic SAM programs, as was done in first example.


Modified from David Haussler's Bioformatics Class The CMP 243 Class Page.