This page answers common questions concerning SAM and SAM-T02. If you do not find a solution to your problem here, please inform us at sam-info@cse.ucsc.edu.
Here are the main paper citations (in BibTeX format):
@string{prosfg= "Proteins: Structure, Function, and Genetics"}
@string{jmb= "Journal of Molecular Biology"}
@string{bioinf="Bioinformatics"}
@article{SAMT98,
author="Kevin Karplus and Christian Barrett and Richard Hughey",
title="Hidden {Markov} Models for detecting Remote Protein Homologies",
journal=bioinf,
year="1998",
volume=14, number=10,
pages="846--856",
annotate="This paper provides a fairly detailed presentation
of the SAM-T98 method for finding remote homologs, including
both the method and the results on FSSP, SCOP, and PIR test sets."
}
@article{Parketal98,
author="J. Park and K. Karplus and C. Barrett and R. Hughey and D. Haussler and T. Hubbard and C. Chothia",
title="Sequence Comparisons Using Multiple Sequences Detect
Three Times as Many Remote Homologues As Pairwise Methods",
year="1998",
journal=jmb,
volume=284, number=4, pages="1201--1210",
note="Paper available at {\def\xx{\discretionary{}{}{}}
{\tt http://www.mrc-lmb.cam.ac.uk/{\xx}genomes/{\xx}jong/{\xx}assess\_paper/{\xx}assess\_paperNov.html}}"
}
@comment{The xx definition is an attempt to keep BibTeX from inserting
%, which adds an extra space---it is not entirely
successful as BibTeX STILL inserts one of the extraneous spaces.
}
@article{SAMT2K-CASP4-proteins,
author="Kevin Karplus and Rachel Karchin and
Christian Barrett and Spencer Tu and Melissa Cline and
Mark Diekhans and Leslie Grate and Jonathan Casper and
Richard Hughey",
title="What is the value added by human intervention
in protein structure prediction?",
journal=prosfg,
year=2001,
volume=45,
number="S5",
pages="86--91"
}
@article{SAM-T02,
author="Kevin Karplus and
Rachel Karchin and
Jenny Draper and
Jonathan Casper and
Yael Mandel-Gutfreund and
Mark Diekhans and
Richard Hughey",
title="Combining local-structure, fold-recognition, and new-fold
methods for protein structure prediction",
journal=prosfg,
year="2003",
note="in press, special CASP5 edition"
}
There are several possible causes. To find out more, go up one level in the URL (omitting the "summary.html"). This gives you the full directory of result files, including the error messages. Sometimes this will help you find the problem, sometimes not. The most common problem is hard to diagnose even with the error-message files, and it is one we should be detecting automatically but currently are not: an all-lowercase sequence. The SAM-T02 method uses a script that assumes that lowercase characters are insertions (not to be included in the HMM). If all the letters are lowercase, the HMM has length 0, and nothing is modeled.
The fix is to convert your sequence to all upper-case and resubmit.
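As a rough illustration of the problem and the fix (a minimal Python sketch, not part of the SAM tools; the function names `match_length` and `uppercase_fasta` are hypothetical), counting uppercase letters mimics the assumption that lowercase residues are insertions, and upper-casing the residue lines restores a non-empty model:

```python
def match_length(sequence: str) -> int:
    """Count residues that would become HMM match positions.

    Assumption: uppercase letters are match positions and lowercase
    letters are insertions, as described in the answer above.
    """
    return sum(1 for ch in sequence if ch.isalpha() and ch.isupper())


def uppercase_fasta(fasta_text: str) -> str:
    """Upper-case the residue lines of a FASTA record; keep the name line as typed."""
    fixed = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fixed.append(line)          # name/comment line is left untouched
        else:
            fixed.append(line.upper())  # residue line is forced to upper case
    return "\n".join(fixed) + "\n"


if __name__ == "__main__":
    record = ">my_protein example\nmkvlaagivg\nllaqs\n"  # made-up sequence
    print(match_length("mkvlaagivgllaqs"))  # all lowercase: no match positions
    print(uppercase_fasta(record), end="")
```

Running the sequence through `uppercase_fasta` before submitting avoids the zero-length model.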
There are a lot of outputs from our SAM-T02 web server, and we haven't had time to write an interpretation guide for all of them. (Maybe when we get some funding, and don't have to rely on volunteer labor for everything ...)
The ones to concentrate on are
The secondary structure logos show how confident the prediction is for each position. The height is the information gain (the relative entropy of the predicted probabilities and the background probabilities). Where the confidence is high, the predictions tend to be much more accurate. We have found the structure logos to be the most informative way to view the predictions.
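The logo height described above can be sketched as a relative-entropy calculation. This is only an illustration: the use of base-2 logarithms (bits), the three-letter alphabet, and the uniform background in the example are assumptions, not details taken from the server.

```python
import math


def information_gain(predicted: dict, background: dict) -> float:
    """Relative entropy of predicted vs. background probabilities, in bits.

    Both arguments map alphabet letters to probabilities. Letters with
    zero predicted probability contribute nothing to the sum.
    """
    total = 0.0
    for letter, p in predicted.items():
        if p > 0.0:
            total += p * math.log2(p / background[letter])
    return total


if __name__ == "__main__":
    # Hypothetical background: three equally likely letters.
    background = {"H": 1 / 3, "E": 1 / 3, "L": 1 / 3}
    confident = {"H": 0.9, "E": 0.05, "L": 0.05}
    unsure = dict(background)
    print(information_gain(confident, background))
    print(information_gain(unsure, background))
```

A prediction identical to the background has zero height; the more the prediction concentrates on one letter, the taller the logo column.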
There may be multiple models in a single PDB file, corresponding to different alignments or different templates. Usually the first model is the one most likely to be right.
Don't blindly trust these 3D models---at the very least look at the E-value for the template in the best-scores files. If the E-value is poor (greater than 1.0e-02, for example), then the model should be regarded as speculative.
For sequences submitted to the web server, or for alignments other than the ones we provide models for, you can create a crude model from an alignment using the server at http://predictioncenter.org/local/al2ts/al2ts.html (For more details, see below.)
The E-value is an estimate of approximately how many sequences would score this well by chance in the database searched. For SAM-T02, E-values less than about 1.0E-5 are very good hits and are very likely to have a domain of the same fold as the target. E-values larger than about 0.1 are very speculative---if your best hit is in this range then the correct fold is likely to be one of the top ten or twenty hits (unless the target is a new fold), but it is difficult to tell which of the top hits is the right one.
Between 1.0E-5 and 0.1 the goodness of the match will vary somewhat from target to target, but will often be a good match.
When you get an extremely small E-value (say 1.e-10 or
smaller), then the alignments you get from SAM-T02 may not be
any better than alignments that you get from sequence-sequence
aligners like Smith-Waterman, FASTA, or BLAST.
SAM-T02 is designed to do good fold recognition and
alignment in the difficult cases, and it may give up some
performance on the "easy" ones.
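The rules of thumb above can be written down as a small helper. This is a sketch only: the cutoffs are the approximate values quoted in the text ("about 1.0E-5" and "about 0.1"), the function name and category wording are hypothetical, and the boundaries should not be read as sharp.

```python
def interpret_evalue(evalue: float) -> str:
    """Rough reading of a SAM-T02 E-value using the approximate cutoffs above."""
    if evalue < 1.0e-5:
        # Very likely to share a domain of the same fold with the target.
        return "very good hit"
    if evalue > 0.1:
        # Correct fold may be among the top ten or twenty hits, if present at all.
        return "very speculative"
    # Between the two cutoffs the quality varies from target to target.
    return "often a good match, varies by target"


if __name__ == "__main__":
    for e in (1.0e-12, 3.0e-3, 2.5):
        print(e, interpret_evalue(e))
```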
We report predictions for three alphabets (DSSP, STRIDE,
STR [or STR2]) and one reduced alphabet (DSSP_EHL).
For some of our predictions, we also predict residue burial.
The DSSP alphabet is
defined by the DSSP program (except that we combine the rare
Pi-helix letter "I" in with the alpha helices "H"). The STRIDE
alphabet is defined by the STRIDE program [Frishman D, Argos
P. Knowledge-based protein secondary structure
assignment. Proteins. 1995 Dec;23(4):566-79.]
(again, we combine I with H).
The STR (Strand) alphabet is an enhanced version of
DSSP
currently being developed at University of California,
Santa Cruz. The concept was originated by post-doc
Yael Mandel-Gutfreund.
We have found that
two-track hidden Markov models built with a STR secondary
track are particularly good at fold recognition and
target-template alignments.
The original DSSP alphabet uses the letters
"H" (alpha helix), "B" (isolated beta-bridge), "E" (extended strand
in beta ladder), "G" (3/10 helix), "I" (pi helix), "T" (H-bonded
turn) and "S" (bend).
STR subdivides DSSP letter "E" into 6 letters, according to
properties of a residue's relationship to its strand partners.
(We
also group the rare pi helix class "I" with
the alpha helix class "H".)
In the diagram, dots indicate the strand of the residue being
assigned. In a beta sheet, this strand is either surrounded
by two parallel partners "P", two anti-parallel
partners "A" or one anti-parallel and one parallel
partner "M". Edge strands (that have only one beta
strand partner) have either a parallel partner "Q"
or an anti-parallel partner "Z". Finally, we retain
the "E" label for strand residues to which DSSP
assigns no partners (generally beta bulges).
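The subdivision of "E" described above amounts to a lookup on the number of parallel and anti-parallel strand partners. The sketch below assumes those two counts have already been extracted from DSSP output (how that is done is not shown here), and the function name is hypothetical.

```python
# (parallel partners, anti-parallel partners) -> STR letter, per the text above.
_STR_STRAND_LETTERS = {
    (2, 0): "P",  # two parallel partners
    (0, 2): "A",  # two anti-parallel partners
    (1, 1): "M",  # one parallel and one anti-parallel partner
    (1, 0): "Q",  # edge strand with a parallel partner
    (0, 1): "Z",  # edge strand with an anti-parallel partner
    (0, 0): "E",  # no partners assigned (generally beta bulges)
}


def str_strand_letter(n_parallel: int, n_antiparallel: int) -> str:
    """Return the STR letter for a DSSP "E" residue given its partner counts."""
    key = (n_parallel, n_antiparallel)
    if key not in _STR_STRAND_LETTERS:
        raise ValueError(f"a strand residue has at most two partners, got {key}")
    return _STR_STRAND_LETTERS[key]


if __name__ == "__main__":
    print(str_strand_letter(1, 1))
    print(str_strand_letter(0, 1))
```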
We have also defined STR2 and STR3 alphabets:
The ALPHA prediction is not currently provided by the web
server, but is provided on some of our pre-computed web pages.
The Alpha angle is the torsion angle of C_alpha(i-1),
C_alpha(i), C_alpha(i+1), C_alpha(i+2). We have divided the
range up into 11 classes (not mnemonically named):
The DSSP_EHL alphabet is used in CASP and
EVA
evaluations of secondary-structure prediction.
It combines all helix types (G, H, I) into one class (H), and both
beta bridges and beta strands into one class (E), with
everything else in an "other" class (variously called either C
or L).
Currently, we do not predict DSSP_EHL directly but combine our
predictions for the more detailed alphabets to get a DSSP_EHL
prediction. (We have not yet done extensive tests to see if
this is better or worse than predicting DSSP_EHL directly.)
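The reduction from DSSP letters to the three DSSP_EHL classes can be sketched as a simple mapping. Note that this only reduces an assigned DSSP string; it does not reproduce how the server combines predictions from several alphabets, which is not described here. The default "other" letter of "L" is a choice made for the example ("C" is also used).

```python
# Helix types G, H, I -> H; bridges and strands B, E -> E; everything else -> other.
DSSP_TO_EHL = {"G": "H", "H": "H", "I": "H", "B": "E", "E": "E"}


def reduce_to_ehl(dssp_string: str, other: str = "L") -> str:
    """Reduce a string of DSSP letters to the three-class DSSP_EHL alphabet."""
    return "".join(DSSP_TO_EHL.get(letter, other) for letter in dssp_string)


if __name__ == "__main__":
    print(reduce_to_ehl("HHGGTTEEBS"))
    print(reduce_to_ehl("HHGGTTEEBS", other="C"))
```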
Our burial predictions use various alphabets using letters
A-G or A-K (for 7 or 11 levels of burial). In all the burial
alphabets, A is the most exposed and burial gradually increases
to G or K, which are fully buried. Currently, the web servers
do not provide burial predictions, but we have included them on
the SARS (and soon the yeast) pre-computed predictions.
The alphabet we are currently using counts the number of C-beta
atoms in a sphere of radius 14 Angstroms around the C-beta atom of the
residue (excluding itself), as Rachel Karchin found this
alphabet to have good conservation and predictability.
[Rachel Karchin and Melissa Cline and Kevin Karplus,
"Evaluation of local structure alphabets based on residue burial",
Proteins: Structure, Function, and Genetics, in press.
]
ALPHA alphabet classes (alpha in degrees):
name  range
A     165 <= alpha < -170
B     -136 <= alpha < -103
C     -103 <= alpha < -68
D     -68 <= alpha < -17
E     -170 <= alpha < -136
F     -17 <= alpha < 8
G     8 <= alpha < 31
H     31 <= alpha < 58
I     58 <= alpha < 85
S     85 <= alpha < 140
T     140 <= alpha < 165
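The alpha-angle classes listed above can be looked up with a sorted list of boundaries. The sketch below also computes the torsion angle from four C-alpha coordinates, C_alpha(i-1) through C_alpha(i+2). It is an illustration only: the sign convention of the torsion (the standard IUPAC one) is an assumption, since the text does not state it, and the function names are hypothetical. Class A wraps around +/-180 degrees, so it appears at both ends of the lookup string.

```python
import math
from bisect import bisect_right

# Class boundaries in degrees, sorted ascending, taken from the table above.
ALPHA_BOUNDS = [-170, -136, -103, -68, -17, 8, 31, 58, 85, 140, 165]
# One letter per interval; "A" covers both alpha < -170 and alpha >= 165.
ALPHA_LETTERS = "AEBCDFGHISTA"


def alpha_class(alpha_degrees: float) -> str:
    """Map an alpha angle in [-180, 180] degrees to its class letter."""
    return ALPHA_LETTERS[bisect_right(ALPHA_BOUNDS, alpha_degrees)]


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_degrees(p0, p1, p2, p3) -> float:
    """Torsion angle (degrees) defined by four points, in (-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Made-up coordinates, not real C-alpha positions.
    angle = torsion_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
    print(angle, alpha_class(angle))
```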
Burial alphabet classes (count of neighboring C-beta atoms):
name  range
A     count < 27
B     27 <= count < 34
C     34 <= count < 40
D     40 <= count < 47
E     47 <= count < 55
F     55 <= count < 66
G     66 <= count
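The burial classes listed above can be sketched the same way: count the C-beta atoms within the sphere around each residue's C-beta atom, then look the count up in the class boundaries. This is an illustration under stated assumptions: the radius is taken to be 14 Angstroms, atoms exactly on the sphere surface are counted as inside, and the function names are hypothetical. The quadratic loop is for clarity, not speed.

```python
import math
from bisect import bisect_right

# Lower bounds of classes B..G, taken from the table above.
BURIAL_BOUNDS = [27, 34, 40, 47, 55, 66]
BURIAL_LETTERS = "ABCDEFG"


def burial_letter(count: int) -> str:
    """Map a neighbor count to its burial class (A most exposed, G most buried)."""
    return BURIAL_LETTERS[bisect_right(BURIAL_BOUNDS, count)]


def neighbor_counts(cb_coords, radius: float = 14.0):
    """For each C-beta position, count the other C-beta atoms within `radius`."""
    counts = []
    for i, center in enumerate(cb_coords):
        n = 0
        for j, other in enumerate(cb_coords):
            if i != j and math.dist(center, other) <= radius:
                n += 1  # the residue itself is excluded by the i != j test
        counts.append(n)
    return counts


if __name__ == "__main__":
    # Made-up coordinates, far too few atoms for a real protein.
    coords = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
    counts = neighbor_counts(coords)
    print(counts, [burial_letter(c) for c in counts])
```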
Currently, the SAM-T02 query page only accepts a single sequence in FASTA format. In FASTA format, a sequence must have a unique name identifying the sequence in addition to the sequence residues. The name starts with a > (greater-than) character at the beginning of the line and continues to the first white space on the line. The rest of the name line is a comment, which is ignored. The sequence itself starts on the next line following the name line. The FASTA file should have the sequence itself in uppercase, though all-lowercase sequences will be accepted. Eventually we'll accept a2m alignments as input, in which case upper and lowercase distinctions will matter.
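The FASTA layout described above can be checked before submission with a short parser. This is a sketch, not the server's own input handling; the function name `parse_single_fasta` is hypothetical.

```python
def parse_single_fasta(text: str):
    """Split a single-sequence FASTA record into (name, comment, sequence).

    The name runs from just after ">" to the first white space; the rest
    of that line is the comment; all following lines are the sequence.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' name line")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("only a single sequence is accepted")
    parts = lines[0][1:].split(None, 1)
    if not parts:
        raise ValueError("the name line must contain a sequence name")
    name = parts[0]
    comment = parts[1] if len(parts) > 1 else ""
    sequence = "".join(lines[1:])
    return name, comment, sequence


if __name__ == "__main__":
    record = ">seq1 my comment here\nMKV\nLAA\n"  # made-up record
    print(parse_single_fasta(record))
```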
The SAM-T02 query page returns multiple alignments in FASTA, pretty-printed, and HTML formats, and pairwise alignments of the target to the best-scoring template candidates in FASTA and .al (CASP) format. The FASTA format is really our a2m format, which the SAM tools understand but some conversion tools misinterpret. The SAM package includes the "prettyalign" program, which can be used to add extra dots to an alignment, making it easier for tools that don't understand the a2m format to convert it to other formats.
The only graphical results from SAM-T02 are the sequence logos, which are all in EPS (Encapsulated PostScript) format. There are many programs available for viewing EPS files---which one you choose is largely a question of what platform you run on. The most popular viewer for Unix machines is the free "ghostview" program (also available for MS-Windows and Macs): http://www.cs.wisc.edu/~ghost/gsview/index.htm
If you wish to see a 3D model of the predicted protein, you have to convert the alignment to a 3D structure. Although we are working on tools to do this, they are not yet ready for release. In the meantime, your best bet is to take the .al files and submit them to the AL2TS server: http://predictioncenter.org/local/al2ts/al2ts.html. That server accepts only its own ".al" format, which we provide only for the alignments based on the 2-track-protein-STR hidden Markov models (these are usually the best alignments).
Most of the sequence IDs in a SAM-T02 a2m file come from the IDs in the NR database. The sequence IDs may be modified by SAM to indicate the first and last sequence positions that matched the SAM-T02 HMM.
For example in the following sequence ID taken from a SAM-T02 alignment,
>gi|16080670|ref|NP_391498.1|_1:234 (NC_000964) similar to hypothetical
proteins [Bacillus subtilis] gi|7450240|pir||G70067 conserved hypothetical
protein ywqL - Bacillus subtilis gi|1894750|emb|CAB07450.1| (Z92952)
product similar to E.coli YjaF protein [Bacillus subtilis]
gi|2636142|emb|CAB15634.1| (Z99122) similar to hypothetical proteins
[Bacillus subtilis]
the original sequence name gi|16080670|ref|NP_391498.1|
has had _1:234
appended to indicate that the SAM-T02 HMM for the alignment matched
the sequence starting at sequence position 1 and ending at sequence
position 234.
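The appended range can be separated from the original name with a regular expression. This sketch assumes the suffix always has the form `_start:end` at the very end of the ID (the first white-space-delimited token of the name line), as in the example above; the function name is hypothetical.

```python
import re

# Original name, then "_", the first matched position, ":", the last matched position.
_RANGE_SUFFIX = re.compile(r"^(?P<name>.+)_(?P<start>\d+):(?P<end>\d+)$")


def split_sam_id(seq_id: str):
    """Return (original_name, start, end); start and end are None if no suffix."""
    match = _RANGE_SUFFIX.match(seq_id)
    if match is None:
        return seq_id, None, None
    return match.group("name"), int(match.group("start")), int(match.group("end"))


if __name__ == "__main__":
    print(split_sam_id("gi|16080670|ref|NP_391498.1|_1:234"))
```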
You mentioned that FASTA, BLAST, and PSI-BLAST found a high-scoring similar sequence that SAM-T02 did not find. This happens fairly often---the most common causes are composition bias and large helices (particularly coiled-coils). The programs FASTA, BLAST, and PSI-BLAST can all be fooled into reporting very strong scores for sequences whose only similarity is that they both have long amphipathic helices. SAM-T02's reverse-sequence-null model cancels this signal (as well as composition bias and length signals), resulting in a method with many fewer false positives. A few true positives are lost, but not too many.
As an example, the leucine zipper 1ce0A gets only 25 sequences in the 1ce0A.t02.a2m alignment. The 19 PDB sequences in the alignment are all homologs (at least, similar structure and somewhat similar sequence). Other methods are likely to get almost any coiled-coil as a strong hit. This is an example of the reverse-sequence-null model removing a lot of trash (and possibly some good stuff) due to helicity signals.
Another common problem is the cysteine-rich metallothionein appearing in searches for proteins that have highly conserved cysteines---even ones with very different structure and function. SAM-T02 only includes metallothionein when almost all the cysteines line up.
Note: the compositional corrections to PsiBlast in August 2002 made the PsiBlast multiple alignments almost as good as the SAM-T02 alignments---the contamination by unrelated sequences was greatly reduced.
If your protein is a large, multi-domain protein, your best bet is to break it up into pieces (near domain boundaries is best, if you can guess where those are). Protein structure prediction generally works better on single domains in any case.
The SAM-T02 method builds models the size of the input sequence. Finding domain boundaries when no structure is known is an art that we have not attempted to automate (though other researchers have).
We have generally found it best to do a search first with the full-length protein, then remove any domains that are strongly predicted, and do the prediction again on what is left. A weaker prediction for a second domain may be masked by strong predictions for the more easily found domain in the full-length protein.
Failure to conserve an active-site residue could mean several things:
We have done some tests of SAM-T99 (which is very similar to SAM-T02 for constructing multiple alignments) as a multiple aligner (using the BAliBase test suite), and found that the alignments produced by SAM-T99 are about as good as those produced by CLUSTAL. You can try realigning them with other multiple aligners (such as CLUSTAL, PRRP, or DIALIGN), but it is probably a good idea to thin the alignment to a few diverse sequences first, since those aligners get very slow when given many sequences. If the alignment changes dramatically, then there is good reason to suspect the alignment.
Currently, the best multiple alignment method we know of is T-Coffee, but its run time is proportional to the cube of the number of sequences being run, so most SAM-T02 alignments would need to be drastically pruned before being realigned with T-Coffee. Another one that we have heard is quite good and fast (supposedly better than T-Coffee, and fast enough to run on alignments of thousands of sequences) is Muscle at http://www.drive5.com/muscle.
We have not yet had much success with our attempts to score HMMs against HMMs, though we have not, of course, tried all possible algorithms.
Our best method so far is to combine the results of scoring all template sequences against a target HMM and the target sequence against all template HMMs. There is probably a better method, and some people have had success with profile-profile alignment, but (so far) it has not worked well in our hands.
Yes, we encourage you to download a copy of the SAM software. It's available free for academic use at the SAM web site. You may find additional functionality you need with the full SAM package that is not currently accessible through our web site.
SAM-T99 and SAM-T02 have not been optimized for transmembrane predictions. They are "ok" on transmembrane predictions, but not nearly as good as tools optimized for that task. We've been told that the TMHMM server is currently the best predictor for transmembrane helices, but we've not done any tests ourselves.
The probabilities returned by the SAM-T02 server are from neural nets.
The neural nets were trained to maximize
sum_examples log Phat(correct letter | example)
where Phat is the neural net output of predicted probability for a letter.
The calibration has been checked and is pretty good. That is, in the cases where the neural net has said that the probability of helix is 0.80, there really is a helix there about 80% of the time. The cost function used in training makes the calibration very tight on the training set.
Of course, the neural net is using a multiple alignment as an input, so if the target sequence is misaligned or has a different structure from the sequences it is aligned to, the neural net can produce a confident, but incorrect, secondary structure prediction.
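The training objective and the calibration check described above can be sketched as follows. This is an illustration only, not the code used to train or test the SAM-T02 neural nets: the binning into ten equal-width probability ranges is an assumption, the function names are hypothetical, and the numbers in the example are made up.

```python
import math


def log_likelihood(p_correct):
    """Sum over examples of log Phat(correct letter | example)."""
    return sum(math.log(p) for p in p_correct)


def calibration_table(probs, outcomes, n_bins: int = 10):
    """Compare predicted probabilities with observed frequencies.

    probs[i] is the predicted probability of some class (e.g. helix) and
    outcomes[i] is True if that class was actually observed. Returns one
    (mean predicted probability, observed frequency, count) tuple for each
    non-empty bin.
    """
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, observed in zip(probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[b] += p
        hits[b] += 1 if observed else 0
        counts[b] += 1
    return [(sums[b] / counts[b], hits[b] / counts[b], counts[b])
            for b in range(n_bins) if counts[b] > 0]


if __name__ == "__main__":
    # Made-up data: five positions predicted as helix with probability 0.80.
    probs = [0.8, 0.8, 0.8, 0.8, 0.8]
    outcomes = [True, True, True, True, False]
    print(calibration_table(probs, outcomes))
    print(log_likelihood([0.8, 0.8, 0.8, 0.8, 0.2]))
```

For a well-calibrated predictor, the first two numbers in each tuple are close to each other.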