Information on previous versions follows the program descriptions.
See Section 11.5.
July, 2005.
- Addition of guide sequence labeling (dbguide) to probabilistic sequences
read from SAM HMM files when HMMs are used as database sequences.
Probabilistic sequences are presently being tested; users may wish to
wait until version 3.6 to make use of this feature.
See Section 7.3.
- The randomize variable, which indicates the number of
synthetic sequences to generate from the model as a means of
introducing random noise, has been reduced from 50 to 5. This parameter
was originally set at 50 when very large training sets were used with
SAM. Because of SAM's improved performance on small sets, we have
reduced the number.
- Numeric alphabets to enable more than 25 characters.
See Section 7.1.2.
- The simple_threshold variable was previously used in the
internal iterations of multdomain, sometimes reducing the number of
hits when reverse null models were used. This has been corrected.
See Section 10.2.5.
- Codon models. An experimental HMM architecture that enables the
mixing of codon and nucleotide states. Not fully tested.
See Section 8.4.4
and Section 7.1.
- The default value of simple_threshold has been increased
from 0 to 10000. The variable controls when the reverse null model is
calculated. At the setting of 0, only sequences with 0 or better simple
null model scores are scored with the reverse null model, reducing by
half the amount of time required for a database search. However,
since the reverse null model is far more sensitive, this could result
in missed sequences. We have therefore changed the default to be the
most rigorous search, and leave it to the user to decide whether or
not to reduce simple_threshold to 0. See Section 10.2.2.
- The manual includes a new section on reducing hmmscore
execution time. See Section 10.2.2.
- A new type of surgery, sequence surgery, has been added to
buildmodel. When a model is being built to match a
canonical sequence, a single-sequence a2m alignment of that sequence
can be provided to buildmodel to guide surgery. See Section 9.2.1.
- The default value of SW has been changed to 2, meaning
that by default local scoring and alignment are performed.
See Section 10.1.2
and Section 10.2.4.
- A new variable SW_train, with a default of 2, controls
global or local training in buildmodel. To prevent model creep
with frequency-based surgery (the default), the initial and final nodes
of a model are never deleted in the surgery procedure when SW_train is set to local
(2) or domain (3).
- Background distributions for internal or user-defined alphabets
can now be specified from a file with the alphbackfile option.
See Section 7.1.3.
- A sign error in score adjustment of simple null models was
corrected, and the default adjustment by log of the
sequence length was scaled from 1.0 to 1.5. See Section 10.2.1.
- The uniqueseq program previously removed duplicate
sequences from alignments before performing alignment-based (percentid) thinning. The default now is to perform thinning
based only on the alignment. If the prior behaviour is desired, set aligncheckonly to 0. See Section 10.12.9.
- The verbose option default has been changed to 0,
reducing the number of messages hmmscore and uniqueseq
produce by default. For example, the progress dots hmmscore
prints are no longer the default.
- Support for a variant NBRF format has been added, in which
sequence identifiers are specified with `>' and sequence lines
are prefaced with digits specifying the index of the first character.
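
The simple_threshold gating described in the list above can be pictured
with a short sketch. This is not SAM code: the function name and the
reverse_scorer callback are hypothetical, and the comparison assumes that
lower scores are better, which is what "0 or better" together with the
move from 0 to 10000 suggests.

```python
from typing import Callable, Optional, Tuple


def score_sequence(
    simple_score: float,
    reverse_scorer: Callable[[], float],
    simple_threshold: float = 10000.0,
) -> Tuple[float, Optional[float]]:
    """Return (simple_score, reverse_score), with None when the gate is closed.

    Assumption: lower scores are better, so a sequence passes the gate
    when simple_score <= simple_threshold.
    """
    if simple_score <= simple_threshold:
        # The expensive reverse null model is only evaluated for sequences
        # whose simple null model score is at or better than the threshold.
        return simple_score, reverse_scorer()
    return simple_score, None
```

With a threshold of 0 in this sketch, a sequence whose simple score is 5.0
never reaches the reverse scorer; with the default of 10000 it does.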
July 2003.
- The default internal protein prior library has been changed from
recode4.20comp to our current favorite recode3.20comp.
See Section 8.1.1.
- Posterior-decoded alignment has been made more efficient. Also, the
maxposdecodemem variable sets the maximum amount of memory to
use for the required very large dynamic programming matrix.
Sequences that are too long for alignment within this matrix are
aligned to the HMM using the Viterbi algorithm.
- Addition of not_id for selecting sequences not to be
used. id and not_id are used whenever a database is
loaded with db. See Section 7.4.
- Corrections to PAUP alignment file reading. The multitrack
length check is now performed after removing non-sequence characters.
- New programs sam2psi and psi2sam for converting to
and from PSI-BLAST checkpoint files, losing information about
insertion states and transition probabilities. See Section 10.10.6.
- The SAM-Target2K model building script target2k is included in the
distribution, as well as the previous target99 script. The
documentation has largely been updated to discuss target2k,
though some of the examples are based on target99 or use
obsolete model building scripts. target99 is no longer
supported. See Section 4.
- The number of SAM-T99/T2K model building scripts has been
reduced. Based on our experiences, we have eliminated all the ``fw''
scripts, ``varh50'', and ``fh'' scripts. Choices are now reduced to 3
basic model building scripts, respectively for tuning models to find remote
potential homologs, close potential homologs, and very close potential
homologs. See Section 4.9.
- A buildmodel error that caused transitions into node 1 of a
model to be incorrectly tabulated has been fixed. This error was
introduced in July 2002, and also affected the surgery procedure,
causing too many new nodes to be added to a model, destroying the
effectiveness of surgery.
December 2002.
- Probabilistic sequences read from SAM HMM file
format. See Section 7.3.
- Scaled reverse null model scoring.
- Several new local structure alphabets are
available. The listalphabets program provides detailed
information on all internal alphabets and regularizers. See Section 7.1
and Section 10.12.1.
- Posterior-decoded alignment has been made more efficient. Also, the
maxposdecodemem variable sets the maximum amount of memory to
use for the required very large dynamic programming matrix.
Sequences that are too long for alignment within this matrix are
aligned to the HMM using the Viterbi algorithm.
- The fragfinder program will find short, gap-free sequence
fragments that strongly match a model. See Section 10.4.
- The pathprobs numerical output has changed, and can now
trim alignments based on path probabilities. See Cline,
Karplus, & Hughey, Bioinformatics 18(2):306-314, 2002.
See Section 10.8.
December 2001.
- Model libraries. Model libraries may now be specified and
scored with hmmscore. Each member of a model library is a set of
SAM settings, a model, and a model name. RDB files now always provide
model names. Model libraries can be calibrated (highly recommended)
as an option to hmmscore. See Section 10.2.10.
- The genseq program will generate random sequences based on
a regularizer or Dirichlet mixture regularizer. See Section 10.12.3.
- The makelogo program is a new viewing tool for SAM
models. See Section 10.10.4.
- The new get_fisher_scores program outputs Fisher score vectors for
input to an external discriminative learning program.
See Section 10.6.
- Posterior decoded alignment has been found to have a slight edge
over Viterbi alignment. However, the current implementation of this
algorithm requires a very large dynamic programming matrix that may
be beyond the memory limits of the platform. New to this version,
sequences that are too long for the posterior decoded alignment
calculation will be aligned to the model using the Viterbi algorithm.
See Section 10.1.
- Complex null models are no longer supported. An error in hmmscore that ignored user's null models has been fixed. See Section 10.2.
- The ability of hmmscore to read and sort score files,
undocumented for the last several versions, has been removed.
- The postscript header file used by drawmodel is now
produced directly by the program. The SAM_PS environment variable,
which previously indicated the location of the postscript header, has
been eliminated. See Section 10.10.2.
- For proteins, the recode4.20comp prior has been made an
internal default so that buildmodel will use it even without proper
setting of the PRIOR_PATH environment variable. See Section 8.1.1.
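
The posterior-decoding fallback in the list above amounts to a size check
before alignment. The sketch below is only an illustration: the function
name, the max_bytes limit, the bytes_per_cell figure, and the assumption
that the matrix has one cell per (sequence position, model node) pair are
all hypothetical and are not taken from the SAM source.

```python
def choose_alignment_method(
    seq_len: int,
    model_len: int,
    max_bytes: int,
    bytes_per_cell: int = 8,
) -> str:
    """Pick 'posterior' when the full matrix fits in memory, else 'viterbi'.

    Assumption: posterior decoding needs a full (seq_len + 1) x
    (model_len + 1) dynamic programming matrix.
    """
    cells = (seq_len + 1) * (model_len + 1)
    needed_bytes = cells * bytes_per_cell
    if needed_bytes <= max_bytes:
        return "posterior"
    # Sequence is too long for the posterior matrix: fall back to Viterbi.
    return "viterbi"
```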
October 2000
- Null sequences are now properly scored and aligned by hmmscore and other programs. The buildmodel program
automatically removes null sequences from the training set.
See Section 7.4.
- In the hmmscore program, setting many_files to 2
causes the score file to be printed to standard output rather than
to the normal .dist file, setting it to 4
causes .mstat files to be
printed to standard output, and setting it to 6 sends both to standard output with
undefined interleaving. This variable is treated as a binary
bit-vector, so setting it to, for example, 3 will result in buildmodel generating many files and hmmscore sending .dist
output to standard output. See Section 10.2.3.
- Martin Madera and Julian Gough have written a perl converter between
SAM and HMMer 2.0 formats that can be downloaded from the SAM WWW page
or, for the most up-to-date copy, from
http://www.mrc-lmb.cam.ac.uk/genomes/julian/convert/convert.html.
- Problems with Irix 64 distribution corrected.
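
The bit-vector reading of many_files in the list above can be made
concrete with a small decoder. The bit assignments (1, 2, 4) are inferred
from the two worked values in the text, 3 and 6, and should be treated as
an assumption; the function and key names are hypothetical.

```python
# Assumed bit assignments, inferred from the examples 3 and 6 in the text.
BUILDMODEL_MANY_FILES = 1  # buildmodel generates many files
DIST_TO_STDOUT = 2         # hmmscore score (.dist) output goes to stdout
MSTAT_TO_STDOUT = 4        # hmmscore .mstat output goes to stdout


def decode_many_files(value: int) -> dict:
    """Split a many_files setting into its three assumed flags."""
    return {
        "buildmodel_many_files": bool(value & BUILDMODEL_MANY_FILES),
        "dist_to_stdout": bool(value & DIST_TO_STDOUT),
        "mstat_to_stdout": bool(value & MSTAT_TO_STDOUT),
    }
```

Under these assumptions, decode_many_files(3) turns on the buildmodel flag
and .dist output to standard output, and decode_many_files(6) sends both
.dist and .mstat output to standard output.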
July 2000
- Secondary structure alphabets based on DSSP labels. See Section 7.1.
- Multi-track HMMs. The hmmscore and align2model
programs can now make use of multi-track HMMs and sequences. For
example, a set of protein sequences with associated secondary
structure sequences can be scored or aligned to a protein
model with character emission probabilities calculated according to
the protein model and a second secondary structure model
(track). See Section 10.2.6.
- The predict_track program can be used to
make consensus predictions from a primary sequence alignment about a
secondary structure track. The program is experimental and not yet
fully optimized. See Section 10.9.
- The pathprobs program can be used to generate the
posterior probabilities of each character in an alignment given an
alignment and a model, in RDB format and (with reduced information)
a2m format. Changes to accommodate this format in prettyalign
mean that a2m files with compressed insertions (as generated by
prettyalign) will not be read as alignments. In general, prettyalign output should not be used as an input format.
See Section 10.8.
- The SAM-T99 target99 script now supports NCBI Blast 2 as
well as WU-Blast. SAM-T99 has been developed using WU-Blast; results
with Blast2 will differ slightly. See Section 4.
- An error that caused the regularizer_file to be read for
a user's null model rather than the nullmodel_file has been
fixed.
- Previous to this release, buildmodel local training
incorrectly always performed global alignment to a model with FIMs on
both ends. This has been corrected. To approximately duplicate
previous work with local training settings, use sw 3. See Section 9.6.
- Simple null model selection has been improved. If FIM tables
are automatically set (i.e., FIM_method_score is positive),
the null model emission probabilities correspond to that table,
whether or not any FIMs are present in the model. If it is negative,
the null model is taken from the specified source, and any
automatically-added FIMs are taken from that source as well. If it is
zero, the null model corresponds to the first FIM present in the
model, or to the geometric average of the match states. The null model transition
probability is set according to the value of fimtrans. The
null model probabilities are affected by fimstrength whether or
not there are FIMs in the model. This eliminates the previous
inconsistencies that arose from using the first insert node as a null
model whether or not it was a FIM. SAM is now quite explicit when
insert/FIM tables are changed and when FIMs are added.
- The viterbi_threshold can be used with hmmscore to
prefilter sequences with a Viterbi NLL-NULL calculation before
performing the more expensive (and more sensitive) all-paths (EM
style) calculation. See Section 10.2.1.
- Previously, when E-values were calculated but a sequence did not
have a reverse-null-model score due to simple_threshold, the
E-value was calculated from the simple null model, leading to
incomparable E-values. Now, when the reverse score is not calculated,
the E-value is reported as the maximum possible E-value (database
size). See Section 10.2.1.
- SAM now includes a dedicated FASTA reader that is far quicker
than the more flexible readseq package. Sequence I/O, measured
by running checkseq, has been sped up by a factor of 10. Sequence
memory use can be reduced by unsetting the keepannotations
parameter.
- Models can be created to enable hmmscore to perform Smith
& Waterman alignment and scoring. SAM will calculate E-values using
the reverse-sequence null model. See Section 10.2.8.
- Default value of segment_size has been increased from
100 to 1000 sequences so that 0.5-1MB of protein sequence data
is in memory at any one time. For particularly long sequences, you
may wish to reduce it.
- Addition of adpstyle, the dynamic programming style used
for alignments and multiple domain alignments. Posterior-decoded
alignment based solely on character emission posteriors is now
available. Scoring according to either posterior alignment option can
also be performed. See Section 10.1.
- Reoptimization of inner loop checkpoint placement. The default
setting of maxmem has been increased to use up to 20MB of memory
for dynamic programming.
These changes give a speedup for buildmodel
and sequence alignment. Posterior-decoded alignment does not yet use
reduced space.
- Addition of the randseq program, which can be used to
randomly select sequences from a database, and splitseq which
can split a database according to sequence length (particularly useful
to filter sequences that can cause posterior-decoded alignment to run
out of memory).
See Section 10.12.5
and Section 10.12.6.
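
The simple null model selection rule in the list above is a three-way
branch on the sign of FIM_method_score. The sketch below restates that
rule only; the function name and the returned descriptions are
hypothetical labels, not SAM output.

```python
def null_model_source(fim_method_score: int, model_has_fim: bool) -> str:
    """Describe where the simple null model emissions come from."""
    if fim_method_score > 0:
        # FIM tables are set automatically; the null model follows that
        # table whether or not the model contains any FIMs.
        return "automatic FIM table"
    if fim_method_score < 0:
        # Null model (and any automatically added FIMs) come from the
        # specified source.
        return "specified source"
    # Zero: use the first FIM in the model if there is one, otherwise the
    # geometric average of the match states.
    if model_has_fim:
        return "first FIM in model"
    return "geometric average of match states"
```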
October 1999
- The SAM-T99 iterative method for remote homology detection.
This is the vastly preferred method for building an HMM from a single
protein sequence and for weighting sequences when an alignment is
available. See Section 4
and Section 9.4.3.
- Inclusion of the view_pdoc program for viewing the
posterior decoded alignment. This may assist in checking alternate
paths in an alignment. See Section 10.5.
- The uniqueseq program can now be used on alignments to
eliminate sequences that match other sequences in the alignment.
See Section 10.12.9.
- The hmmscore program will calculate E-values for
reverse-sequence null model scoring (scores better than 1e-300
are reported as 1e-300). Internal Z-scoring has been eliminated.
Score data can be output in the RDB format. See Section 10.2.
- Also in hmmscore, the reverse null model is now the
default null model calculation. This doubles runtime over the simple
null model, but is more accurate. See Section 10.2.
- The default null model (as well as all FIMs) now includes by
default a self-loop transition probability equal to the geometric
average of the match to match transitions in the HMM (fimtrans is 1.0). The way in which insert to insert arcs are set for
negative values of fimtrans has changed. See Section 8.5.
- Regardless of input format, selected sequence output is always
in FASTA format. Sequence annotation lines are now preserved in
sequence output files (sel, a2m, and mult files), and are truncated to
the first 50 characters in dist and mstat files.
- In buildmodel, constrained training is now supported.
Specific residues can be constrained to specific model nodes during
training. This serves as a method of incorporating prior knowledge
about the training sequence, such as structurally similar
regions. See Section 9.7.
- The buildmodel seed parameter has been renamed
randseed to avoid confusion with the SAM-T99 seed alignment
parameter.
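
The self-loop probability mentioned in the fimtrans item above is a
geometric average of the match-to-match transition probabilities. A
minimal sketch of that average follows; the function name and the list of
probabilities are hypothetical, and how SAM scales the result for other
values of fimtrans is not shown here.

```python
import math
from typing import Sequence


def geometric_mean(probs: Sequence[float]) -> float:
    """Geometric mean of strictly positive probabilities."""
    if not probs:
        raise ValueError("need at least one probability")
    if any(p <= 0.0 for p in probs):
        raise ValueError("probabilities must be strictly positive")
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(p) for p in probs) / len(probs))
```

For match-to-match probabilities of 0.25 and 1.0, the geometric mean is
0.5, lower than the arithmetic mean of 0.625.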
SAM
sam-info@cse.ucsc.edu
UCSC Computational Biology Group