next up previous contents
: 11 System installation : SAM (Sequence Alignment and : 9 The buildmodel estimation   Ìܼ¡


10 Related programs


10.1 align2model and prettyalign

After you have obtained a model of your sequence family, align2model can be used to give a multiple alignment of sequences. Often one is just interested in aligning the sequences that were also used to train the model, but in principle any sequence can be included in the alignment.

The multiple alignment is obtained by aligning each of the sequences to the model by the Viterbi algorithm. This has the advantage that it can be done for each sequence independently, and therefore it is very simple to add new sequences to the multiple alignment. Also, once the model is found, the multiple alignment is very fast and easy to produce.

The program to produce the basic alignment is called align2model. Calling it with no arguments gives a brief explanation. To align all the sequences in trna10.seq use the command:


align2model trna10 -i test.mod -db trna10.seq
This will put an intermediate form of the alignment in the file trna10.a2m. In this FASTA-compatible intermediate format deletions are shown as dashes (`-') and insertions (produced in the insert states of the model) are shown as lower case characters, while periods (`.') are used to fill in the sequences that did not have any insertions if a2mdots is set to 1, the default.

In case you only want to align a few sequences in a large file, you can specify the identifiers of these sequences on the command line. For instance


align2model trna2 -i test.mod -db trna10.seq -id TRNA1 -id TRNA9
will only align the two specified sequences. The output (in trna2.a2m) would look like this:

>TRNA1
ggGGAUGUAGCUCAG-.UGGUAGAGCGCAUGCUUCG.CAUGUAUGAGGcC
CCGGGUUCGAUCCCCGGCAUCUCCA----
>TRNA9
cgGCACGUAGCGCAGCcUGGUAGCGCACCGUCCUGGgGUUGCGGGGGU.C
GGAGGUUCAAAUCCUCUCGUGCCGACCA--

The not_id option can be used to exclude specific sequences from being aligned, and has precedence over the id option.

The Viterbi algorithm finds the single best path for the alignment -- that which has the highest probability. The SAM system can also find the single most likely path. Internally, this corresponds to using the EM algorithm to evaluate all possible paths, calculating the probability of each residue appear in each model state and the probability of use for each transition (the posterior probabilities). A second (Viterbi-style) dynamic programming is then performed on these probabilities to find the most likely path (see, for example, Holmes and Durbin in Recomb98).

SAM includes two flavors of posterior-decoded alignment. When adpstyle is 4, posteriors are calculated using a global (SW is 0) forward calculation. Then, the minimizing path through the transition and character emission posteriors is calculated using the specified SW mode.

When adpstyle is 5, posteriors for character emissions are calculated using the specified global or local SW mode. Then, the minimizing path for character emission posteriors only is calculated using a global dynamic programming mode.

These alignment methods are much slower than standard Viterbi alignment, and have not yet been implemented for reduced memory use. The maxposdecodemem specifies the numbers of bytes that may be allocated in to this calculation. If the sequence is too long to perform this alignment calculation in the space specified by maxposdecodemem, a warning messages printed and the Viterbi algorithms used.

10.1.1 prettyalign

To get a nice display of the alignment produced by align2model, you can use the prettyalign program, which has several display options. The program reads from a file like the one made in the example above:


prettyalign trna10.a2m > trna10.pretty
which would give you an alignment similar to the one shown in the Section 3

Prettyalign does not follow SAM's normal commandline format. To see an explanation of the various options, run the program with some invalid option (like prettyalign -h). Some of the most useful options are

-f
Print in a FASTA-like format.
-i
Do not include sequence identifiers in front of each line.
-l num
Set the output line length equal to num.
-n
Toggle indexing the sequences, as well as labeling them.
-c
Toggle column numbering.
-m
Set maximum insertion length (longer insertions are printed as their length).
-I
I-G style alignment. Also sets maximum insertion length very high.
-b
Generate BLAST-style alignments.
The commands

align2model  trna3 -i test.mod -db trna10.seq -id TRNA1 -id TRNA2 -id TRNA9
prettyalign trna3.a2m -l 50 > trna3.pretty
gives the following output


                10            20        30        
                 |             |         |        
TRNA1 ggGGAUGUAGCUCAG-.UGG...UAGAGCGCAUGCUUCG.CAUG
TRNA2 gcGGCCGUCGUCUAGU.CUGgauUAGGACGCUGGCCUCC.CAAG
TRNA9 cgGCACGUAGCGCAGCcUGG...UAGCGCACCGUCCUGGgGUUG

       40          50        60        70                |           |         |         |        TRNA1 UAUG.AGGcCCCGGGUUCGAUCCCCGGCAUCUCCA---- TRNA2 CCAGcAAU.CCCGGGUUCGAAUCCCGGCGGCCGCA---- TRNA9 CGGG.GGU.CGGAGGUUCAAAUCCUCUCGUGCCGACCA--

The prettyalign program can compress long insertions to only the initial segment of bases in the insertion plus digits representing the total length of the insertion. For example, the sequence GacguacguG could be printed out as Ga8guG if 4 was the largest number of insertions that was to be allowed (note that the character 8 is using up one of the positions). By default, insertions of up to length ten thousand are fully printed. This can be changed with the -m flag to prettyalign, which sets the maximum number of insertions that are printed. If set to zero, no insertions are printed, and no indication of the lack is given. If less than zero, insertion characters are not printed, and that number of digits is used to indicate the length of each insertion.

The -I switch will create a compatible IG-style alignment file which may be converted to other formats using the readseq package included as a subdirectory of SAM. The -I option automatically sets a high value for the insertion length parameter.

The -b switch generate a BLAST-style output, comparing the first sequence in the alignment (by default) to each subsequent sequence, and generating pairwise alignments with a middle line highlighting identical residues and conservative substitutions. These pairwise alignments do not include terminal inserts: These pairwise alignments run from the first to last residue aligned by either sequence. They do not include terminal inserts; they include insernal inserts only if some sequence in the pairwise alignment is inserting residues. The -R option can be used to specify a different sequence to use as this reference sequence.


10.1.2 Aligning fragments and SW

Consider a model of 100 nodes and a fragment of 25 that very closely matches some contiguous section of the model. Even though that section would align very well, the overall alignment of the fragment could be quite poor because of its need to use 75 delete states in the model. The problem in this example of global alignment is that in addition to modeling conserved regions, the model also models the length of the conserved region.

SAM's SW variable can be used to control the alignment type. Global alignments can be forced by setting SW to 0. The default (2) is local alignment similar to Smith and Waterman method of sequence comparison, which will find the best alignment for any pair of subsequences within two sequences. The same can be done with models, allowing a submodel to match a subsequence. This type of dynamic programming specified with SW 2, the default. In this case, sequences can jump from the initial module (presumably a FIM, automatically added when auto_fim is set) into the match state of any module in the model, and can also jump out of the match state of any module within the model to the delete state of the next-to-last node. The first and next-to-last module are assumed to be FIMs, hence the rational is that a sequence will use the FIM for some period of time to consume characters that do not match the model, then the sequence will jump to the model node corresponding to the start of the fragment, use several model nodes, and then jump to the ending FIM to consume the rest of the sequence.

The probability of these jumps is set by the variables jump_in_prob and jump_out_prob, both of which have a default value of unity. That is, as in the sequence-to-sequence Smith and Waterman, there is no cost associated with jumping in and out of the model.

Setting SW 1 specifies a semi-local alignment. This option allows a sequence to start matching the model at any location (rather than only the begin node) and end at any location (rather than only the end node). This will improve alignment for short sequences that match a segment of the model. (Internally, no FIMs are added to the model and jumps are allowed).

For domains, SW can be set to 3 to match the full model to part of the sequence. (Internally, FIMs are added to the model but jumps are not allowed).

The file trna1frag.seq contains several sequences that contain part or all of TRNA1. The sequence include TRNA1 (72 bases), TRNA1Long (the complete tRNA with additional characters), Long (58 base segment of TRNA1), Medium (34 base segment), Short (6 base segment TRNA1), and AAMediumA, an embedding of Medium within several segments of As to bring it to 176 characters. Additionally, the file contains several (obviously) non-tRNAs of various lengths, all of whose IDs begin with the word `Not'.

When this file is aligned to the model test.mod, created above, the alignment of the sequence and fragments is reasonable, but the non-tRNAs still align the entire model and may even use internal insertion states. (As shall be seen in Section 10.2.4, the scoring of these fragments with the SW option off is not nearly so good as their alignments).


                              10        20        30        40         50  
                               |         |         |         |          |  
TRNA1         gg......GGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGcCCCGGGUU
TRNA1Long     aaa13aggGGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGcCCCGGGUU
Short         ........-----------------CAUGC---.----
ShortReverse  u.......--CGUA-------------------.----
Medium        ........----------GAGCGCAUGCUUCGCAUGUAUGAGG.CCCCGGGU
MediumReverse uug20uac--------------GCUUCGUACGCGAG--.----
AAMediumAA    aaa70aaa---------AGAGCGCAUGCUUCGCAUGUAUGAGG.CCCCGGGU
Long          ........----GCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGcCCCGGGUU
NotShort      a.......----------------------.----
Not           a.......----------------------.----
NotLong       a.......----------------------.----
NotExtraLong  a.......----------------------.----

                    60        70                                     |         |                TRNA1         CGAUCCCCGGCAUCUCCA----........ TRNA1Long     CGAUCCCCGGCAUCUCCA----aaa11aaa Short         -------------u....... ShortReverse  -------------c....... Medium        -------------u....... MediumReverse -------------........ AAMediumAA    UAAA-----------aaa68aaa Long          CGAUCCCCGGCAU------........ NotShort      -AAA-----------aaa9aaaa Not           -AAA-----------aaa69aaa NotLong       -AAA-----------aa105aaa NotExtraLong  -AAA-----------aa434aaa

Alignment with the SW option set to 1 is much the same (again, scoring will be improved), though the alignment procedure has managed to better isolate the AAMediumAA sequence's tRNA core, modeled by match states, from its prefix and postfix, modeling by internal insertion nodes.


                                10              20        30        40     
                                 |               |         |         |     
TRNA1         ggG........GAUGUAGCUCAG-UGG......UAGAGCGCAUGCUUCGCAUGUAUGAGGc
TRNA1Long     ..-aaa14gggGAUGUAGCUCAG-UGG......UAGAGCGCAUGCUUCGCAUGUAUGAGGc
Short         ..-........--------......----CAUGCU-------.
ShortReverse  ..-........--------......-------UCGUAC----.
Medium        ..-........--------......-GAGCGCAUGCUUCGCAUGUAUGAGGc
MediumReverse ..-........------UUGGgccccgGAGUAUGUACGCUUCGUACGCGAG--.
AAMediumAA    ..-........--------......--------------.
Long          ..-........---GCUCAG-UGG......UAGAGCGCAUGCUUCGCAUGUAUGAGGc
NotShort      ..-........--------......-------AAAAAAAAAAAAA.
Not           ..-........--------......--------------.
NotLong       ..-........--------......--------------.
NotExtraLong  ..-........--------......--------------.

                  50        60        70                                   |         |         |                TRNA1         CCCGGGUUCGAUCCCCGGCAUCUCCA----........ TRNA1Long     CCCGGGUUCGAUCCCCGGCAUCUCCAAAAAAAAaaaa.... Short         -----------------........ ShortReverse  -----------------........ Medium        CCCGGGUU-------------........ MediumReverse -----------------........ AAMediumAA    --------------AAAAAaa171aaa Long          CCCGGGUUCGAUCCCCGGCAU------........ NotShort      -----------------........ Not           --------------AAAAAaaa68aaa NotLong       --------------AAAAAaa104aaa NotExtraLong  --------------AAAAAaa433aaa

When alignment is performed using the SW option set to 2, only the core tRNA segments are aligned: the non-tRNAs, as well as the prefix and postfix of AAMediumAA are aligned to the FIMs that have been automatically added to the model. The one problem is that the Short sequence has made some use of the end FIM because it is not long enough to make a really significant hit to the model's internal nodes.


                              10        20        30        40         50  
                               |         |         |         |          |  
TRNA1         gg......GGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGcCCCGGGUU
TRNA1Long     aaa13aggGGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGcCCCGGGUU
Short         ........-----------------CAUGC---.----
ShortReverse  u.......--CGUA-------------------.----
Medium        ........----------GAGCGCAUGCUUCGCAUGUAUGAGG.CCCCGGGU
MediumReverse uug20uac--------------GCUUCGUACGCGAG--.----
AAMediumAA    aaa70aaa---------AGAGCGCAUGCUUCGCAUGUAUGAGG.CCCCGGGU
Long          ........----GCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGcCCCGGGUU
NotShort      a.......----------------------.----
Not           a.......----------------------.----
NotLong       a.......----------------------.----
NotExtraLong  a.......----------------------.----

                    60        70                                     |         |                TRNA1         CGAUCCCCGGCAUCUCCA----........ TRNA1Long     CGAUCCCCGGCAUCUCCA----aaa11aaa Short         -------------u....... ShortReverse  -------------c....... Medium        -------------u....... MediumReverse -------------........ AAMediumAA    UAAA-----------aaa68aaa Long          CGAUCCCCGGCAU------........ NotShort      -AAA-----------aaa9aaaa Not           -AAA-----------aaa69aaa NotLong       -AAA-----------aa105aaa NotExtraLong  -AAA-----------aa434aaa

A different alignment is produced when SW is set to 2 (fully local) and adpstyle is set to 4 (posterior-decoded alignment with transitions). As can be seen in the example below, this option does not neatly cut the sequences to their matching positions, but may produce a better core alignment.


                              10        20        30        40          50 
                               |         |         |         |           | 
TRNA1         gg......GGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCC.GGGU
TRNA1Long     aaa13aggGGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCC.GGGU
Short         c.......-----------------------.-.--
ShortReverse  u.......--CGUA-------------------.-.--
Medium        ........----------GAGCGCAUGCUUCGCAUGUAUGAGGC.CC.CGGG
MediumReverse uu......----------GGGCCCCGGA------GUAUG.UA.CGCU
AAMediumAA    aaa63aaa------AAAAAAAAGAGCGCAUGCUUCGCAUGUAUGAGGC.CCcGGGU
Long          ........----GCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCC.GGGU
NotShort      aaaa....-----------------------.-.--
Not           aaa18aaa-----------AAAAAAAAAAAAAAAAAAAAAAAA.AA.AAAA
NotLong       aaa52aaa-------------------AAAAA-.-.--
NotExtraLong  aa218aaa-----------------A-----.-.--

                     60        70                                      |         |                TRNA1         UCGAUCCCCGGCAUCUCCA----........ TRNA1Long     UCGAUCCCCGGCAUCUCCAAAAAAAAaaaa.... Short         -----------AUGCU........ ShortReverse  -------------c....... Medium        U-------------u....... MediumReverse UCGUACGC---------gag..... AAMediumAA    UAAAAAAAAAAAAA------aaa58aaa Long          UCGAUCCCCGGCA-------u....... NotShort      ---------AAAAAAAAA........ Not           AAAAA-----------aaa20aaa NotLong       -------------aaa52aaa NotExtraLong  -------------aa219aaa

As well as when SW is set to 2 (fully local) and adpstyle is set to 5 (posterior-decoded alignment solely on character emission).


                              10        20        30        40          50 
                               |         |         |         |           | 
TRNA1         ggg.....-GAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCC.GGGU
TRNA1Long     aaa14ggg-GAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCC.GGGU
Short         cau.....-----------------------.-.--
ShortReverse  ucg.....-----------------------.-.--
Medium        g.......----------AGCGCAUGCUUCGCAUGUAUGAGGC.CC.CGG-
MediumReverse uug18ugu-----------------------.-.--
AAMediumAA    aaa70aaa---------AGAGCGCAUGCUUCGCAUGUAUGAGGC.CCcGGGU
Long          gc......-----UCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCC.GGGU
NotShort      aaaaaaa.-----------------------.-.--
Not           aaa36aaa-----------------------.-.--
NotLong       aaa54aaa-----------------------.-.--
NotExtraLong  aa219aaa-----------------------.-.--

                     60        70                                      |         |                TRNA1         UCGAUCCCCGGCAUCUCCA----........ TRNA1Long     UCGAUCCCCGGCAUCUCCAAAAA--aaaaaaa. Short         -------------gcu..... ShortReverse  -------------uac..... Medium        -------------guu..... MediumReverse -------------acg16gag AAMediumAA    UAAA-----------aaa68aaa Long          UCGAUCCCCGGC-------au...... NotShort      -------------aaaaaa.. Not           -------------aaa37aaa NotLong       -------------aaa55aaa NotExtraLong  -------------aa219aaa

Here is the posterior-decoded (adpstyle 4) global alignment (SW is 0).


                                10        20        30        40         50
                                 |         |         |         |          |
TRNA1         ggG........GAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCCGGG
TRNA1Long     aa-aaa12gggGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCCGGG
Short         ..-........----------------------.---
ShortReverse  ..-........---------------UCGUAC-----.---
Medium        ..-........---------GAGCGCAUGCUUCGCAUGUAUGAGGC.CCCGG
MediumReverse ..-........--------UUGGGCCCCGGA------GUAUG.UACGC
AAMediumAA    ..A........AAAAAAAAAAAA---AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA
Long          ..-........---GCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCCGGG
NotShort      ..-........----------------AAAAA----.---
Not           ..A........AAAAAAAAAAAA--AAAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA
NotLong       ..A........AAAAAAAAAAAA---AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA
NotExtraLong  ..A........AAAAAAAAAAAA---AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA

                      60        70                                       |         |                TRNA1         UUCGAUCCCCGGCAUCUCCA----........ TRNA1Long     UUCGAUCCCCGGCAUCUCCAAAAAAAAaaaa.... Short         -----------CAUGCU........ ShortReverse  --------------........ Medium        GUU------------........ MediumReverse UUCGUA--------CGCGAG........ AAMediumAA    AAAAAA------AAAAAAAAAAaa115aaa Long          UUCGAUCCCCGGCAU------........ NotShort      --AAA--------AAAAA........ Not           AAAAAAAAAAAAAAAAAAAAAAAAAAA........ NotLong       AAAAAA------AAAAAAAAAAaaa48aaa NotExtraLong  AAAAAA------AAAAAAAAAAaa377aaa

Here is the posterior-decoded (adpstyle 5) global alignment (SW is 0).


                                10        20        30        40         50
                                 |         |         |         |          |
TRNA1         ggG........GAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCCGGG
TRNA1Long     aa-aaa12gggGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCCGGG
Short         ..-........----------------------.---
ShortReverse  u.-........-CG--------------------.---
Medium        ..-........---------GAGCGCAUGCUUCGCAUGUAUGAGGC.CCCGG
MediumReverse ..-........------UU--GGGCCCCGGAGU------AUG.UACGC
AAMediumAA    a.-........AAAAAAAAAAAA---AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA
Long          ..-........---GCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCCGGG
NotShort      a.-........---------AA-------AA-----.---
Not           a.-........AAAAAAAAAAAAA--AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA
NotLong       a.-........AAAAAAAAAAAA---AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA
NotExtraLong  a.-........AAAAAAAAAAAA---AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA

                      60        70                                       |         |                TRNA1         UUCGAUCCCCGGCAUCUCCA----........ TRNA1Long     UUCGAUCCCCGGCAUCUCCAAAAAAAAaaaa.... Short         -----------CAUGCU........ ShortReverse  ------------UAC........ Medium        G------------UU........ MediumReverse UUCGUA--------CGCGAG........ AAMediumAA    AAAAAA-----------aa125aaa Long          UUCGAUCCCCGGCA------U........ NotShort      ----------AAAAAAAA........ Not           AAAAAAA----------aaa20aaa NotLong       AAAAAA-----------aaa58aaa NotExtraLong  AAAAAA-----------aa387aaa


10.2 hmmscore

Any sequence can be compared to a model by calculating the probability that the sequence was generated by that model. Taking the negative (natural) logarithm of this probability gives the NLL score. For sequences of equal length the NLL scores measures how `far' they are from the model, and it can be used to select sequences that are from the same family. However, the NLL score has a strong dependence on sequence length and model length. Hmmscore provides several less biased means of scoring by reporting NLL scores as the difference between a null model and trained model NLL score (a log-odds score, as used in HMMER).

Null model scoring is discussed in more detail in the Barrett, Karplus, and Hughey paper mentioned in the introduction and available from the SAM WWW page.
(http://www.cse.ucsc.edu/research/compbio/papers/nullmod/nullmod.html).

The program hmmscore can find NLL and NLL$-$NULL (log-odds) scores. E-values can be calculated for the reverse sequence null model. The most common operation is to calculate NLL$-$NULL scores for a large number of sequences. This can be done by supplying the name of the model file and one or more sequence database files on the command line, optionally followed by hmmscore parameter specifications. For instance, for the example files described earlier the NLL scores are found the following way


hmmscore test -insert test.mod -db trna10.seq -sw 2

Produces the file test.dist already displayed:


%  SAM: hmmscore v3.5 (July 15, 2005) compiled 07/15/05_11:25:31
%  (c) 1992-2001 Regents of the University of California, Santa Cruz
%
%          Sequence Alignment and Modeling Software System
%         http://www.cse.ucsc.edu/research/compbio/sam.html
%
% --------- Citations (SAM, SAM-T2K, HMMs) ----------
% R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
%  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
% K. Karplus, et al., What is the value added by human intervention in protein
%  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
% A. Krogh et al., Hidden Markov models in computational biology:
%  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
% -----------------------------------
%  test   Host: peep    Fri Jul 15 13:45:00 2005
%  rph    Dir:  /projects/kestrel/rph/sam32/SAMBUILD/peep/demos
% -----------------------------------
% Inserted Files:  test.mod
% Database Files:  /projects/kestrel/rph/sam32/demos/trna10.seq
%
% Subsequence-submodel (local) (SW = 2)
% Simple scores adjusted by +1.5*ln(seq len) (adjust_score = 2)
% Track 0 FIMs added (geometric mean of match probabilities (6))
% Single Track Model:  test.mod
% Score DP Method: forward all-paths (dpstyle = 0)
% Align DP Method: viterbi (adpstyle = 1)
% 10 sequences, 747 residues, 78 nodes, 0.01 seconds
%
% Sequence scores selected:  All  (select_score=8)
%
% Simple: NLL-NULL using FIM probabilities
% Reverse: NLL-NULL for the reverse sequence NULL model
%    Calculated when Simple < simple_threshold (10000.00)
% E-value on N (=10) sequences:
%    N / (1 + exp(-(lambda(=1.0000) * Reverse)**tau(=1.0000)))
%    Calculated when Simple < simple_threshold (10000.00)
%    Rescale E-values or use -dbsize for multiple scoring runs.
%    WARNING:  E-VALUES ARE NOT CALIBRATED!
% Scores sorted by E-value, best first
%
% Sequence ID   Length      Simple     Reverse     E-value      X count
TRNA7              73       -36.19      -28.50     4.18e-12
TRNA2              76       -35.48      -28.13     6.08e-12
TRNA4              75       -35.66      -27.90     7.65e-12
TRNA9              77       -35.54      -27.82     8.29e-12
TRNA10             76       -35.21      -27.69     9.39e-12
TRNA1              72       -34.85      -27.32     1.37e-11
TRNA8              75       -36.32      -27.26     1.44e-11
TRNA5              73       -35.10      -26.36     3.58e-11
TRNA3              76       -34.43      -26.26     3.93e-11
TRNA6              74       -34.59      -23.81     4.56e-10

As discussed in Section 3.4, the score file contains six columns. The first is the sequence identifier, followed by sequence length, the `NLL-NULL' score using a simple null model. The next score column is either the raw NLL score if only the simple null model is calculated (the default), or the complex or reverse sequence null model's `NLL-NULL' if one of the more time-consuming null models is used, as discussed below. If E-values are calculated, they are listed after the two score columns. Last, the number of all-character wildcards in each sequence is listed for those that have any wildcards.

By default, hmmscore uses the EM scoring method, just as is used to train a model. If desired, scores can be based on exact alignment to the model, multiplying the probabilities along the best path rather than all paths. This method, which corresponds to the forward half of align2model, can be turned on by setting viterbi to 1. Viterbi scoring is appropriate for finding out how good a sequence's best alignment to a model is.

The E-value computation is based on a simple, but somewhat unrealistic, assumption that the scores for the sequence and the reversed-sequence are independent draws from an extreme-value (Gumbel) distribution:

\begin{displaymath}P( \mbox{score} < a) \approx 1- e^(-k e^{\lambda a}) \ .
\end{displaymath}

Subtracting the two scores (derivation not published yet) gives

\begin{displaymath}P( \mbox{diff} < a) \approx {1 \over 1+ e^{-\lambda a}} \ .
\end{displaymath}

The E-value is the expected number of sequences with that good a score, so is simply the probability of seeing that negative a difference, multiplied by the number of sequences scored (or dbsize, if that is specified).

The only parameter that needs to be estimated is the natural scaling $\lambda$. Since we use natural logs in computing our probabilities, we set $\lambda=1$, which seems to be correct experimentally. Thus we compute our E-values with no parameter fitting at all--it is based purely on theoretical considerations.

The calibrate option of hmmscore can be used to calibrate the lambda values for the E-value calculation. Once a model library has been created, it can be specified to hmmscore using the model_library directive (or modlib alias). See Section 10.2.10.

The hmmscore program can also be used to select sequences according to various criteria.

Plots of the NLL-NULL scores can be used to visually look for a break between significant and insignificant matches. See Section 10.11.


10.2.1 NLL-NULL scoring

SAM includes several possibilities for NULL model scoring. The null model can be a simple probability distribution, effectively a model with a single FIM. The null model can be any model specified in SAM format, with the key word `NULLMODEL' (rather than, for example, `MODEL' or `REGULARIZER'), or the first model in a file specified with the nullmodel_file parameter. The null model can be the reversed HMM, or equivalently, the score of the reversed sequence through the original HMM.

To report differences between the model NLL score and the simple null model score (possibly modified by FIM_method_score, see below), set the subtract_null variable to 1. To report differences between two models (for example, one trained on positive family examples and one trained on negative examples of a family), set subtract_null to 3. To report differences between the score of the sequence and the score of the reversed sequence, which provides an automatic adjustment for compositional bias and allows the simple calculation of E-values, set subtract_null to 4 (the default). To report scaled reverse null model scores, set subtract_null to 5. The complex null model (previously subtract_null 2) is no longer supported, and reverts to the simple null model.

Because the difference between the sequence and reverse sequence scores automatically adjusts for model and sequence length, SAM is able to add E-values to the score file in this case. The command


hmmscore testrev -i test.mod -db trna10.seq -subtract_null 4 -sw 2
produces a score file that includes both simple and reverse-sequence null model scores:

%  SAM: hmmscore v3.5 (July 15, 2005) compiled 07/15/05_11:25:31
%  (c) 1992-2001 Regents of the University of California, Santa Cruz
%
%          Sequence Alignment and Modeling Software System
%         http://www.cse.ucsc.edu/research/compbio/sam.html
%
% --------- Citations (SAM, SAM-T2K, HMMs) ----------
% R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
%  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
% K. Karplus, et al., What is the value added by human intervention in protein
%  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
% A. Krogh et al., Hidden Markov models in computational biology:
%  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
% -----------------------------------
%  testrev   Host: peep    Fri Jul 15 13:45:01 2005
%  rph       Dir:  /projects/kestrel/rph/sam32/SAMBUILD/peep/demos
% -----------------------------------
% Inserted Files:  test.mod
% Database Files:  /projects/kestrel/rph/sam32/demos/trna10.seq
%
% Subsequence-submodel (local) (SW = 2)
% Simple scores adjusted by +1.5*ln(seq len) (adjust_score = 2)
% Track 0 FIMs added (geometric mean of match probabilities (6))
% Single Track Model:  test.mod
% Score DP Method: forward all-paths (dpstyle = 0)
% Align DP Method: viterbi (adpstyle = 1)
% 10 sequences, 747 residues, 78 nodes, 0.01 seconds
%
% Sequence scores selected:  All  (select_score=8)
%
% Simple: NLL-NULL using FIM probabilities
% Reverse: NLL-NULL for the reverse sequence NULL model
%    Calculated when Simple < simple_threshold (10000.00)
% E-value on N (=10) sequences:
%    N / (1 + exp(-(lambda(=1.0000) * Reverse)**tau(=1.0000)))
%    Calculated when Simple < simple_threshold (10000.00)
%    Rescale E-values or use -dbsize for multiple scoring runs.
%    WARNING:  E-VALUES ARE NOT CALIBRATED!
% Scores sorted by E-value, best first
%
% Sequence ID   Length      Simple     Reverse     E-value      X count
TRNA7              73       -36.19      -28.50     4.18e-12
TRNA2              76       -35.48      -28.13     6.08e-12
TRNA4              75       -35.66      -27.90     7.65e-12
TRNA9              77       -35.54      -27.82     8.29e-12
TRNA10             76       -35.21      -27.69     9.39e-12
TRNA1              72       -34.85      -27.32     1.37e-11
TRNA8              75       -36.32      -27.26     1.44e-11
TRNA5              73       -35.10      -26.36     3.58e-11
TRNA3              76       -34.43      -26.26     3.93e-11
TRNA6              74       -34.59      -23.81     4.56e-10

We are presently experimenting with a scaled reverse null model score (subtract_null set to 5). This is computation is inspired by the PSI-BLAST approach. The scaled score is the reverse null model score divided by the average simple null model score per character.

The NLL-NULL scores, especially for the simple null model, are most useful when the model has had free insertion modules (Section 8.5) added to it. Then, the null model and the FIMs will cancel out, and the score will be based primarily on the section of the sequence that matches the region that has been modeled. By default, hmmscore automatically adds FIMs to any model that does not already contain them when SW scoring is used. To change this, if for example you want to ensure that entire sequences are modeled, rather than simply subregions, change auto_fim from its default value of 1 to 0. The auto_fim variable has no effect when a user-specified null model is used.

Since NLL-NULL scores are negative logs, the lower the better. In the case above, all of the tRNAs have been positively identified as tRNAs. (Not surprising as they were all in the training set!)

New to Version 1.2 is the ability to adjust the null model scoring. Since this determines the probability that a sequence was randomly generated according to the residue insertion probabilities, these values should reflect knowledge of the problem domain. Five possibilities are offered. The flat distribution or the background distribution of amino acids over all proteins can be used. Both of these distributions are invariant over all families, and are thus a simplistic assumption. The distribution can also be the distribution of the residues in the training set or the average residue distribution over all columns (match states) modeled by the HMM. The advantage of these two, especially the latter, is the ability to partially correct for compositional bias in the sequences. Lastly, the insertion probabilities can reflect the residue distribution of the sequence currently being scored. This is the most pessimistic null model, as it demands not that the HMM model the sequence better than fixed background frequencies, but that is model the sequence significantly better than frequencies exactly matching the sequence's composition.

So the options available for FIM_method_score are

0
Use the tables present in the model.
1
The relative frequencies of residues in the training sequences (from the Lettercounts node or the training sequences).
2
The relative frequencies of residues in model match states (from the Frequency node).
3
Uniform (flat) probability over all residues.
5
Amino acid background frequencies over all proteins (from the Generic node).
6
Geometric average of the match state probabilities in the model.

The default setting, for experimental and statistical reasons, is the geometric average of the model match states (6). The insertion tables can be similarly modified with Insert_method_score, the default of which is no change (0).

As with the training methods, if the method value is negative instead, the FIMs and insert tables will only be modified if there is no initial model read in (an unlikely occurrence for hmmscore).

The FIM scoring methods are only applied to the primary track of multitrack models. Other tracks never have their insert or FIM tables changed, and if FIMs are added to the model either the character table of the generic node is used (when present) or the geometric match state average (when there is no generic node).

The FIM scoring methods are not applied to the user's null model.

A more detailed discussion of these issues can be found in (Barrett, Hughey & Karplus 1996), mentioned in the Introduction.


10.2.2 Reducing hmmscore runtime

Full evaluation of a hidden Markov model takes time. The dynamic programming algorithm underlying hmmscore and buildmodel requirest time proportional to the product of the length of the model and the total length of the sequences.

The choice of dynamic programming style (dpstyle) also has a dramatic effect on execution time. The default all-paths (EM) method (dpstyle 1) multiplies and adds probabilities represented as log-probabilities in each of the $O(n^2)$ dynamic programming cell calculations. Adding probabilities represented as log-probabilities is internally sped with table-lookup, but still takes time. The single-best-path (Viterbi) method (dpstyle 0) multiplies and maximizes probabilities represented as log-probabilities, where the operations become addition and minimization, and thus is faster. The posterior-decoded (PD) alignment and scoring methods (dpstyle 5) perform an EM pass followed by a Viterbi pass, and use far more memory, and are the slowest. EM scoring is approximately twice as slow as Viterbi scoring, and PD scoring is three times slower than EM because a full forward-backward algorithm (as used in training) is performed followed by the Viterbi algorithm. For training models with buildmodel, Viterbi is about X times faster than EM. The SAM-T04 iterative model building procedure uses Viterbi training and full EM multiple domain scoring with the reverse null model.

The default reverse null model means that two dynamic programming calculations are done for each sequence. The simple_threshold variable can be used to control when the second calculation is performed based on the simple null model. When set to less than 10000, the reverse (or user) null model will only be evaluated when the simple null model score is less than simple_threshold. Sequences for which the more time consuming null model was not calculated will have the score 10000 in the second score column of the distance file. The default value of simple_threshold is 10000 (off); a setting of $0$ is a reasonable setting to halve the running time of database searches, though occasionally some positive E-values may be lost. It is suggested that users experiment on their own data to ensure results are not lost. Runtime for a large database search will be sped by approximately one half.

Independtly, when EM or PD scoring is used, the viterbi_threshold variable may be used to first calculate a single-best-path (Viterbi) simple null model score, and then decide whether or not to perform the EM or PD scoring based on that score. As with simple_threshold, the default value is 10000, and lower values will activate this option. A setting of 0 is also reasonable. This can reduce PD execution time by about ZZ, and EM execution time by about YY.


10.2.3 Selecting sequences, scores, and alignments

Sequences can be selected by hmmscore and placed in a .sel file.

A selection mode is chosen by setting select_seq. If 0, no sequences are selected; if 1, sequences are selected according to their simple null model scores and NLLNull; if 2, sequences are selected according to their column 2 score (user or reverse sequence null, or raw NLL score if subtract_null is 0 or 1) and NLLcomplex; if 4, sequences are selected according to their E-value and Emax; if 8, all sequences are selected. Selection criteria can be combined: 3 requires sequences to score better than NLLnull with the simple null model and NLLcomplex with the complex (user or reverse sequence) null model. Negative numbers indicate that sequences that do not pass the corresponding positive test should be selected.

The following will place labeled copies of all sequences scoring lower than -35 with the reverse sequence null model into test.sel.


hmmscore tests -i test.mod -db trna10.seq -select_seq 2 -NLLcomplex -35 -sw 2

Selected sequences are written out in the same order they are encountered in the database files, which may be different from the order they are listed in the score file if scores are sorted. The sortseq program can be used to write out the sequences in the same order they are listed in the score file. See Section 10.12.8. The sort variable controls whether sequence scores are unsorted (0); sorted by E-value (4), column 2 of the distance file (2), or column 1 of the distance file (1). Sequences in the select file will be selected by other than the sorting criteria unless sort and select_seq are set to corresponding values.

The select_score variable can be set in the same manner as select_seq, in which case only scores of those sequences that match the specified criteria will be recorded in the distance file. This is particularly useful for database searches in which only sequence IDs are of interest.

The select_align variable can be used in a similar way to cause selected sequence alignments to be placed in the runname.a2m file. Note that all selection variables use the same NLLnull, NLLcomplex, and Emax thresholds, though different combinations of the thresholds can be used by the different parameters.

If the binary expansion of the many_files variable includes a '2' (e.g., 2, 3, 6, 7), the score information is sent to standard output. If the binary expansion of the many_files variable includes a '4' (e.g., 4, 5, 6, 7), the multiple domain score information is sent to standard output. (See Section 10.2.5.) If the binary expansion of the many_files variable includes a '1' (e.g., 1, 3, 5, 7), buildmodel will create multiple files. See first two options enable UNIX pipe processing, such as:


hmmscore tests -i test.mod -db trna10.seq | awk  '$1 !~ % { print $5 "\t" $1}'
to print a file with an evalue and an identifier on each line after removing the comment lines. If both options are in force (e.g., 6, 7), both the multiple domain and the simple scoring values will be sent to standard output in an undefined order. In a terminal window, you will also see hmmscore's standard error output of various diagnostic messages.


10.2.4 Scoring Fragments

As with alignments, the SW parameter can be used to set global (0), semi-local (1), local (2), and domain (3) scoring styles. See Section 10.1.2.

The default local scoring is is similar to Smith and Waterman method of sequence comparison, which will find the best score for any pair of subsequences within two sequences. The same can be done with models, allowing a submodel to match a subsequence. With local scoring, sequences can jump from the initial module (presumably a FIM) into the delete state of any module in the model, and can also jump out of the delete state of any module within the model to the delete state of the next-to-last node. The first and next-to-last module are assumed to be FIMs, hence the rational is that a sequence will use the FIM for some period of time to consume characters that do not match the model, then the sequence will jump to the model node corresponding to the start of the fragment, use several model nodes, and then jump to the ending FIM to consume the rest of the sequence.

The probability of these jumps is set by the variables jump_in_prob and jump_out_prob, both of which have a default value of unity. That is, as in the sequence-to-sequence Smith and Waterman, there is no cost associated with jumping in and out of the model.

The file trna1frag.seq contains several sequences that contain part or all of TRNA1. The sequence include TRNA1 (72 bases, TRNA1Long (the complete tRNA with additional characters), Long (58 base segment) TRNA), Medium (34 base segment), Short (6 base segment TRNA1), and AAMediumA, an embedding of Medium within several segments of As to bring it to 176 characters. Additionally, the file contains several (obviously) non-tRNAs of various lengths, all of whose IDs begin with the word `Not'.

When this file is scored using hmmscore with SW set to 0 for global alignment, the scores of the full sequence and the long fragment place them clearly as tRNAs. However, the score of the short and medium fragments are greatly penalized by the large number of delete states they must use.


TRNA1Long           94       -36.25      -28.73     4.01e-12	:TRNA1 with flanking As                            
TRNA1               72       -34.85      -27.32     1.64e-11	:the original sequence                             
Long                58       -27.64      -20.06     2.33e-08	:A long partial segment from TRNA1                 
AAMediumAA         176       -16.81       -9.61     8.04e-04	:The medium segment with long flanking A regions   
Medium              34       -14.62       -7.63     5.84e-03	:A medium-length segment from TRNA1                
ShortReverse         6        -5.77       -0.11     5.68e+00	:A short reverse segment from TRNA1                
NotShort            13        -6.42        0.00     6.00e+00	:A short segment that is not TRNA1                 
Not                 73        -6.84        0.00     6.00e+00	:A string of As of equal length to TRNA1           
NotExtraLong       438        -6.90        0.00     6.00e+00	:A string of As much longer than TRNA1             
NotLong            109        -6.86        0.00     6.00e+00	:A string of As longer than TRNA1                  
Short                6        -5.66        0.11     6.32e+00	:A short segment from TRNA1                        
MediumReverse       34        -7.00        7.63     1.20e+01	:The reversal of medium                            
(Several of the sequences had simple null model scores worse than simple_threshold, so their reverse null model scores were not calculated.)

When scoring is performed using the SW option set to 1, the following score file is generated, which places even the short fragment in the possible tRNA range.


TRNA1               72       -31.14      -37.43     6.69e-16	:the original sequence                             
TRNA1Long           94       -19.56      -28.35     5.88e-12	:TRNA1 with flanking As                            
Long                58       -24.76      -27.13     1.99e-11	:A long partial segment from TRNA1                 
Medium              34       -12.10      -13.03     2.64e-05	:A medium-length segment from TRNA1                
Short                6        -1.47       -0.04     5.87e+00	:A short segment from TRNA1                        
NotExtraLong       438        26.74       -0.00     6.00e+00	:A string of As much longer than TRNA1             
NotLong            109        10.22       -0.00     6.00e+00	:A string of As longer than TRNA1                  
NotShort            13         0.58        0.00     6.00e+00	:A short segment that is not TRNA1                 
Not                 73         8.16        0.00     6.00e+00	:A string of As of equal length to TRNA1           
AAMediumAA         176        13.75        0.00     6.00e+00	:The medium segment with long flanking A regions   
ShortReverse         6        -1.43        0.04     6.13e+00	:A short reverse segment from TRNA1                
MediumReverse       34         0.93       13.03     1.20e+01	:The reversal of medium                            

When scoring is performed using the SW option set to 2, the following score file is generated, which also picks up the sequence AAMediumAA, which is a segment of a tRNA embedded within a longer sequence.


TRNA1Long           94       -36.25      -28.73     4.01e-12	:TRNA1 with flanking As                            
TRNA1               72       -34.85      -27.32     1.64e-11	:the original sequence                             
Long                58       -27.64      -20.06     2.33e-08	:A long partial segment from TRNA1                 
AAMediumAA         176       -16.81       -9.61     8.04e-04	:The medium segment with long flanking A regions   
Medium              34       -14.62       -7.63     5.84e-03	:A medium-length segment from TRNA1                
ShortReverse         6        -5.77       -0.11     5.68e+00	:A short reverse segment from TRNA1                
NotShort            13        -6.42        0.00     6.00e+00	:A short segment that is not TRNA1                 
Not                 73        -6.84        0.00     6.00e+00	:A string of As of equal length to TRNA1           
NotExtraLong       438        -6.90        0.00     6.00e+00	:A string of As much longer than TRNA1             
NotLong            109        -6.86        0.00     6.00e+00	:A string of As longer than TRNA1                  
Short                6        -5.66        0.11     6.32e+00	:A short segment from TRNA1                        
MediumReverse       34        -7.00        7.63     1.20e+01	:The reversal of medium                            

The unclear separation in this file between the best non-tRNA and the worst tRNA segment is because local alignment allows sequences to use only the parts of the model that improve its score. Thus, a long random sequence will have a better score than a shorter random sequence. Better scores are gained by using the more expensive (to calculate) reverse sequence null model. Setting subtract_null to 4 will differentiate the tRNAs, excepting the 6-nucleotide sequence, from the non-tRNAs. (In this contrived example, the long sequences are composed of the single letter A, and thus the sequence and the reverse sequence score the same going through the HMM.)


TRNA1Long           94       -36.25      -28.73     4.01e-12	:TRNA1 with flanking As                            
TRNA1               72       -34.85      -27.32     1.64e-11	:the original sequence                             
Long                58       -27.64      -20.06     2.33e-08	:A long partial segment from TRNA1                 
AAMediumAA         176       -16.81       -9.61     8.04e-04	:The medium segment with long flanking A regions   
Medium              34       -14.62       -7.63     5.84e-03	:A medium-length segment from TRNA1                
ShortReverse         6        -5.77       -0.11     5.68e+00	:A short reverse segment from TRNA1                
NotShort            13        -6.42        0.00     6.00e+00	:A short segment that is not TRNA1                 
Not                 73        -6.84        0.00     6.00e+00	:A string of As of equal length to TRNA1           
NotExtraLong       438        -6.90        0.00     6.00e+00	:A string of As much longer than TRNA1             
NotLong            109        -6.86        0.00     6.00e+00	:A string of As longer than TRNA1                  
Short                6        -5.66        0.11     6.32e+00	:A short segment from TRNA1                        
MediumReverse       34        -7.00        7.63     1.20e+01	:The reversal of medium                            

Significance levels for the simple null model, especially for the second option, change greatly. In the first option, because sequences can start at any location (e.g., the initial FIM) and jump out of any location (e.g., also the initial FIM), no sequence will have scores worse than zero -- even non-family members will have a negative NLL$-$NULL score. However, the significance level will be similar to that of standard scoring.

In the second option, the number of placements of a sequence to the model is essentially the number of starting points of the sequence plus the number of exit points once the sequence has started using the core of the model. That is, sequences start in the initial FIM, may at any time jump anywhere into the model, and then jump out again latter. The effect seen in the significance level depends on both the sequence length and the model length, and for comparison between different models, must be done to the scores themselves as they are being generated. If the adjust_score variable is set to 1 and SW=2, all simple null model scores will have added to them the log of the sum of sequence and model length. If adjust_score is set to 2 (the default) and SW=1 or 2, then all scores will have added to them 1.5 times the log of the sequence length. In the future, the adjust_score parameter may be refined as we further explore the length dependence of scores. The adjustment is not sufficient for models that have internal FIMs. See Section 10.2.1.


10.2.5 Selecting multiple domain alignments

The hmmscore program also can create multiple-domain alignments and score files from selected sequences. Prior to Version 2.1, this feature was called the multdomain program. To enable this option, the select_mdalign parameter is set in a manner similar to other selection parameters. See Section 10.2.3.

For each selected sequence, the multiple domain search procedure will locate copies of a single motif within each selected sequence. A user specified select_md selection variable, along with thresholds mdNLLnull, mdNLLcomplex, and mdEmax is the criterion by which a subsequence is judged to be a match to the model. If select_md is 1, whenever an mdNLLnull simple null model score or better is achieved, a match to the model has been found, and so forth. The default select_md value of 4 uses mdEmax (default of 0.01) to select subsequence matches. The adpstyle paramater determines whether Viterbi (1, default), posterior-decoded with transitions and emissions (4), or posterior-decoded with only character emissions (5) alignment is used.

Once this match is found, it is cut from the sequence and another match is looked for. The process terminates when no matches scoring better than the selection criteria are found. Note that the multiple domain scoring procedure always uses Viterbi scoring. Thus, it is theoretically possible for a sequence to be selected by hmmscore for multiple domain search but for no domain to be found even if the selection criteria are the same.

The output is similar to that of align2model, except that for each match the sequence ID is modified to indicate where in the sequence the match occurred. Additionally, all letters in the sequence that are part of the match are capitalized. Unlike align2model, multiple domain search sequence output does not include periods (`.') as spacers: prettyalign must be used to correctly space the multiple alignment.

As an example, the file multtrna.seq contains two sequences, each of which contains two tRNA motifs. The multdomain feature is used by selecting selecting all sequences for a multiple domain


hmmscore multtrna -i testf.mod -db multtrna.seq -select_mdalign 8 -sw 2
prettyalign multtrna.mult -l90 > multtrna.pretty

the file multtrna.pretty generated is


;  SAM: prettyalign v3.5 (July 15, 2005) compiled 07/15/05_11:25:33
;  (c) 1992-2001 Regents of the University of California, Santa Cruz
;
;          Sequence Alignment and Modeling Software System
;         http://www.cse.ucsc.edu/research/compbio/sam.html
;
; --------- Citations (SAM, SAM-T2K, HMMs) ----------
; R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
;  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
; K. Karplus, et al., What is the value added by human intervention in protein
;  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
; A. Krogh et al., Hidden Markov models in computational biology:
;  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
; -----------------------------------
                                                                      
                                                                      
TRNA12_90:163  ugcuaggggauguagcucagugguagagcgcaugcuucgcauguaugaggccccg
TRNA12_8:77    ugcuagg................................................
TRNA34_104:172 gcuagcguaggcccuguggcuagcuggucaaagcgccugucuaguaaacaggaga
TRNA34_20:87   gcuagcguaggcccugugg....................................

                                                                                                                                              TRNA12_90:163  gguucgauccccggcaucuccaguacugcguugc..............GGCCGUC TRNA12_8:77    ................................................GGAUGUA TRNA34_104:172 uccuggguucgaaucccagcggggccuccagcauaguuugacgggcga--AUA TRNA34_20:87   ................................................----

                10            20        30        40           50                       |             |         |         |            |      TRNA12_90:163  GUCUAGUCUGgauU.AGGACGCUGGCCUCCCAAGCCAGcA.AU.CCCGGGUUCGA TRNA12_8:77    GCUCAG-UGG...U.AGAGCGCAUGCUUCGCAUGUAUG.A.GGcCCCGGGUUCGA TRNA34_104:172 GUGUCAGCGG...G.AGCACACCAGACUUGCAAUCUGG.U.AG.GGAGGGUUCGA TRNA34_20:87   -CUAGCUGG...UcAAAGCGCCUGUCUAGUAAACAGG.AgAU.CCUGGGUUCGA

                  60        70                                                            |         |                                         TRNA12_90:163  AUCCCGGCGGCCGCA----ucgcuu.......................... TRNA12_8:77    UCCCCGGCAUCUCCA----guacugcguugcggccgucgucuagucuggau TRNA34_104:172 GUCCCUCUUUGUCCACCA---guacguagauccgcggc............... TRNA34_20:87   AUCCCAGCGGGGCCUCCAGC--auaguuugacgggcgaauagugucagcgggag

                                                                                                                                              TRNA12_90:163  ....................................................... TRNA12_8:77    uaggacgcuggccucccaagccagcaaucccggguucgaaucccggcggccgcau TRNA34_104:172 ....................................................... TRNA34_20:87   cacaccagacuugcaaucugguagggaggguucgagucccucuuuguccaccagu

                                                              TRNA12_90:163  ............... TRNA12_8:77    cgcuu.......... TRNA34_104:172 ............... TRNA34_20:87   acguagauccgcggc

This file shows the matching area of the sequence within several copies of the sequence. Even though there is an extra FIM state in the model, the required ending delete state is automatically removed from the alignment because auto_fim is at its default value of 1.

If the variable alignshort is set to zero or higher, matching segments of the sequence are clipped, with alignshort positions shown on either side. The IDs are the same as if complete sequences are printed, corresponding to the starting and ending points of the motif within the original sequence. Depending on how large alignshort is, the subsequences may overlap. For example, the commands


hmmscore multtrnas -i testf.mod -db multtrna.seq -alignshort 3
        -select_mdalign 8 -sw 2
prettyalign multtrnas.mult -l90 > multtrnas.pretty
produce the alignment

;  SAM: prettyalign v3.5 (July 15, 2005) compiled 07/15/05_11:25:33
;  (c) 1992-2001 Regents of the University of California, Santa Cruz
;
;          Sequence Alignment and Modeling Software System
;         http://www.cse.ucsc.edu/research/compbio/sam.html
;
; --------- Citations (SAM, SAM-T2K, HMMs) ----------
; R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
;  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
; K. Karplus, et al., What is the value added by human intervention in protein
;  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
; A. Krogh et al., Hidden Markov models in computational biology:
;  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
; -----------------------------------
                          10            20        30        40        
                           |             |         |         |        
TRNA12_90:163  ugcGGCCGUCGUCUAGUCUGgauU.AGGACGCUGGCCUCCCAAGCCAGcA.AU.C
TRNA12_8:77    aggGGAUGUAGCUCAG-UGG...U.AGAGCGCAUGCUUCGCAUGUAUG.A.GGcC
TRNA34_104:172 cga--AUAGUGUCAGCGG...G.AGCACACCAGACUUGCAAUCUGG.U.AG.G
TRNA34_20:87   ugg-----CUAGCUGG...UcAAAGCGCCUGUCUAGUAAACAGG.AgAU.C

                  50        60        70                               |         |         |            TRNA12_90:163  CCGGGUUCGAAUCCCGGCGGCCGCA----ucg TRNA12_8:77    CCGGGUUCGAUCCCCGGCAUCUCCA----gua TRNA34_104:172 GAGGGUUCGAGUCCCUCUUUGUCCACCA---gua TRNA34_20:87   CUGGGUUCGAAUCCCAGCGGGGCCUCCAGC--aua

In addition to multtrna.mult, the file multtrna.mstat is produced. It reports the NLL-NULL score for each of the motifs listed in multtrna.mult. In this case, multtrna.mstat (or multtrnas.mstat) looks like


%  SAM: hmmscore v3.5 (July 15, 2005) compiled 07/15/05_11:25:31
%  (c) 1992-2001 Regents of the University of California, Santa Cruz
%
%          Sequence Alignment and Modeling Software System
%         http://www.cse.ucsc.edu/research/compbio/sam.html
%
% --------- Citations (SAM, SAM-T2K, HMMs) ----------
% R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
%  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
% K. Karplus, et al., What is the value added by human intervention in protein
%  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
% A. Krogh et al., Hidden Markov models in computational biology:
%  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
% -----------------------------------
%  multtrna   Host: peep    Fri Jul 15 13:45:02 2005
%  rph        Dir:  /projects/kestrel/rph/sam32/SAMBUILD/peep/demos
% -----------------------------------
% Inserted Files:  testf.mod
% Database Files:  /projects/kestrel/rph/sam32/demos/multtrna.seq
%
% Motifcutoff 0.500000
% Motifs selected from sequences into file multtrna.mult if better than:
%	 E-value (mdEmax 1.0e-02) (select_md=4)
% See related information in multtrna.dist
%
% Simple: NLL-NULL using FIM probabilities
% Reverse: NLL-NULL for the reverse sequence NULL model
%    Calculated when Simple < simple_threshold (10000.00)
% E-value on N (=2) sequences:
%    N / (1 + exp(-(lambda(=1.0000) * Reverse)**tau(=1.0000)))
%    Calculated when Simple < simple_threshold (10000.00)
%    Rescale E-values or use -dbsize for multiple scoring runs.
%    WARNING:  E-VALUES ARE NOT CALIBRATED!
% Scores sorted by E-value, best first
%
% Sequence ID     Length      Simple     Reverse     E-value      X count
TRNA34_104:172       69       -32.06      -30.14     1.63e-13
TRNA12_90:163        74       -31.41      -29.78     2.34e-13
TRNA12_8:77          70       -30.25      -27.98     1.42e-12
TRNA34_20:87         68       -29.98      -27.46     2.37e-12

Because reliable results are only obtained if FIMs are added to the model, multiple domain searchers are best performed when auto_fim is set to 1 (the default).

If the binary expansion of the many_files variable includes a '4' (e.g., 4, 5, 6, 7), the multiple domain score information (.mstat) is sent to standard output instead. See Section 10.2.3.


10.2.6 Multi-track HMM scoring

Suppose one wanted to model a group of protein sequences with associated secondary structure information. The secondary structure labels are important information that could lead to better modelling. To enable such analysis (without the need, for example, of a 60 character alphabet of each amino acid with each secondary structure type), SAM can process multi-track sequences and multi-track HMMs.

A multi-track sequence is specified by a pair or set of sequence files. The first sequence file contains the first track of data (amino acid sequences in this case), while the second sequence file contains the second track of data (secondary structure) with each sequence corresponding exactly in length, identifier, and position to a sequence in the first file. The sequences are specified as a comma-separated list with the db paramater.

Multi-track HMMs are similarly paired or sets of SAM HMM files. The models are specified as a comma-separated list for the -trackmod filename list variable. During the dynamic programming calculation, all transition probabilities are taken from the first or primary track, while character emission probabilities are taken from the joint (product) probability of the first track character being emitted from the given first track HMM state, and the second track character being emitted from the given second track HMM state, and so on. All track models must, of course, be of the same length.

By default, the tracks are all given unity weight, so that, ignoring transition probabilities, a 2-track model will produce scores of about twice the magnitude of a 1-track model. To compensate for this, each track can also be given a character emission coefficient with trackcoeff. The scores (log-probabilities) of the match and insert states are scaled by this ($\geq 0.0$) value, so that a 2-track model with identical tracks and coefficients of 0.5 will produce the same scores (approximately, due to rounding) as a one-track model, such as with the following two commands:


hmmscore t2  -trackmod test.mod,test.mod -db trna10.seq,trna10.seq
             -a RNA,RNA -trackcoeff 0.5,0.5
hmmscore t1 -i test.mod -db trna10.seq

if a mixture of user-defined and standard alphabets are used, the hyphen (`-') can be used in place of the normal character list in an alphabetdef statement to indicate a standard alphabet. For example:


hmmscore t2  -trackmod test.mod,test.mod -db trna10.seq,trna10.seq
             -alphabetdef "RNA -","RNA -" -trackcoeff 0.5,0.5

Multi-track models are easy to create with external software scripts.

The additional tracks have the same SW mode preprocessing (i.e., possibly adding FIMs), but never have their insert or FIM character tables changed (i.e., Insert_method_score and FIM_method_score are ignored for tracks other than the first). If FIMs are added to the model either the character table of the generic node is used (when present) or the geometric match state average (when there is no generic node).


10.2.7 Distributed scoring

Scoring a large database with hmmscore can take several hours on even the fastest workstation. The scoring program includes primitive support for distributed scoring. If the segments variable is set to an integer larger than 1, hmmscore assumes that that many runs of hmmscore are being used to score a complete database. The segment_number variable is used to label each segment.

These parameters might be used as follows:


hmmscore  test1 -i test.mod -db bigdatabase -segments 2 -segment_number 1 -sw 2&
rsh othermachine hmmscore test2 -i test.mod -db bigdatabase -segments 2 -segment_number 2 -sw 2
wait cat test1.dist test2.dist > test.dist

The associated parameter, segment_size, specifies that number of sequences that are read in at a time. At a value of 100, two segments would produce the effect that the first 100 sequences are processed by segment 1, the next 100 be segment 2, the next 100 by segment 1, and so on. Note that workload is partitioned according to the number of sequences rather than the number of residues, some segments may take longer to complete than other segments.

The segment_size parameter is important even if distributed scoring is not being performed to reduce memory consumption by hmmscore. Unfortunately, selection based on E-values requires knowledge of the database size. For this reason, if selection is being performed, segmented file reading will be performed twice; the first time to calculate the size of the database, in the second time to perform the scoring. The dbsize variable can be set to externally indicate the size of the database for calculating E-values. When set, the first reading of the database is not performed.


10.2.8 Single Sequence Smith & Waterman Scoring and Alignment

The hmmscore program can be used to perform Smith and Waterman scoring, alignment, and multiple domain alignment. This option is achieved by internally creating a SAM model based on a single query sequence, gap and continue penalties, and a scoring matrix. These models cannot be stored, and can only be used with Viterbi scoring. Local or global alignment can be performed on this model by appropriately setting sw.

The internal model zeros the appropriate transitions and uses an apporpriately simplified dynamic programming calculation. Its dynamic programming speed is about the same as a dedicated Smith and Waterman program, though the default reverse-sequence e-value calculation doubles execution time.

Smith and Waterman mode is specified by including a query file on the command line. The first sequence in this file is taken to be the query sequence. Additional parameters include gap and continue, which are by default set to 12 and 1, respectively, and matrix, set to ``blosum62.'' Matrices are read in BLAST format. They are first looked for in the current working directory, then in the directory pointed to by BLASTMAT, if set. If not, the subdirectories aa and nt, depending on alphabet, of the environment variable PRIOR_PATH are checked, and finally, the same subdirectories of the default prior path compiled into the code. See Section 8.1.1.

As with other forms of scoring, SAM is able to calculate e-values for Smith and Waterman queries based on the reverse sequence in all model. For this to be effective, the system must know the scaling factor for comping E-values based on how the scoring matrix has been formed. The value swlambda should be set to $\ln(b)/u$, where $b$ is the base of the matrix and $1/u$ is the unit weight. For example, the Hennikoff & Hennifkoff ``blosum62'' matrix distributed with BLAST and included with SAM is in units of 1/2 bit, so $b=2$ (bits are base 2) and $u=2$, since a `2' in the matrix corresponds one unit (a bit, in this case). The default value of lambda is thus $\ln(2)/2=0.34657$.


hmmscore sw -query globins50 -db globins50              -matrix blosum62 -gap 12 -continue 1 
will produce a score file using the first sequence of globins50 as a query and the first 10 sequence of globins50 as a database.

%  SAM: hmmscore v3.5 (July 15, 2005) compiled 07/15/05_11:25:31
%  (c) 1992-2001 Regents of the University of California, Santa Cruz
%
%          Sequence Alignment and Modeling Software System
%         http://www.cse.ucsc.edu/research/compbio/sam.html
%
% --------- Citations (SAM, SAM-T2K, HMMs) ----------
% R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
%  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
% K. Karplus, et al., What is the value added by human intervention in protein
%  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
% A. Krogh et al., Hidden Markov models in computational biology:
%  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
% -----------------------------------
%  sw   Host: peep    Fri Jul 15 13:45:02 2005
%  rph   Dir:  /projects/kestrel/rph/sam32/SAMBUILD/peep/demos
% -----------------------------------
% Inserted Files: 
% Database Files:  /projects/kestrel/rph/sam32/demos/globins50.seq
%
% Smith & Waterman Query Sequence:  BAHG$VITSP
% Cost matrix:  /projects/kestrel/rph/sam32/lib/matrix/aa/blosum62
% Gap cost 12    Continue cost 1
% Score DP Method: viterbi single-path (dpstyle = 1)
% Align DP Method: viterbi (adpstyle = 1)
% 50 sequences, 7369 residues, 0 nodes, 0.08 seconds
%
% Sequence scores selected:  All  (select_score=8)
%
% Simple: NLL-NULL using FIM probabilities
% Reverse: NLL-NULL for the reverse sequence NULL model
%    Calculated when Simple < simple_threshold (10000.00)
% E-value on N (=50) sequences:
%    N / (1 + exp(-(swlambda(=0.3466) * Reverse)**tau(=1.0000)))
%    Rescale E-values or use -dbsize for multiple scoring runs.
%    WARNING:  E-VALUES ARE NOT CALIBRATED!
% Scores sorted by E-value, best first
%
% Sequence ID   Length      Simple     Reverse     E-value      X count
The various select options (e.g., selectalign, selectmdalign, etc.) can also be used, according to Smith and Waterman score (NLLnull and mdNLLnull), reverse sequence (NLLcomplex and mdNLLcomplex, or e-value (Evalue and mdEvalue) score.


10.2.9 Scoring using the Kestrel parallel processor

The hmmscore program can use the UCSC Kestrel parallel processor to perform high-speed scoring of sequence databases.

Currently hmmscore only supports global or local EM scoring on Kestrel and models that have a final length, with added FIMs, of no more than 512 nodes. If any of these restrictions are not met, hmmscore reverts to the sequential algorithm, unless kestrel_fallback is set to 0.

Kestrel scoring is enabled using the use_kestrel parameter. Since a request queue mechanism is not currently available as part of the Kestrel runtime environment, a sleep and retry mechanism is implemented as part of hmmscore. Retrying is controlled using the kestrel_retry_cnt and kestrel_retry_time parameters. The kestrel runtime environment program is executed to access the Kestrel server and is found using the PATH.

Small models will generally score more quickly using the sequential algorithm than with Kestrel. A minimum model size for using Kestrel may be