Next: Violations of assumption needed
Up: Markov Models as compression
Previous: Hidden Markov Models
Hidden Markov models offer many advantages over simple Markov models
for modeling biological sequences:
-
A well-tuned HMM generally provides better compression than a
simple Markov model, allowing more sequences to be found at a
significant level.
-
The models are fairly readable (at least when drawn rather than just
listed).
A high-quality model for REPs (compressing previously unseen REPs to
about 1.25 bits/base) may have around 200 states and 300 edges,
rather than the 4^9 = 262,144
counts of the order-8 simple Markov model.
The low ratio of edges to states means that large parts of the model
are simple straight-line sequences, which are easy to draw and to
understand.
-
The HMMs can be used for generating alignments, with each state of
the machine corresponding to one column in the alignment.
The best path found by the Viterbi algorithm identifies a state for
each position, and that in turn can specify the column.
HMMs are a bit more powerful than alignments, since the
same state can be used repeatedly in a path, but each column can only
be used once in an alignment.
This results in ambiguous alignments if a column alignment model is
used, but can be quite convenient for describing phenomena like random
numbers of repeats of a short subsequence.
HMMs also allow variant structures to be modeled directly, not just
as inserts and deletes to a consensus sequence.
For example, the REPv variant of the REP sequence, often found next to
IHF binding sites [9], is modeled very clearly by
REP99-gxn.hmm400m. In fact, the IHF binding site itself
occurs frequently enough next to REPs to have been included in the model.
-
Separate HMMs built for recognizing particular structures can be
merged to create HMMs that recognize sequences of
structures [5]. Unfortunately, doing this cleanly
requires a slightly different version of HMMs which allows
null states--states that don't match any characters in the input
sequence. The current version of my HMM code cannot handle HMMs
with null states, but the extension is planned and should be straightforward.
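The correspondence between a Viterbi path and alignment columns, and the
way a repeated state departs from a strict column alignment, can be
sketched with a toy model. The following is a minimal Python sketch, not
the HMM code described here: the three states (c1, rep, c2), their
probabilities, and the function name viterbi are all hypothetical and
chosen only for illustration. It has no null states and no explicit end
state.

```python
import math

# Hypothetical toy model: c1 prefers G, rep prefers A and may repeat,
# c2 prefers C. None of these numbers come from a trained model.
STATES = ["c1", "rep", "c2"]
START = {"c1": 1.0}
TRANS = {
    "c1": {"rep": 1.0},
    "rep": {"rep": 0.5, "c2": 0.5},   # self-loop lets rep be reused
    "c2": {},
}
EMIT = {
    "c1": {"G": 0.97, "A": 0.01, "C": 0.01, "T": 0.01},
    "rep": {"A": 0.97, "G": 0.01, "C": 0.01, "T": 0.01},
    "c2": {"C": 0.97, "A": 0.01, "G": 0.01, "T": 0.01},
}


def viterbi(seq, states, start, trans, emit):
    """Return the most probable state path for seq (log-space Viterbi)."""
    def lg(p):
        return math.log2(p) if p > 0 else float("-inf")

    # best[s]: log-probability of the best path ending in s at this position
    best = {s: lg(start.get(s, 0)) + lg(emit[s].get(seq[0], 0)) for s in states}
    back = []  # back[i][s]: best predecessor of s at position i + 1
    for ch in seq[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            cand = max(states, key=lambda r: prev[r] + lg(trans[r].get(s, 0)))
            best[s] = prev[cand] + lg(trans[cand].get(s, 0)) + lg(emit[s].get(ch, 0))
            ptr[s] = cand
        back.append(ptr)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Each position gets one state; the state names act as column labels.
for seq in ("GAC", "GAAAC"):
    print(seq, viterbi(seq, STATES, START, TRANS, EMIT))
```

For GAAAC the path is c1, rep, rep, rep, c2: three positions share the
rep state, so they cannot be assigned to three distinct columns of a
conventional alignment, but the model describes the repeat compactly.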
HMMs do have some weaknesses:
Figure 1: Information content of a set of seed sequences is plotted against
the size of the hidden Markov model used for encoding them (expressed
as the number of edges in the model).
The points in the upper curve are from models that have not been
trained; the points in the lower curve are from models that have been
trained and had useless edges and states removed.
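The information content plotted in such a figure can be read as an
encoding cost: the number of bits needed to encode a sequence is
-log2 of the probability the model assigns to it. The following is a
minimal Python sketch of that calculation using the forward algorithm,
under stated assumptions: the function name encoding_cost_bits and both
one-state toy models are hypothetical, there is no end state or null
state, and plain probabilities are used, which would underflow on long
sequences (practical code would rescale or work with logarithms).

```python
import math


def encoding_cost_bits(seq, states, start, trans, emit):
    """Return -log2 P(seq | model), summed over all state paths."""
    # fwd[s]: probability of emitting the prefix so far and ending in s
    fwd = {s: start.get(s, 0.0) * emit[s].get(seq[0], 0.0) for s in states}
    for ch in seq[1:]:
        fwd = {
            s: emit[s].get(ch, 0.0)
            * sum(fwd[r] * trans[r].get(s, 0.0) for r in states)
            for s in states
        }
    total = sum(fwd.values())
    return -math.log2(total) if total > 0 else float("inf")


# Hypothetical one-state models: uniform over bases, and biased toward A.
ONE = ["u"]
ONE_START = {"u": 1.0}
ONE_TRANS = {"u": {"u": 1.0}}
UNIFORM_EMIT = {"u": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}
BIASED_EMIT = {"u": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}}

uniform_cost = encoding_cost_bits("ACGT", ONE, ONE_START, ONE_TRANS, UNIFORM_EMIT)
biased_cost = encoding_cost_bits("AAAA", ONE, ONE_START, ONE_TRANS, BIASED_EMIT)
print(uniform_cost / 4, "bits/base under the uniform model")
print(biased_cost / 4, "bits/base for AAAA under the A-biased model")
```

The uniform model costs exactly 2 bits/base, the baseline for a
four-letter alphabet; a model that fits the sequences better costs less.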
Rey Rivera
Thu Aug 22 14:04:06 PDT 1996