Hidden Markov models were constructed using the REP99-gxn set as a seed. Cross-training was done using a randomly chosen half of the sequences for training and the rest for cross-training. The model that minimized cross-training cost divided by the log of the number of edges was chosen. The HMM chosen, REP99-gxn-cross, has 202 states and 282 edges and compresses the REP99-gxn set to 1.24 bits/base, with almost equal compression of the two halves. A similar script that did not consider larger models came up with the same model, but retuned it on the entire set of sequences, reducing the cost to 1.20 bits/base (REP99-gxn-small). Searching EcoSeq6 with these models takes about 300 seconds on a SparcStation 10, compared to about 32 seconds for searching with a simple Markov model.
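The selection rule above (cross-training cost divided by the log of the number of edges) can be sketched as follows. This is a minimal illustration and not the original scripts: the function names and the candidate figures are hypothetical, and the rule is applied exactly as the sentence states it. The base of the logarithm does not change which model is picked, because changing it rescales every score by the same positive factor.

```python
import math

def selection_score(cross_cost_bits_per_base: float, n_edges: int) -> float:
    """Cross-training cost divided by the log of the number of edges."""
    return cross_cost_bits_per_base / math.log(n_edges)

def choose_model(candidates: dict) -> str:
    """Return the name of the candidate with the smallest selection score.

    `candidates` maps a model name to (cross-training cost, number of edges).
    """
    return min(candidates, key=lambda name: selection_score(*candidates[name]))

# Invented figures for illustration only; they are not results from the text.
candidates = {
    "model-a": (1.30, 150),
    "model-b": (1.24, 282),
    "model-c": (1.23, 900),
}

if __name__ == "__main__":
    for name, (cost, edges) in candidates.items():
        print(name, round(selection_score(cost, edges), 4))
    print("chosen:", choose_model(candidates))
```

Read literally, the divisor favors the model with more edges when two models have the same cross-training cost.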
A smaller model was also built from the same seed: REP99-gxn-hmm125 built a model with 120 states and 169 edges, getting 1.367 bits/base.
I also tried some more complex scripts that attempted to merge states, remove unneeded states and edges, and do other model manipulation. Table 4 summarizes the sizes and cost/base of all the HMMs tried, and the Appendix lists the scripts used.
The largest model (REP99-gxn-hmm700b) has over half its edges as ``skip edges'' to allow for skipping a base in the sequence. These extra edges increase the cost for the seed set from 0.94 bits/base to 0.96 bits/base, but even this HMM cannot find REP60, which skips three normally crucial bases in the consensus sequence. An HMM that allowed null states (as Krogh's do [5]) might be able to recognize REP60, but Krogh's simple left-to-right models cannot be used for searching for repeated occurrences of a REP in a sequence, and my code has not been rewritten yet to handle null states.
The ``fromseq'' scripts use direct construction of an HMM from the first sequence in the seed (see Section 2.6), then add useful states and edges for the remaining sequences. These models are somewhat smaller than the models constructed from simple Markov models, but do not do quite as well on the searches.
Table 4: Sizes and encoding cost of REP99-gxn for the hidden Markov models
built from the sequences of REP99-gxn.
The results of searching with all these models (looking for sequences that were encoded 27.483 bits better than they would be at 1.97 bits/base) were expanded once with a simple Markov model (27.483 bits better than 1.99 bits/base). Table 3 summarizes the results for the HMMs directly and for the expanded sets (with an ``xn'' at the end of the name).
Some of the ``false positives'' may represent previously unrecognized REP sequences, and others may be conserved regions adjacent to REPs. Table 5 lists the sequences that were found repeatedly by distinctly different searches; all of these look like they are closely related to REP or REPv sequences. The three sequences that are not adjacent to an already known sequence are shown in Table 6.
Table 5: Sequences found by distinctly different searches, but not on the list of known REPs [13].
The codBecoM sequence is adjacent to REPv9, and
the ECOECOA-c sequence is adjacent to REP117.
But the ECOCHLEN-c, uspAeco, and gyrBecoM sequences
do not seem to be near any of the known REPs.
Table 6: The new possible REP sequences or fragments reported in Table 5 are listed here, using the earliest start and latest stop position for any of the searches.
Let's look at alignments of the three new potential REP sequences to the REP consensus. The first one is clearly the second half of a REP sequence.
cgcgtcttatcaggcct
 ****************
-RCGYCTTATCMGGCCTAC3'
Looking back a little bit extends the fragment to almost a REP, though the gap in the middle is longer than usual:
aaattg-ctgatg--acgtggcggagtgccgcgtcttatcaggcctggagg
     * ******  ****           ****************
   5'GCCKGATGCGRCGY-----------RCGYCTTATCMGGCCTAC3'
The second seems to contain a REP in the middle:
tggcgcgccttgttacctgat-cagcgtaaacaccttatctggcctacggtctgcgtacgcaatcaaaat
               ****** * ****  ** ***************
            5'GCCKGATGCGRCGY--RCGYCTTATCMGGCCTAC3'
The third seems to contain a somewhat corrupt REPv-:
ttttcgtagggcggataagcaccgcgc-atcc
     ***** * *******   **** ***
     GTAGGCCTGATAAGCGTAGCGCGATCAGGC
My HMMs seem to indicate a different consensus sequence for REPv:
YGCCKGATGCGCTACGCTTATCAGGCCTACR
without the C after the second T. The sequence found is an even better match to the complement of this sequence:
ttttcgtagggcggataagcaccgcgcatcccgacac
    ****** * *******   ******** * **
    YGTAGGCCTGATAAGCGTAGCGCATCM-GGCR
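The match lines and the complemented consensus in these alignments can be checked mechanically. The sketch below is a hypothetical helper, not the code used to produce the alignments: it assumes the standard IUPAC ambiguity codes (R = A/G, Y = C/T, K = G/T, M = A/C), treats "complement" as the reverse complement, and expects the two rows to be already aligned column by column.

```python
# Standard IUPAC nucleotide codes used in the consensus sequences.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT", "N": "ACGT",
}
# Complement of each code: A<->T, C<->G, R<->Y, K<->M; S, W, N map to themselves.
COMPLEMENT = str.maketrans("ACGTRYKMSWN", "TGCAYRMKSWN")

def reverse_complement(consensus: str) -> str:
    """Reverse complement of a consensus written with IUPAC codes."""
    return consensus.upper().translate(COMPLEMENT)[::-1]

def match_line(seq: str, consensus: str) -> str:
    """'*' where the base is allowed by the consensus code, ' ' elsewhere.

    Gap characters ('-') in either row never match. The rows are compared
    column by column, up to the length of the shorter row.
    """
    marks = []
    for base, code in zip(seq.upper(), consensus.upper()):
        marks.append("*" if base in IUPAC.get(code, "") else " ")
    return "".join(marks)

if __name__ == "__main__":
    repv = "YGCCKGATGCGCTACGCTTATCAGGCCTACR"
    print(reverse_complement(repv))
    found = "ttttcgtagggcggataagcaccgcgcatcccgacac"
    # The consensus row starts at the fifth base of the sequence found.
    print(match_line(found[4:], "YGTAGGCCTGATAAGCGTAGCGCATCM-GGCR"))
```

Apart from the gap character, the reverse complement of the proposed REPv consensus is the bottom row of the last alignment.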