Because both the computational cost of an HMM and the compression it achieves vary with the size of the model, determining the best size to use can be difficult. Usually, we want the best compression we can get for the interesting sequences that are not already in the training set.
To estimate this, we use a cross-training procedure. In cross-training, the initial set of seed sequences is split into two parts: the training set and the cross-training set. We build and tune models based on just the training set, then check them on the cross-training set, choosing the model that does best on the cross-training set. Note that this differs from cross-validation, where a check is made after all decisions have been made. If cross-validation is desired in addition, then the set of initial seeds must be divided into three sets.
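The split described above can be sketched as follows. This is only an illustration: the text does not say how the seeds are assigned to each part, so the random shuffle, the function name `split_seeds`, and the default fractions are all assumptions. The default of one half is chosen because it reproduces the 54-of-109 split reported in the figure captions.

```python
import random


def split_seeds(seeds, cross_fraction=0.5, validation_fraction=0.0, rng_seed=0):
    """Split seed sequences into (training, cross_training, validation) lists.

    Hypothetical helper: the random assignment and the fractions are
    assumptions, not the procedure used in the paper.
    """
    if not 0.0 < cross_fraction < 1.0:
        raise ValueError("cross_fraction must be strictly between 0 and 1")
    if not 0.0 <= validation_fraction < 1.0:
        raise ValueError("validation_fraction must be in [0, 1)")
    if cross_fraction + validation_fraction >= 1.0:
        raise ValueError("no sequences would be left for training")

    shuffled = list(seeds)
    random.Random(rng_seed).shuffle(shuffled)  # reproducible shuffle

    n_cross = int(len(shuffled) * cross_fraction)
    n_valid = int(len(shuffled) * validation_fraction)

    cross_training = shuffled[:n_cross]
    validation = shuffled[n_cross:n_cross + n_valid]
    training = shuffled[n_cross + n_valid:]
    return training, cross_training, validation
```

With `validation_fraction=0.0` the third list is empty, which corresponds to plain cross-training; a nonzero value gives the three-way division needed when cross-validation is also wanted.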
Figure 2 shows a typical plot of cross-training cost versus the size of the model. Note that increasing the size of the model, which nearly always decreases the cost for the training set, eventually starts overtraining and modeling those aspects of the training set that are not shared with the cross-training set. Since we want the smallest model that will get nearly optimal compression, we are usually interested in a model near the knee of the curve, say the one that maximizes the savings (in bits per base) relative to the null model for the cross-training set divided by the of the number of edges (Figure 3).
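The selection rule can be sketched as below. The text does not say which function of the number of edges appears in the denominator, so the sketch takes it as a parameter (`size_penalty`). The default of `math.log2` is purely an assumption, as are the function name and the tuple layout. The 2-bit/base null model is the one used in Figure 3.

```python
import math

NULL_MODEL_COST = 2.0  # bits/base, the null model used in Figure 3


def pick_knee_model(candidates, size_penalty=math.log2, null_cost=NULL_MODEL_COST):
    """Return the name of the candidate with the highest savings-per-size score.

    candidates: iterable of (name, num_edges, cross_cost), where cross_cost
    is the cross-training cost in bits/base.
    size_penalty: function of the edge count used as the denominator.
    The log2 default is an assumption, not the paper's stated choice.
    """
    best_name = None
    best_score = -math.inf
    for name, num_edges, cross_cost in candidates:
        penalty = size_penalty(num_edges)
        if penalty <= 0:
            raise ValueError(f"size penalty must be positive for {name!r}")
        savings = null_cost - cross_cost  # bits/base saved versus the null model
        score = savings / penalty
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

For instance, with made-up candidates `[("small", 50, 1.80), ("medium", 200, 1.50), ("large", 800, 1.45)]`, the largest model has the lowest cross-training cost, but the size penalty favors `"medium"`, which sits at the knee.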
Figure 2: Scatter diagram of information content (in bits/base) of cross-training set (54 sequences) versus model size, for HMMs built from the 109 sequences of REP99-gxn (Section 3.1).
Figure 3: Scatter diagram of the savings relative to a 2-bit/base null model (in bits/base) of cross-training set (54 sequences) versus model size, for HMMs built from the 109 sequences of REP99-gxn (Section 3.1).
After choosing the model using cross-training, it can be improved slightly by retuning on the entire initial set of seeds and flattening the probabilities. This preserves the structure of the model, but includes all the data in the tuning.