CMP 244 Homework 3

Due: Wed. May 15

HMMs

  1. Look at the hidden Markov model in the example on page 54 of the text. This an example of a hidden Markov model of a type that runs forever, without an end state. We did not discuss these in class. It doesn't have a begin state either. Change this to an HMM of the type we have been discussing in class by adding a begin state that emits no symbol and that leads to the fair die state with probability 0.99, and the loaded die state with probability 0.01. Add an end state that emits no symbol as well, and add a transition to the end state from both the fair die state and the loaded die state. Make the probability 0.1 for a transition into the end state, and adjust the probability of staying in the loaded die state to 0.8 to make up for this. Adjust the probability of staying in fair die state to 0.85. Now calculate the probability that you get exactly 4 four die rolls, the first 3 tosses are from a fair die, and they are 3, 1 and 5, and the fourth toss is from a loaded die, and it is 6.

  2. Calculate the probability of getting a pair of sixes in two die rolls from the HMM you constructed for the previous question. Note that here I am not specifying the hidden state, i.e. I am not telling you if each die is fair or not. You must consider the possible cases.

  3. Continuing with the same HMM, assume this time you roll two sixes then a three. Fill out a table for the forward algorithm, as described on page 58 of the text, to calculate the probability of this event. Recall that the book uses state "0" for both the begin and the end state of the model.

  4. Coninuing with the previous assumptions, calculate the probability that the second six is from a loaded die, given that you roll two sixes and a three. Show the forward and backward calculations, and the formula you use to calculate this probability.

  5. Read the paper "An Introduction to hidden Markov models for biological sequences" by Anders Krogh. He has an electronic copy on his publications page. In that paper a genefinding hidden Markov model that includes introns is given in Figure 4.12. Where it says "to stop model" and "from start model" he is referring to the parts of the model discussed previously for the case where there are not introns, as shown in Figure 4.11. To model genes with introns, what was the "coding region" model in Figure 4.11 is replaced with the entire Figure 4.12.

    In a correct gene interpretation, if you splice out the introns, and read the bases from the start to the stop codon, the number of bases should be a multiple of three, and there should be no in-frame stop codons (TAA, TAG, or TGA) before the real stop codon. Each of the codon models in this HMM, denoted by "ccc", use higher order Markov models in their states so that, e.g., the probability of the base in the last position of the codon depends on the previous two bases in the input sequence. The probability distributions are set so that the probability of a TAA, TAG, or TGA in such a codon model is zero.

    For any input DNA, this HMM assures that for every path, if you splice out bases considered to be in introns, the number of bases between the ATG state in the start codon model and the stop codon in the stop codon model is a multiple of three. However, it turns out that this model can sometimes still mistakenly allow a nonzero probability path for an input sequence that has an in-frame stop codon. Explain how this could happen. What would you propose to fix this, if anything? If you doubt this is possible, try to construct an input for the genefinder described in this article, HMMgene web server, that it interprets as having an in-frame stop codon. I haven't tried this, but I suspect you will need lots of good codons after the in-frame stop codon to tempt it to go on interpreting bases as coding.)

Last modified May 5, 2002.

Back to the CMP 244 Class Page.