CMPS 243 Homework 2, Winter 2002
(Last Update: 00:43 PST, 15 February 2002)
Due: Monday 25 Feb 2002
Estimating distributions of amino acids
This exercise tests your understanding of estimating discrete
probability distributions from a small sample. Apply each method to
the following seven count vectors:
- 1 isoleucine (rest zero)
- 1 isoleucine, 1 valine (rest zero)
- 1 isoleucine, 1 phenylalanine (rest zero)
- 1 isoleucine, 1 histidine (rest zero)
- 1 isoleucine, 1 aspartic acid (rest zero)
- 2 isoleucines (rest zero)
- 3 isoleucines (rest zero)
In addition to reporting the seven probability vectors estimated from
the counts, also report the entropy (in bits) for each.
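As a starting point, here is a minimal sketch of building a count vector and computing entropy in bits. The helper names (counts_from_residues, normalize, entropy_bits) are made up for this illustration, and the residue order is assumed to be the A C D ... Y order used for the pseudocount vector in this assignment. Terms with zero probability are skipped, following the usual convention that 0 log 0 = 0.

```python
import math

# Assumed residue order (same as the pseudocount vector in this assignment).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def counts_from_residues(residues):
    """Hypothetical helper: turn a string such as "IV" into a 20-long count vector."""
    counts = [0] * len(AMINO_ACIDS)
    for r in residues:
        counts[AMINO_ACIDS.index(r)] += 1
    return counts

def normalize(values):
    """Scale a non-negative vector so that it sums to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("cannot normalize a vector whose sum is not positive")
    return [v / total for v in values]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

if __name__ == "__main__":
    # The seven samples from the assignment, written as residue strings.
    for sample in ["I", "IV", "IF", "IH", "ID", "II", "III"]:
        probs = normalize(counts_from_residues(sample))
        print(sample, entropy_bits(probs))
```

The same entropy_bits function can be reused for every estimation method, since each one produces a 20-long probability vector.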
- Maximum-likelihood method. Normalize each count vector to
probabilities and compute the entropy. You may want to write a short
program to compute the entropy, since the summation is tedious to do
on a hand calculator.
- Pseudocounts (Dirichlet prior). Use the following three
pseudocount vectors and compute the mean posterior estimate and the
entropy for each. You may want to write a short program.
- all pseudocounts 1 (the uninformed prior)
- all pseudocounts 0.05 (a prior that favors conserved residues)
- the following pseudocounts, in the order A C D E F G H I K L M N P Q R S T V W Y:
0.0943893 0.0193798 0.0643931 0.0721827 0.0430317 0.0759875 0.0257852
0.0654119 0.0713842 0.0900674 0.0263642 0.0551132 0.0458893 0.047656
0.0544527 0.0767514 0.0693522 0.0798881 0.0136764 0.0385562
(This is available in /projects/compbio/lib/rev4.1comp)
- Dirichlet mixture. Write a program (in any programming language
that provides lgamma) to compute the mean posterior estimate from a
count vector and a Dirichlet mixture. Use the Dirichlet mixture
recode3.20comp
(also available in /projects/compbio/lib/recode3.20comp).
Note that the "DirichletReg" format provides 21 numbers in the
"Alpha" vector. The first number is the sum of the 20 pseudocounts,
and can be ignored.
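The two Bayesian estimators above can be sketched as follows. This is an illustration under stated assumptions, not a reference solution: the function names are made up here, the mixture is passed in as plain Python lists (reading the recode3.20comp file is not shown), and each alpha vector is assumed to already have the leading sum removed so that it holds 20 pseudocounts. For a single Dirichlet prior, the posterior mean is (n_i + alpha_i) / (|n| + |alpha|). For a mixture, each component's estimate is weighted by its posterior probability, which is proportional to q_j * B(alpha_j + n) / B(alpha_j), where B is the multivariate Beta function evaluated in log space with lgamma.

```python
import math

def posterior_mean(counts, alpha):
    """Mean posterior estimate under a single Dirichlet prior (pseudocounts)."""
    total = sum(counts) + sum(alpha)
    return [(n + a) / total for n, a in zip(counts, alpha)]

def log_beta(x):
    """Log of the multivariate Beta function: sum lgamma(x_i) - lgamma(sum x_i)."""
    return sum(math.lgamma(v) for v in x) - math.lgamma(sum(x))

def mixture_posterior_mean(counts, weights, alphas):
    """Mean posterior estimate under a Dirichlet mixture.

    weights: mixture coefficients q_j (assumed positive, summing to 1)
    alphas:  one pseudocount vector per component (leading sum already removed)
    """
    # Unnormalized log posterior weight of each component given the counts.
    log_w = [
        math.log(q)
        + log_beta([n + a for n, a in zip(counts, alpha)])
        - log_beta(alpha)
        for q, alpha in zip(weights, alphas)
    ]
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    estimate = [0.0] * len(counts)
    for wj, alpha in zip(w, alphas):
        for i, p in enumerate(posterior_mean(counts, alpha)):
            estimate[i] += (wj / z) * p
    return estimate

if __name__ == "__main__":
    # One isoleucine (index 7 in A C D ... Y order) with all pseudocounts 1.
    counts = [0] * 20
    counts[7] = 1
    print(posterior_mean(counts, [1.0] * 20))
```

With a one-component mixture, mixture_posterior_mean reduces to posterior_mean, which is a convenient sanity check before running the full mixture.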
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building