WARNING to BME100 students: this assignment is similar to, but NOT identical to, the assignment for BME100 last quarter. You will need to redo the hand computations and modify your program.
There is a nice review of dynamic programming on the web at http://www.blc.arizona.edu/courses/bioinformatics/salign.html . Also at that site is a copy of the BLOSUM62 matrix, though you might prefer the output from the program that created blast, as it is machine-readable and makes it clear that 1/2 bit units were used.
There is a nice tutorial on dynamic programming at http://www.sbc.su.se/~per/molbioinfo2001/seqali-dyn.html with pointers to an example at http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html. It uses the more expensive alignment algorithm that allows arbitrary gap costs for different length gaps, though only an affine gap cost is used. If you want a textbook presentation, the best I know is in Chapter 2 of Durbin, Eddy, Krogh, and Mitchison's book Biological Sequence Analysis.
As a reminder, the version of Needleman-Wunsch we presented in class used only the nearest neighbors, and needed 3 matrices:
| Mr, c = max { | Dr - 1, c - 1 + subst(Ar, Bc) |
| Mr - 1, c - 1 + subst(Ar, Bc) | |
| Ir - 1, c - 1 + subst(Ar, Bc) |
| Dr, c = max { | Dr , c - 1 - extend |
| Mr , c - 1 - open | |
| Ir , c - 1 - reopen |
| Ir, c = max { | Mr-1 , c - open |
| Ir-1 , c - extend | |
| Dr-1 , c - reopen |
For the above recurrence, you have to be careful about your boundary conditions. You want to start with a score of 0 at (0,0) [assuming your sequence numbering starts at 1] and with -infinity for all points with negative coordinates. The recurrence relations should then define the rest of the values.
For Smith-Waterman, use the same recurrence relations, but allow 0+subst(Ar, Bc) as an alternative in the Mr,c scoring. The boundary conditions change to -infinity for all points with non-positive coordinates.
I like to display the M, D, and I matrices on the same grid:
|
|
||||||||
|
|
Expand a little on the reasons for the two most common substitutions.
----LADEIAQ--LASEV--AQ
AAAAVAEDVARSTIA-ELSR--
Turn in source code and the output for running the program on the sequences of Questions 3, 4, and 5. It might be a good idea to use some longer sequences as test cases also. You could pick up a sequence from a database and run BLAST to get a sequence to align with it.
|
|
| Karplus's lab page | UCSC Bioinformatics research |
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building