UCSC BME 100 Fall 2002
Intro to Bioinformatics
(Last Update: 10:15 PST 11 March 2003)
This is a required course for bioinformatics B.S. majors and is
highly recommended for new graduate students (before taking CMPS 243
or CMPS 244). In fact, since CMPS 243 is likely to be quite a
different course this year, as new faculty member Carol Rohl is taking
it over, BME 100 has become even more important for new grad students.
For catalog copy and prerequisites, see the
main page for BME 100.
Who, When, and Where:
Instructor: Kevin Karplus (karplus@soe.ucsc.edu) http://www.soe.ucsc.edu/~karplus
Office hours: Th 11--12 315B Baskin Engineering
(This is subject to change, if it conflicts with too many
student schedules.)
TA: None this year.
Lectures: MWF 3:30-4:40 Social Science II, Room 137
One lab section a week is required:
F 11:30am Baskin Engineering 105
You must register for the lab and the course together---neither can be
taken without the other. WARNING: we may not get the times and
locations in the Registrar's time schedule---labs tend to get moved in
the first week to accommodate demands on Baskin Engineering 105 and to
avoid TA schedule conflicts.
Although you must register for the lab, attendance at lab sections
is optional---it will be a time when I'll be in the lab to help out
with Perl questions, with bioinformatics tools on the web, with
debugging, and with general help with the homework assignments.
Texts
There will be two required texts, plus additional readings that will
be distributed either on paper or via the Web:
- Programming Perl
Larry Wall, Tom Christiansen & Jon Orwant
latest edition
O'Reilly and Associates
- Considered the best single book on Perl---this is the
main reference work on the language, and every Perl programmer
should have a copy of it handy.
You may use other Perl tutorials or references, but I expect
you to have easy access to this one.
We will be covering just the basics of Perl, not open-source
packages like BioPerl, which you may wish to learn on your own.
- Biological Sequence Analysis: Probabilistic Models of Proteins and
Nucleic Acids from Cambridge University Press by
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison.
- This book is a tutorial introduction to the use of hidden Markov
models and other probabilistic models for sequence analysis problems
in computational molecular biology, but is aimed more at a
graduate-student audience. We've been using it for years in
both graduate bioinformatics courses, but Fall 2002 marks the first
year for using it in the undergraduate course. This is a grad text
and reference book that every bioinformatics programmer should have.
(Be sure to look at the errata page.)
- Darling models
- I am planning to add some assignments this year to build physical
models of peptides (and maybe DNA base pairs) using the Darling model
kits. These kits are
available over the web at http://www.darlingmodels.com/
I recommend getting the "NMSU Biochemistry" kit, though the cheaper
"protein alpha helix--pleated sheet" kit may suffice.
I have found these kits to give me a much better insight into protein
flexibility and rigidity than the standard ball-and-stick models used
in organic chemistry classes, and they are fun to play with.
To reduce costs, it is quite reasonable for students to share a kit.
Some initial instructions for building a protein backbone with this
model kit are now available.
Evaluation
There will be two types of assignments for the course and two types
for the lab.
The course will have reading assignments and pencil-and-paper
exercises; the lab will have programming exercises to learn Perl and
bioinformatics exercises using real data.
The same grade (and evaluation) will be given for both the course and
the lab.
Based on the first running of the course in Fall 2001, there will be
no exams.
It turns out to be very difficult to make up small enough problems for
examination---almost all the homework exercises are much larger
problems than could reasonably be given on a timed exam.
The assignments will be distributed on the web (see http://www.soe.ucsc.edu/~karplus/bme100/f02/homework.html).
The relative weights of the different types of
assignment in the evaluation have not been determined yet---they should
be roughly proportional to how much time the different assignments
take to do well. We will try to assign points to each assignment as
it is given, but the total number of points won't be known until we've
created all the assignments.
Academic Integrity
Anyone caught cheating in the class will be reported to their college
provost (see UCSC
policy on academic integrity) and may fail the class. Cheating
includes any attempt to claim someone else's work as your own.
Plagiarism in any form (including close paraphrasing) will be
considered cheating. Use of any source without proper citation will
be considered cheating.
Collaboration without explicit written acknowledgement will be
considered cheating.
Collaboration on lab assignments with explicit written acknowledgement
is encouraged---guidelines for the extent of reasonable collaboration
will be given in class.
Rough list of topics we'll probably cover (not necessarily in order)
Note: list has been updated throughout the quarter to reflect what
really happened.
- Quick review of the central dogma of molecular biology:
DNA->RNA->protein, bases, codons, amino acids
(3.5 lectures)
- Stochastic models, Bayes' rule, 0-order Markov chain
(0.5 lecture)
- First-order Markov model, pseudocounts.
(1 lecture)
- Converting arbitrary scores to stochastic models: P-value and E-value.
Brief discussion of Z-scores (Gaussian dist.) and fat tails of
extreme-value (Gumbel dist.)
(1 lecture)
- Interpreting classification results: true/false positives,
specificity, sensitivity, ROC curves
(1 lecture)
- Entropy, relative entropy, sequence logos.
(1 lecture)
- Mutual information
(1 lecture)
- Substitution matrices and sequence alignment scores.
(1 lecture)
- Aligning sequences to sequences, dynamic programming
We'll do the simple but inefficient algorithm (for
arbitrary gap costs) first.
(1 lecture: the alignment problem and global dynamic programming with
arbitrary gap costs)
(1 lecture: global dynamic programming with linear gap costs,
traceback)
(1 lecture: affine gap costs. Global and local dynamic programming)
- Introduction to Hidden Markov models
(1 lecture on HMMs)
(1.5 lectures on profile HMMs giving Viterbi algorithm)
(Could have been based on Rachel
Karchin's lecture (powerpoint slides), but weren't.)
- Training HMMs (one lecture)
(forward-backward algorithm)
- Multiple alignment techniques
Overview and progressive alignment (1 lecture)
T-Coffee (1 lecture)
Paper on T-Coffee:
T-Coffee: A novel method for fast and accurate multiple sequence alignment.
Notredame C, Higgins DG, Heringa J.
J Mol Biol 2000 Sep 8;302(1):205-17
- Library databases (training session by library staff).
New access method for PUBMED/MEDLINE, also covering BIOSIS and
"Web of Science" (Science Citation Index). (1 lecture)
- Protein secondary structure (DSSP and STRIDE), mutual
information, and entropy, in order to explain second track of
2-track HMM.
Discuss secondary structure prediction using neural nets (2 lectures).
- Phylogeny: brief mention of maximum-likelihood and parsimony.
Additivity assumption.
UPGMA algorithm presented, ultrametric assumption and molecular
clocks, intro to neighbor-joining (no proofs)
(1 lecture)
- RNA structure and Stochastic Context-Free Grammars
- Revisiting traceback in sequence-sequence alignment (everyone
had it wrong in the homework assignment, so I must have done a
poor job of explaining it the first time).
- Combining secondary structure, fold-recognition, and
new-fold methods for protein structure prediction.
Using the transparencies to be given at Schloss Dagstuhl.
I could have handed out the book chapter on SAM-T2K, but didn't.
- Guest lectures. (I will be gone for the last 5 lectures of
the quarter, but I have arranged some top-notch guest lectures to cover
for me.)
- Bill Scott: X-ray crystallography (2 lectures)
- David Haussler: topic to be determined.
- Melissa Cline
- Hui Wang: alternative splicing
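Since traceback in sequence-sequence alignment is singled out above as the step that caused the most trouble, here is a minimal sketch of global dynamic programming with a linear gap cost, followed by traceback. It is written in Python rather than the Perl used in the lab. The function name global_align and the scoring values (match=1, mismatch=-1, gap=-2) are assumptions chosen for illustration; they are not taken from the course assignments.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap cost (illustrative scores only).

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against all gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against all gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # a[i-1] against a gap
                              score[i][j - 1] + gap)       # b[j-1] against a gap

    # Traceback: start at the bottom-right cell and repeatedly step to a
    # predecessor cell whose score explains the current cell's score.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append('-')
            i -= 1
        else:
            out_a.append('-')
            out_b.append(b[j - 1])
            j -= 1
    # The columns were collected from the end, so reverse them.
    return score[n][m], ''.join(reversed(out_a)), ''.join(reversed(out_b))


if __name__ == "__main__":
    print(global_align("ACGT", "AGT"))
```

The traceback recomputes which of the three cases produced each cell instead of storing pointers; when several cases tie, this sketch prefers the diagonal step, so it reports only one of the possibly many optimal alignments.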
Also, don't forget the Second Biennial UCSC-QB3 Symposium on Bioinformatics:
Predicting the structure and function of proteins,
which will be just after the end of the quarter (Sat and Sun 7--8 Dec 2002).
Rough list of topics we didn't have enough time to do more than
briefly mention:
Other resources on the web
- Handouts for Rune Lyngsoe's summer 2002 course on bioinformatics
- User's Guide to the Human Genome (in Nature Genetics).
Questions about page content should be directed to
Kevin Karplus
Computer Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250