Approximately half the grade for this course will be based on the successful completion of a substantial project. Students are expected to turn in
Each student is expected to have a 30-minute weekly meeting with the instructor. These meetings will be used to discuss the project, aid in debugging, and help with homework.
Sufficiently large projects may be tackled by a group of students, but past experience has favored individual projects. Projects can be supervised by more senior graduate students or faculty other than the instructor---these are often the most successful projects. External supervision is in addition to, not in place of, the weekly meetings with the instructor.
This is a list of potential projects for students in Bioinformatics I. There are, of course, many other potential projects, and each student will be required to submit a detailed proposal for the project that he or she intends to do.
The best projects are usually ones that students are most motivated to do, so don't consider yourself constrained by this list---if you have a project idea you would like to pursue, propose it!
We need to create a new fold-recognition server to replace the SAM-T99 server. There are many improvements that we have made in the past couple of years that have not been put into a server yet, and many ideas that have not yet been implemented.
In order to make the server easy to extend, we are planning to use a Makefile (using GNU's version of make) to control the computation. This makes it very easy to add new features to the web site, separating the flow of control from the web interface. Rachel Karchin will be creating the web Makefile and web scripts, based on one of the Makefiles in /projects/compbio/experiments/protein-predict/ and on the SAM-T99 web site.
A more modest version of parameter tweaking would not change the underlying multiple alignments, but just the weighting of the amino-acid track and the predicted secondary structure track.
The fold-recognition tests I'm currently using are in ~karplus/pcef/fssp-test-01, and are mainly summarized in the Makefile, though there are descriptions of it in some of my recent talks.
It might be interesting to change the pre-filter method to generate psi-blast profiles from the HMMs, and use psi-blast as a prefilter at each iteration. This should allow us to detect slightly more remote homologs than our current pre-filtering strategy, without enormous increase in cost.
One warning: people think of psi-blast as very fast, but only the blast part is fast. When psi-blast has to realign many sequences to a profile, it gets quite slow also. It may be that we only need to use the BLAST part of psi-blast, if we generate psi-blast profiles from our HMMs. (Our HMM aligning is no faster than psi-blast's but it does seem to produce better alignments.)
I have several ideas for improving the secondary structure
predictor, some of which are trivial and some of which are difficult.
Some of the difficult ones are already being implemented by grad
students (in their spare time), but a collaboration to get the methods
finished and tested would be useful even on these. Here are a few of the
project ideas:
Our current methods are trained to minimize encoding cost
or maximize the Q3 measure (percent of residues with correct
classification). The EVA evaluation system uses the
segment-based SOV score as its primary criterion. We need to
figure out what sorts of errors contribute most to low SOV
scores (missed strands, missed helices, broken helices, broken strands,
merged helices, merged strands, ...). We can then increase
the training weight for particularly important examples (ones
that contribute more to the SOV score being bad), trading off
small losses of Q3 for (we hope) big gains in SOV. The
weighting could be based on the true structure, the predicted
structure, or errors in the prediction.
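As a starting point for the error analysis described above, here is a minimal sketch in Python. The function names (q3, segments, segment_errors) are hypothetical, and the code does not compute the SOV score itself: it only computes Q3 and flags each true strand or helix as missed or broken, which is the kind of tally one would need before choosing training weights. Merged segments are not detected by this sketch.

```python
from itertools import groupby


def q3(true_ss, pred_ss):
    """Fraction of residues whose predicted class matches the true class."""
    if len(true_ss) != len(pred_ss):
        raise ValueError("strings must have equal length")
    if not true_ss:
        return 0.0
    return sum(t == p for t, p in zip(true_ss, pred_ss)) / len(true_ss)


def segments(ss):
    """Return (label, start, end) runs of identical letters, end exclusive."""
    out, pos = [], 0
    for label, run in groupby(ss):
        n = len(list(run))
        out.append((label, pos, pos + n))
        pos += n
    return out


def segment_errors(true_ss, pred_ss):
    """Classify each true E or H segment as 'ok', 'missed', or 'broken'.

    'missed': no residue of the segment is predicted with its label.
    'broken': the correctly labeled residues form more than one run.
    """
    report = []
    for label, start, end in segments(true_ss):
        if label not in "EH":
            continue
        hits = [p == label for p in pred_ss[start:end]]
        runs = sum(1 for ok, _ in groupby(hits) if ok)
        kind = "missed" if runs == 0 else ("broken" if runs > 1 else "ok")
        report.append((label, start, end, kind))
    return report
```

Counting how often each kind of error occurs in low-SOV predictions would suggest which examples deserve a larger training weight.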
I have done some testing of different architectures with
one part of this data (training on part 1, crosstraining on
part 2, final test on part 3). Right now it looks like
increasing the number of parameters will cause more
overtraining, but we can distribute the parameters more
efficiently between the different layers. Perhaps, with a
better distribution of the parameters, we could increase the
number somewhat.
Simple idea---use a linear combination of the
probability vectors and optimize the weights. More
complicated idea---create a calibration curve for each method
to rescale the probabilities before averaging (or doing linear
combination).
Another idea: take a linear combination of the logs of the
probability vectors, and convert back to probabilities by
exponentiating and rescaling.
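The two simplest jurying ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each predictor is assumed to output one probability vector per residue over the same alphabet, the weights are assumed to be given (optimizing them is the actual project), and the function names and the small floor used to avoid log(0) are my own choices.

```python
import math


def normalize(values):
    """Rescale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def linear_combination(prob_vectors, weights):
    """Weighted arithmetic combination of probability vectors."""
    n = len(prob_vectors[0])
    mixed = [sum(w * p[k] for w, p in zip(weights, prob_vectors))
             for k in range(n)]
    return normalize(mixed)


def log_linear_combination(prob_vectors, weights, floor=1e-9):
    """Weighted sum of log probabilities, exponentiated and rescaled."""
    n = len(prob_vectors[0])
    logs = [sum(w * math.log(max(p[k], floor))
                for w, p in zip(weights, prob_vectors))
            for k in range(n)]
    top = max(logs)  # subtract the maximum for numerical stability
    return normalize([math.exp(x - top) for x in logs])
```

The log-linear form penalizes a letter heavily if any one predictor gives it a very low probability, so it may behave quite differently from the linear form when the predictors disagree.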
Another idea: provide a tool for re-calibrating the output of
a secondary-structure predictor. That is, produce a monotonic
function from the predictor output to the observed probability
that the prediction is correct. There may even be a separate
function for each of the predictor outputs. The predictor
outputs can then be rescaled by this function, and
renormalized to sum to 1. Combining these re-calibrated
outputs may be more robust than combining the uncalibrated
ones. Note: for the neural net, this recalibration does not
seem to be necessary---when the output of the network is p,
then the fraction of time it is correct is very close to p.
The other techniques we are using do not have such a good
internal calibration. [I believe that Spencer Tu has done
this for his method, but Mark Diekhans's method has not been
recalibrated.]
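A recalibration tool along these lines might look like the following Python sketch. It is deliberately crude and everything in it is an assumption rather than a description of any existing code: the calibration curve is a table of observed accuracy per output bin, monotonicity is forced with a running maximum (isotonic regression would be a more principled choice), and empty bins fall back to the bin midpoint.

```python
def fit_calibration(outputs, correct, n_bins=10):
    """Estimate observed accuracy per output bin, forced to be non-decreasing.

    outputs: predictor outputs in [0, 1] for one class.
    correct: booleans, True when that class was the right answer.
    """
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, ok in zip(outputs, correct):
        b = min(int(p * n_bins), n_bins - 1)
        counts[b] += 1
        hits[b] += 1 if ok else 0
    # Empty bins fall back to the bin midpoint (an arbitrary choice).
    curve = [hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
             for b in range(n_bins)]
    # Running maximum: a crude way to make the curve monotonic.
    for b in range(1, n_bins):
        curve[b] = max(curve[b], curve[b - 1])
    return curve


def recalibrate(prob_vector, curves):
    """Rescale each output by its own curve, then renormalize to sum to 1."""
    scaled = []
    for p, curve in zip(prob_vector, curves):
        b = min(int(p * len(curve)), len(curve) - 1)
        scaled.append(curve[b])
    total = sum(scaled)
    if total == 0.0:
        return [1.0 / len(scaled)] * len(scaled)  # nothing to go on
    return [s / total for s in scaled]
```

One curve would be fitted per predictor output on held-out data, and the recalibrated vectors could then be fed to whatever jurying method is being tested.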
The jurying could be built into the predict-2nd program by
adding a new MultipleNet class that can combine several
networks that have identical input and output interfaces.
I have built at least a dozen networks for predicting STRIDE EBGHTL
alphabet, trained from scratch on the same training data.
I have not yet
determined whether combining multiple networks would improve
the predictions.
We could investigate several different jurying methods:
We could also create a different HMM-based method that is
only useful for this sort of post-processing. It would be
trained on secondary structure strings alone using labeled
states (no amino acid information and states emit exactly one
of E, H, or L). The probability vectors from the neural net
could be aligned to the 2ry HMM using forward-backward, and
new probability vectors read off by posterior decoding (in
each position use the sum of the probabilities of all states
with a particular label). This should help eliminate the
one-long helices and other ridiculous results that sometimes
come up from residue-at-a-time prediction. If the HMM is well
designed, it may even be able to look for common patterns, such
as the helix-loop-strand-loop repetition of alpha-beta proteins.
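To make the post-processing idea concrete, here is a small Python sketch of forward-backward posterior decoding against a labeled-state HMM. It rests on several assumptions that are mine, not part of any existing program: the network's probability for a state's label is used directly as that state's emission probability (treating posteriors as if they were likelihoods), there is no explicit end state, and the toy model below has only loop and helix states with a minimum helix length of three.

```python
def posterior_decode(prob_vectors, labels, start, trans):
    """Return, for each position, the posterior probability of each label.

    prob_vectors: one dict (label -> probability) per residue.
    labels: the label emitted by each HMM state.
    start: initial state probabilities; trans[i][j]: transition i -> j.
    """
    n_states, length = len(labels), len(prob_vectors)
    emit = [[vec[lab] for lab in labels] for vec in prob_vectors]

    # Scaled forward pass.
    fwd, scale = [], []
    for t in range(length):
        if t == 0:
            col = [start[s] * emit[0][s] for s in range(n_states)]
        else:
            col = [emit[t][j] * sum(fwd[t - 1][i] * trans[i][j]
                                    for i in range(n_states))
                   for j in range(n_states)]
        total = sum(col)
        if total == 0.0:
            raise ValueError("no state path is compatible with the input")
        scale.append(total)
        fwd.append([x / total for x in col])

    # Scaled backward pass, using the same scale factors.
    bwd = [[1.0] * n_states for _ in range(length)]
    for t in range(length - 2, -1, -1):
        bwd[t] = [sum(trans[i][j] * emit[t + 1][j] * bwd[t + 1][j]
                      for j in range(n_states)) / scale[t + 1]
                  for i in range(n_states)]

    # Sum state posteriors over all states that share a label.
    result = []
    for t in range(length):
        post = {}
        for s, lab in enumerate(labels):
            post[lab] = post.get(lab, 0.0) + fwd[t][s] * bwd[t][s]
        result.append(post)
    return result


# Toy model: one loop state, then three helix states (the last one loops),
# so a helix that starts and ends inside the chain has at least 3 residues.
LABELS = ["L", "H", "H", "H"]
START = [0.9, 0.1, 0.0, 0.0]
TRANS = [[0.8, 0.2, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0],
         [0.2, 0.0, 0.0, 0.8]]
```

With this toy model, an isolated residue that the network calls helix with probability 0.6 in the middle of confident loop predictions is pulled back to loop, which is the behavior wanted for one-long helices.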
We could also train on a larger (but more redundant) set of
alignments---such as all the SAM-T2K template alignments we've
built (currently about 5575 chains).
To handle both DSSP and STRIDE training data, we need to
modify the input routines, since DSSP and STRIDE assign
secondary structure for slightly different subsets of the
residues. We should probably accept a secondary structure
string that has either the same number of characters as the
number of columns in the multiple alignment (the current
requirement) OR has the same number of characters as the guide
sequence. Then the secondary structure string can be aligned
to the input multiple alignment by pairing it up with the
guide sequence.
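The pairing with the guide sequence could be as simple as the following Python sketch. The function name ss_to_columns is hypothetical, and the sketch assumes the guide sequence is available as one alignment row with "-" or "." marking gaps; it is not the input code of predict-2nd.

```python
def ss_to_columns(guide_row, ss, gap_chars="-.", fill="-"):
    """Expand a secondary-structure string to one character per column.

    guide_row: the guide sequence as it appears in the alignment,
               with gap characters where it has no residue.
    ss: a string with either one character per alignment column
        or one character per guide-sequence residue.
    """
    n_cols = len(guide_row)
    n_res = sum(1 for c in guide_row if c not in gap_chars)
    if len(ss) == n_cols:
        return ss  # already column-based (the current requirement)
    if len(ss) != n_res:
        raise ValueError(
            "expected %d (columns) or %d (residues) characters, got %d"
            % (n_cols, n_res, len(ss)))
    letters = iter(ss)
    return "".join(fill if c in gap_chars else next(letters)
                   for c in guide_row)
```

For example, ss_to_columns("AC-DE.F", "HHEEL") pairs the five letters with the five guide residues and leaves the two gap columns unlabeled.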
Another approach is to build separate predictors for the
STRIDE and DSSP alphabets and report both for the web site.
We could then use a 3-track HMM for fold recognition. This is
the direction that Rachel Karchin and I are currently pursuing.
There are some difficulties in deciding how to handle gaps
in the multiple alignment. The simplest is to use FSSP
representative sequences as the targets, and to use only the
secondary structures that are aligned at a particular residue.
Another approach would be to have the neural net predict a gap
character as well as the normal {E, H, L}, though it is not
clear how that gap character would be used in the multi-track HMM.
The goal of this project is to come up with other structural
features and methods for predicting them. We are looking for a
one-dimensional representation here (some feature for each residue)
that can be used with HMMs. This project can vary enormously in
scope, from quick checks of a single proposed feature set to
Ph.D. dissertations. In fact, Rachel Karchin has proposed a
substantial investigation of local structure features for her
Ph.D. thesis (see /projects/compbio/papers/karchin/thesis-proposal.ps).
Here are some possible
criteria for choosing features:
Here are some ideas for possibly useful
structural features to predict:
The idea here is to look at what the probability is for the
secondary structure predictor predicting a H, E, or L as the most
probable letter in a specific context (such as mid-strand,
beginning of strand, end of strand, ...). The appropriate
probabilities can then be used in building the secondary structure
HMM. The creation of the substitution matrix should probably be
automated so that (1) the definitions of the contexts are easily
changed and (2) new secondary structure predictors can have
substitution matrices calculated for them quickly. The
substitution matrices would actually be quite useful in designing
and debugging secondary structure predictors, so should be a
stand-alone tool.
Another way to create 2-track HMMs for template models is to
use probability vectors in the sequence and labels in the states
(instead of vectors in the states and labels in the sequence). In
this way we could use the predictions of secondary structure for the
target sequence with the known secondary structure sequence for
both the target and the template models. This method would require
some (fairly modest) changes to the inner loop of SAM's hmmscore
program, and a number of I/O changes.
I have created a generator
for amino-acid sequences that I believe creates better
"random" protein-like sequences than standard methods.
This generator is now built into hmmscore.
We could still use a generator for paired sequences (amino acid and secondary
structure) for calibrating multi-track HMMs. There are several
approaches for generating such pairs, and I don't know which will
be best:
We would probably need a set of tools for dealing with these
segment-based HMMs, so that we could train them on real 2ry
strings and build a generator based on them. There may also
be some applications of such HMMs to secondary-structure
prediction, post-processing the output of residue-at-a-time
predictors.
A simpler version of this approach would generate the segment
labels as almost independent draws (prohibiting duplicating
the same letter twice in a row), then use the segment length
distributions to generate the whole strings from the segment
labels.
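The simpler version is easy to prototype. The Python sketch below is only illustrative: the label frequencies and segment-length distributions are made-up placeholders (real ones would be estimated from known structures), and the function name is hypothetical. A production version would presumably be written in C for inclusion in SAM.

```python
import random


def generate_ss_string(n_segments, label_probs, length_dists, rng):
    """Draw segment labels almost independently, then draw segment lengths.

    label_probs: dict label -> relative frequency.
    length_dists: dict label -> dict (length -> relative frequency).
    rng: a random.Random instance, so that runs are reproducible.
    Returns (list of (label, length) segments, the expanded string).
    """
    segs = []
    prev = None
    for _ in range(n_segments):
        # Prohibit the same letter twice in a row.
        choices = [lab for lab in label_probs if lab != prev]
        weights = [label_probs[lab] for lab in choices]
        label = rng.choices(choices, weights=weights)[0]
        lengths, length_weights = zip(*sorted(length_dists[label].items()))
        length = rng.choices(lengths, weights=length_weights)[0]
        segs.append((label, length))
        prev = label
    return segs, "".join(lab * n for lab, n in segs)


# Placeholder parameters, for illustration only.
LABEL_PROBS = {"E": 0.25, "H": 0.30, "L": 0.45}
LENGTH_DISTS = {
    "E": {3: 1, 4: 2, 5: 2, 6: 1},
    "H": {4: 1, 8: 2, 12: 2, 16: 1},
    "L": {1: 2, 2: 2, 3: 1, 5: 1},
}
```

Calling generate_ss_string(20, LABEL_PROBS, LENGTH_DISTS, random.Random(0)) gives one random secondary-structure string of 20 segments.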
This approach should work well for 2-track HMMs (since the
amino acid track and the secondary-structure track have only
about 0.1 bits of mutual information when considered a column
at a time), but not for multi-track HMMs with multiple
variants of the secondary structure alphabet (since these will
be highly correlated).
In addition to generating the sequences (ideally entirely with C
code, so that the sequence generator can be easily incorporated into
SAM), we have to fit the parameters of the E-value computation to the
observed distribution of scores. I'll cover in class the derivation
of the E-value computation we use for the reverse-sequence null, and
show how to fit lambda very easily. We can generalize the fitting,
introducing an extra variable as an exponent on the scores. Fitting
this two-parameter model requires some iteration, but is still fast.
The two-parameter model we are fitting has no theoretical justification.
It would be nice to come up with a heavy-tailed distribution that fit
the data well and had some justification.
Our current best results are from moment-matching to fit the
distribution of database sequences, but it would be nice to use a
maximum-likelihood (or maximum posterior probability) approach instead.
This would probably require a different family of heavy-tailed distributions.
We have found that HMMs built from structural alignments do a
better job of aligning remote homologs (not included in the
structural alignment) than HMMs built from SAM-T99 alignments.
Certainly any method that selects between alignments should
include alignments built from structurally aligned templates.
Note: this annotation is straightforward, but not completely trivial.
The databases generally contain DOMAINS, not whole chains. Therefore,
to avoid incorrect labeling, one has to examine the alignment between
the target and the template and determine which domains of the
template are matched by the target. The SCOP database may be the best
to use for this purpose, though we will have some templates in the
library that are not yet in SCOP.
To avoid this problem, we could try to split the target up into
separate domains, scoring each one separately. The hard part comes in
figuring out where to put domain boundaries. Some possible methods include
One could try several different (possibly even overlapping) domain
divisions, then select the one that gets the best matches to domain templates.
One problem with the product-of-p-values method is that all
the templates in a fold family are treated equivalently. In
practice there may be several templates from one subfamily and
only one from another. Ideally, a scheme should be developed that
can weight each template appropriately.
The algorithm is quite simple: keywords and text are extracted
from the best-matching Swissprot entries of the target and templates,
and dot-products of their frequencies of words (excluding
non-discriminating words) are used as a similarity measure.
Actually, one usually normalizes to a "cosine" similarity measure: a.b / (sqrt(a.a) sqrt(b.b)).
The project would be either to re-implement and test the SAWTED
algorithm here or to develop a similarly simple algorithm and test it.
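A similarly simple algorithm could start from something like the Python sketch below. It is not the SAWTED implementation: the tokenizer, the tiny stop-word list standing in for the "non-discriminating words", and the function names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Stand-in for a real list of non-discriminating words.
STOP_WORDS = frozenset({"the", "of", "and", "a", "in", "protein"})


def word_counts(text, stop_words=STOP_WORDS):
    """Bag-of-words frequencies, ignoring case and stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in stop_words)


def cosine_similarity(a, b):
    """a.b / (sqrt(a.a) sqrt(b.b)) for two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

The text passed to word_counts would be the keywords and description lines extracted from the best-matching Swissprot entries of the target and of each template.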
A possible variation: use the SAWTED score as a kernel function for an
SVM multi-class classifier.
The fold prediction method should be evaluated by itself and as a
complement to an existing method (such as SAM-T99).
(Possible Ph.D. project: developing a more sophisticated way to handle
functional information.)
Some preliminary results were obtained last year and this
method looks promising. A possible starting point for a scoring
function is to take the E-value from the 2-track t2k-1-.3-ebghtl
fold recognition results and divide by the "cosine" similarity measure
raised to some power.
We need to look into more sophisticated
representations than just a "bag of words". Possible improvements
include stemming and multi-word phrases. The version last year
only extracted information from the single best-matching Swissprot
entry, when we probably should look at the top several Swissprot
entries.
Another approach to aligning two sequences is to take the
component-wise product of the posterior decoding matrices for
sequence A against HMM B and sequence B against HMM A, and to do
posterior decoding of the resulting matrix.
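One possible reading of this idea is sketched below in Python. The assumptions are mine: each HMM position is taken to correspond to one residue of the sequence the HMM was built around, so the two matrices can be multiplied element by element after transposing one of them, and "posterior decoding of the resulting matrix" is interpreted as finding the monotone path that maximizes the summed products, with no gap penalty.

```python
def align_from_posteriors(post_ab, post_ba):
    """Align two sequences from a pair of posterior matrices.

    post_ab[i][j]: posterior that residue i of A aligns to position j
                   of HMM B (sequence A scored against HMM B).
    post_ba[j][i]: the same for sequence B scored against HMM A.
    Returns a list of aligned index pairs (i, j).
    """
    n, m = len(post_ab), len(post_ba)
    prod = [[post_ab[i][j] * post_ba[j][i] for j in range(m)]
            for i in range(n)]

    # best[i][j]: best total for the first i residues of A and j of B.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + prod[i - 1][j - 1],
                             best[i - 1][j],
                             best[i][j - 1])

    # Trace back, pairing residues only where the product is positive.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        diag = best[i - 1][j - 1] + prod[i - 1][j - 1]
        if best[i][j] == diag and prod[i - 1][j - 1] > 0.0:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

Because the two matrices are multiplied, a residue pair scores well only if both directions of scoring agree on it, which is the point of the method.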
In particular, we look for features like the compactness of the
resulting model, whether internal strands of beta sheets are missing
(though surrounding strands are present),
whether active-site residues are preserved, whether
identical residues cluster in 3-space, whether they form stripes
across beta sheets, and so forth.
Some of these checks can be automated (such as the compactness test).
It would be good to automate some of the checks, and test them to see
if they improve the fold-recognition capabilities of the SAM-T2K method.
The first checks I would attempt to automate are radius of gyration
and contact order, both of which are easy to compute from a
structure. We'd have to get histograms of these measures for real
structures and for "good" predictions, to see what to expect.
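Both measures take only a few lines to compute. The Python sketch below makes simplifying assumptions that should be stated loudly: it uses one point per residue (for example the C-alpha atom) with equal masses, and it defines a contact by a distance cutoff between those points (8 Angstroms here, an arbitrary choice), whereas contact order is usually computed from all-atom contacts. The production version would belong in the C++ code.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    total = sum(sum((c[k] - center[k]) ** 2 for k in range(3))
                for c in coords)
    return math.sqrt(total / n)


def relative_contact_order(coords, cutoff=8.0, min_sep=1):
    """Mean sequence separation of contacting residues, divided by length.

    coords: one (x, y, z) point per residue, in chain order.
    cutoff: contact distance threshold (an assumption of this sketch).
    min_sep: smallest sequence separation counted as a contact.
    """
    n = len(coords)
    total_sep, contacts = 0, 0
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                total_sep += j - i
                contacts += 1
    return total_sep / (contacts * n) if contacts else 0.0
```

Running both functions over a set of real structures and a set of predicted models would give the histograms needed to decide what counts as a plausible value.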
Other 3D scoring techniques (such as threading score functions or
fragment-packing score functions) could also be used to filter the top
hits from the HMM scoring.
One would have to be careful with any scoring function, since the
sizes of the alignments can be quite different and that can affect the
scores more than the quality of the prediction does.
Improve secondary structure prediction
Predicting Local Structure
Parameter tweaking for improved use of local structure prediction
Multi-track HMMs for template models
Calibrating HMMs
Improve template HMMs by using structural alignments as seeds
Labeling template hits by SCOP family and FSSP representative
Improve template selection by dividing into domains
Improve template selection by combining information from multiple templates
Improve template selection with functional information
a.b / (sqrt(a.a) sqrt(b.b))
Improved template selection and alignment with HMM-HMM alignment
Improved target/template alignment
Selecting good target/template alignments
Viewing predictions
If several people pick projects from this list, we will probably have to start using cvs to manage the undertaker source code, or we'll never get all the pieces integrated.
Along the same lines, undertaker currently cannot do much with one of the most common modified residues in PDB files, pyroglutamic acid (PCA), which is usually modified from glutamic acid (GLU). It might be interesting to examine what would be required to handle it as well. (Of course, we rarely have the information that a particular residue has been modified until after the structure is solved, so I'm not sure when undertaker will be able to use the information in a real prediction.)
This integration is a relatively small project, but other improvements could be made at the same time, such as cleaning up the way that temporary files are created and using the improved PDB reading suggested above. A bigger change would be to provide side-chain replacement without having to call SCWRL. It turns out that one can do almost as well for an initial guess just by using the most probable rotamer from Dunbrack's backbone-dependent library (except for the residues that are identical, where you just copy the residue from the template). It might be good to use Dunbrack's library directly. In any case, the version of SCWRL we are using is a bit out of date, and should probably be replaced by a newer one from Dunbrack.
Also, right now undertaker does not really distinguish between the short fragments found by Rachel's program and the big alignments slicer is really intended for. It might be good to keep the two sorts of chunks in separate data structures in undertaker (probably still using the AlignedFragments class, but with two different variables in the Globals class, and with different operations in the genetic algorithm that uses the alignments or fragments).
If the residues are marked, then the FragmentLibrary will not include those residues in generic fragments. The "rotamer" library would also exclude the bad residues.
In software, we might be better off binning the atom locations and only checking atoms in those bins that are sufficiently close to the center of the sphere we are interested in. I serialized the clash computation (Clash.cc) in this way and got much better performance with the serial version than the parallel one (of course, the parallel algorithm there was a much slower one than the burial computation).
A serial version of the burial function is probably desirable even if it is slower than the parallel one, as it is easier to distribute software than hardware. We could run a serial version on the new cluster, but we don't have 1000 kestrels.
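The binning idea can be illustrated with a short Python sketch (the real version would go into the C++ code; the function names here are hypothetical and not taken from Clash.cc). Atoms are hashed into cubic cells, and a query only looks at the cells that could contain atoms within the requested radius of the center.

```python
import math
from collections import defaultdict


def build_bins(points, cell):
    """Map each cubic cell (ix, iy, iz) to the indices of points inside it."""
    bins = defaultdict(list)
    for idx, p in enumerate(points):
        bins[tuple(math.floor(x / cell) for x in p)].append(idx)
    return bins


def neighbors_within(points, bins, cell, center, radius):
    """Indices of points within radius of center, checking nearby cells only."""
    reach = math.ceil(radius / cell)  # how many cells to look in each direction
    cx, cy, cz = (math.floor(x / cell) for x in center)
    found = []
    for dx in range(-reach, reach + 1):
        for dy in range(-reach, reach + 1):
            for dz in range(-reach, reach + 1):
                for idx in bins.get((cx + dx, cy + dy, cz + dz), ()):
                    if math.dist(points[idx], center) <= radius:
                        found.append(idx)
    return sorted(found)
```

A burial count for an atom is then just the length of the list returned for a sphere centered on that atom. With a cell size close to the sphere radius, each query touches 27 cells instead of every atom in the structure.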
The project here is to add the necessary data structures and operations to handle quaternary structure, and to come up with some heuristics for generating plausible transformations. One simple heuristic would be to determine a transformation from a plausible multimeric template. Another might be to conjecture that certain exposed hydrophobics need to touch.
We could also read in Dunbrack's Backbone-Dependent Rotamer Library and evaluate our score function on each rotamer, and compare our assessment of probability with his. Where there are big discrepancies, we'll probably have to improve the way we handle archetypes, either by using a different representation or by clustering differently. The quality of the rotamer library could depend also on how well "bad" residues are rejected (see the project on rejecting bad residues above).
Undertaker has the ability to call SCWRL to set the rotamers. I have done some informal comparisons of SCWRL and our "rotamer" selection method, and found that SCWRL does somewhat better with the correct backbone. I have not examined which does better when the backbone is a little bit off (a more realistic test). We will need to install the latest version of SCWRL for this to be a fair comparison.
Using a loop library would probably require having something that does sidechain replacement on the backbone fragment (see the project above about integrating slicer into undertaker).
Note: Baker's group did very well at loop modeling in CASP4 by using fragment packing without a loop library. They did, however, have the ability to freeze parts of the backbone and use the fragment packing only on the loop regions. This should not be too hard to add to undertaker.
The max-matching algorithm may have some undesirable side-effects. For example, CYS residues often cluster because they are coordinating a metal ion, rather than because they are in disulphide bridges. This might be fixed by adding another bonus value for cys clusters that are more widely spaced than disulphide bridges (and excluding any known or predicted bridges from being in such clusters).
It would be particularly good if we could do the fitting across several different structures, to avoid target-specific biases. To do this well, we would need a badness function that has similar scaling in different targets. Since the score function is a sum of terms (with terms for each residue), it will tend to grow linearly with the chain length. That means we probably want the badness to grow linearly with chain length also.
For the least-squares fitting, we probably want to include a QR decomposition routine into undertaker. In the past we used lapack++, clapack, and blas, but we might want to find a smaller, more easily incorporated way to do the QR decomposition. For a very simple proof-of-concept, a preliminary version could be done using an external least-squares fitter, but this might make some things more difficult. For example, we might want to fit the weights of the score terms, then generate a bunch more decoys by trying to optimize with the "improved" score function. Certain terms may not have mattered much in the original set of decoys, because the phenomena they were supposed to detect did not occur, or were masked by other problems.
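For the proof-of-concept stage, a least-squares fit by QR decomposition is small enough to write directly. The Python sketch below uses modified Gram-Schmidt, which is a swapped-in textbook technique chosen for brevity (a Householder-based routine would be more robust numerically); it assumes one row of score-term values per decoy, one badness value per decoy, and a matrix of full column rank.

```python
import math


def qr_least_squares(rows, targets):
    """Return the weights w minimizing ||A w - b||, with A given by rows.

    rows: one list of score-term values per decoy.
    targets: the badness value of each decoy.
    """
    m, n = len(rows), len(rows[0])
    cols = [[rows[i][j] for i in range(m)] for j in range(n)]
    q = []                                  # orthonormal columns of Q
    r = [[0.0] * n for _ in range(n)]       # upper-triangular R
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            r[k][j] = sum(q[k][i] * v[i] for i in range(m))
            v = [v[i] - r[k][j] * q[k][i] for i in range(m)]
        r[j][j] = math.sqrt(sum(x * x for x in v))
        if r[j][j] < 1e-12:
            raise ValueError("score terms are linearly dependent")
        q.append([x / r[j][j] for x in v])
    # Solve R w = Q^T b by back substitution.
    qtb = [sum(q[j][i] * targets[i] for i in range(m)) for j in range(n)]
    w = [0.0] * n
    for j in range(n - 1, -1, -1):
        tail = sum(r[j][k] * w[k] for k in range(j + 1, n))
        w[j] = (qtb[j] - tail) / r[j][j]
    return w
```

Refitting after generating new decoys with the "improved" score function only requires appending rows and calling the routine again.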
It might be interesting to create one of these other conformation generators, and see how it performs on the same score functions as are used in undertaker. Doing so may require that scoring and conformation generation be a bit more decoupled than they currently are in undertaker, but the code is fairly close to decoupling the generation from the scoring.
A variant of this project for the non-computer people: Pick a protein or protein family whose structure has not been solved yet, and use existing tools (here and on the web) to do the best job you can of predicting the structure, homologs, evolutionary relationships, active site residues, function, ... of the protein. Prepare the result either as a proposal for specific wet-lab work to verify computational hypotheses or as a comprehensive web page for a lab that is studying the protein. This project will require a lot of library work and use of web tools (for example, the metaserver at bioinfo.pl), but not much original programming. This is an end-user's view of bioinformatics, rather than a tool-developer's view.
Last year, Lydia Gregoret expressed an interest in creating a web site for "all things cold shock protein related". She said
There might be links to
among other things. An example site on lectins can be found at
http://www.cermav.cnrs.fr/databank/lectine/
This is not exactly what I had in mind, but a start.
The student would learn about the relationship between protein sequence, structure, and function, and what kinds of information are out there, and would also gain experience in scientific web page design.
Lydia is no longer at UCSC, but other researchers in biology or chemistry may want web pages for their favorite protein families---ask around!
Rachel Karchin has used support vector machines (SVMs) to build subfamily recognizers for the subfamilies of the GPCRs (G-protein-coupled receptors). She started with the methods and code developed by Tommi Jaakkola and Mark Diekhans, but has since rewritten the SVM programs in Java for easier modification and portability.
Projects here include building SVMs for classifying other families of proteins (using existing tools), trying to improve the SVMs by reducing the large feature vectors, experimenting with different ways of generating feature vectors, trying other ways to use the feature vectors besides SVMs, ... .
Classifying proteins by SVMs is an attractive problem, because we usually have a fairly small set of positive training examples. The big problem is that we often have an extremely large number of negative examples. We might be able to make faster, more accurate SVMs by doing pre-screening with the HMM used to create the feature vectors. If we pick a threshold for the HMM score that is loose enough to include all the positive training examples and about 100 false positives, and use the SVM only for classifying the sequences that pass this HMM test, then the SVM only needs to learn the difficult cases that really are near the boundary. This should result in smaller classifiers (fewer support vectors) that do as good a job or better and are much faster to train [training time is generally quadratic in the number of training examples].
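Choosing the pre-screening threshold is straightforward once the HMM scores are in hand. The Python sketch below assumes that lower scores are better (as with E-values or negative log-odds costs); the function name and the convention that a sequence passes when its score is at most the threshold are my own.

```python
def prescreen_threshold(pos_scores, neg_scores, n_false_pos=100):
    """Loosest-needed HMM threshold, assuming lower scores are better.

    The threshold admits every positive training example and at least
    n_false_pos negative examples (or all of them if there are fewer).
    """
    worst_pos = max(pos_scores)
    ranked = sorted(neg_scores)
    k = min(n_false_pos, len(ranked))
    kth_neg = ranked[k - 1] if k else worst_pos
    return max(worst_pos, kth_neg)


def passes(score, threshold):
    """True if a sequence should be handed on to the SVM."""
    return score <= threshold
```

Only the sequences that pass would be used to train the SVM, and at prediction time anything that fails the HMM test would be rejected without consulting the SVM at all.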
One particularly useful project would be to build a library of classifiers (or a multi-family classifier) for all the SCOP superfamilies. This could be used as a fold recognizer, followed by HMM-based aligners to produce the alignments. If we could get this automated, I think it would make an excellent new structure prediction server. In version 50 of SCOP there are 934 superfamilies, ranging in size from one structure (160 such superfamilies) to 1822 (for superfamily 2.1.1--immunoglobulins). Other heavily populated superfamilies include lysozyme-like (4.2.1, 648 members), globin-like (1.1.1, 612 members), trypsin-like serine proteases (2.44.1, 510 members), NAD(P)-binding Rossmann-fold domains (3.2.1, 454 members), Thiamin diphosphate-binding fold (THDP-binding) (3.31.1, 421 members). [Note: a SCOP classifier would be of use in the fold-recognition experiments for CASP.]
This project would probably consist of a detailed exploration of the feature vectors for one difficult classification problem (perhaps one of the GPCR subfamilies that Rachel Karchin has worked on). We would want to look for ways to measure the value of a particular feature, then either weight features or select a subset of them, train an SVM using the modified feature set, and compare its performance with an SVM trained on the full feature vector.
Here are some preliminary thoughts on ways to look for informative features:
Another variant on the radial-basis kernel that I want to explore (perhaps with Rachel Karchin) is replacing the "sigma^2" scaling factor with a separate scaling factor for each vector in the training set: K(i,j) = exp(-0.5 ||fi-fj||^2 / (sigma_i sigma_j)). The scaling factor could be set to the distance from the vector to the closest vector with a different label. For unlabeled (test) data, the scaling factor could be set to the maximum over the possible classes of the minimum distance from the test vector to the training vectors in the class. Having different scaling for different training vectors may help generalize the classification better, since the kernel won't drop to zero so quickly away from the data, except in the neighborhood of the boundary.
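The kernel and the two rules for setting the scaling factors translate directly into code. The Python sketch below follows the formulas in the paragraph above; the function names are hypothetical, Euclidean distance is assumed, and nothing here checks whether the resulting kernel matrix is positive semi-definite, which would need to be examined before using it in an SVM. Identical vectors with different labels would give a scaling factor of zero and would have to be handled separately.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def training_sigmas(vectors, labels):
    """For each training vector, distance to the closest vector
    that carries a different label."""
    sigmas = []
    for v, lab in zip(vectors, labels):
        others = [math.sqrt(sq_dist(v, u))
                  for u, other in zip(vectors, labels) if other != lab]
        sigmas.append(min(others))
    return sigmas


def sigma_for_unlabeled(x, vectors, labels):
    """Maximum over classes of the minimum distance from x
    to the training vectors of that class."""
    per_class = {}
    for v, lab in zip(vectors, labels):
        d = math.sqrt(sq_dist(x, v))
        per_class[lab] = min(d, per_class.get(lab, d))
    return max(per_class.values())


def kernel(fi, fj, sigma_i, sigma_j):
    """K(i,j) = exp(-0.5 ||fi-fj||^2 / (sigma_i sigma_j))."""
    return math.exp(-0.5 * sq_dist(fi, fj) / (sigma_i * sigma_j))
```

Vectors far from the class boundary get a large scaling factor, so their kernel values fall off slowly, while vectors near the boundary get a small one and a sharply peaked kernel.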
One could investigate the relationship between effective family
size on the one hand and percent sequence identity, pairwise sequence
p-values, and other measures of sequence similarity on the other. Characterize
existing databases (Pfam, SCOP, etc.) in terms of the sizes (using
different measures) of the families represented.
The ESIZE algorithm is available as part of the FPS software,
available here.
Postdoc Yael Mandel-Gutfreund is particularly interested in
predicting protein-nucleic-acid interactions and may be willing to help
supervise projects in this area.
There are some specific open-source projects that would be
interesting to me:
I have started an open-source
C library (with a random generator for Dirichlet mixtures), and
would like to see it expanded into a more complete Dirichlet mixture
library, while maintaining a very modular, minimalist approach to the
code.
There have been requests for source code for our
sequence-weighting techniques (which rely on Dirichlet
mixture computations). It would be very useful to extract these
sequence weighting techniques from predict-2nd, estimate-dist, or
the SAM software suite, and incorporate them into an open-source
project. The biggest difference between our different
implementations is the different implementations of the
multiple-alignment data type. If you started with an open-source
project that had an adequate representation, it could be quite
straight-forward to port the code.
Dali, Vast, CE, and Kenobi are worth
looking at for inspiration.
Few of the structure-structure aligners use sequence
information to help disambiguate the structural alignment (one
exception is Jung and Lee's program Sheba).
For an example of where ignoring sequence can result in poor
structural alignments, consider DALI's alignment of 1c1gA, 1c1gB,
1c1gC, and 1c1gD. These are all identical sequences, but 1c1gA and
1c1gB are aligned with only 11% residue identity in FSSP, because the
coiled-coils are slightly kinked in the solved structure.
Note: the Dali Domain Dictionary uses T-coffee with DALI
structural alignments to create multiple alignments, so there may not
be much new that needs to be done here.
The project here could be to build a small "legos" library, or to
try to find a use for Zhipeng Weng's library within undertaker,
Rosetta, or some other fragment packer.
The scoring functions for fragment packers and threaders have
different constraints, as fragment packers have reason to believe that
the local structures of the backbone are reasonable, but need to
determine if the overall conformation is protein-like, while threaders
know that all conformations are protein-like, but that the residues
may be in highly unlikely conformations (locally or in terms of
tertiary contacts).
Because the objectives of threading and fragment-packing score
functions are different, it may be useful to evaluate score functions
in both contexts (before throwing them out as useless).
I would like for us to be able to generate such plots efficiently,
using the C++ library of routines that we have developed for handling
protein structures, transformations, and so forth.
Note: this project was done last year, but the results were not installed.
It may be a very small project to clean up the code and install it,
but there are certainly many more features that could be added to the program.
The project is to provide a web interface that allows retrieving
individual multiple alignments and HMMs, by giving PDB sequence ids,
keywords, or protein sequences.
When we do not have exactly the requested sequence, similar sequences
should be offered (perhaps using FSSP's TABLE2 to suggest sequences).
The search by sequence should probably use some version of blastp,
though an option for doing the more expensive search by scoring the
sequence against all models should be provided.
It would be good to offer the alignments in the same formats that the
SAM-T99 web page offers (a2m, prettyalign, html, sequence logo) and
even a2m_with_dots.
In the HTML format, we should show what sequences are structurally
aligned by FSSP and give DALI Z-scores. We should also link any PDB
sequences to their entries in PDB, FSSP, DSSP, SCOP, ... .
We should give DSSP and STRIDE secondary structure in the
alignment part of the HTML file.
The interface should be easy to use, with easy maintenance as new
multiple alignments are added.
There should probably be a queue of old alignments to rebuild, to take
advantage of updates to the sequence databases. When an alignment
more than x months old is retrieved by a user, it should be queued for
rebuilding. Instead of maintaining a consistent FSSP database, we
might want to update TABLE2 and TABLE3 weekly, but only update the
FSSP files themselves when an alignment is rebuilt.
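The retrieve-then-queue policy could be sketched as follows. The class name, method names, and the 180-day default are all hypothetical; the text leaves "x months" unspecified, and a real server would need persistent storage rather than an in-memory queue.

```python
from collections import deque
from datetime import datetime, timedelta


class RebuildQueue:
    """In-memory FIFO of alignment ids that are due for rebuilding."""

    def __init__(self, max_age_days=180):  # placeholder for "x months"
        self.max_age = timedelta(days=max_age_days)
        self._queue = deque()
        self._queued = set()  # avoids queueing the same id twice

    def note_retrieval(self, seq_id, built_at, now=None):
        """Record a user retrieval; queue the alignment if it is stale."""
        now = now or datetime.now()
        stale = (now - built_at) > self.max_age
        if stale and seq_id not in self._queued:
            self._queue.append(seq_id)
            self._queued.add(seq_id)
        return stale

    def next_to_rebuild(self):
        """Return the oldest queued id, or None if nothing is waiting."""
        if not self._queue:
            return None
        seq_id = self._queue.popleft()
        self._queued.discard(seq_id)
        return seq_id
```

Queueing on retrieval means rebuilding effort goes to the alignments that users actually request, rather than to the whole library.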
It would also be a good idea to provide some way to download the
entire set of alignments using FTP.
This would probably require a script that uses tar to bundle all the
existing alignments into a single file and gzips it.
Whether this script is called on demand or is run as a cron job on a
regular basis depends on how frequent FTP requests for the whole set
are expected to be.
There should probably also be FTP access to individual .a2m files.
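A bundling script along these lines might look like the sketch below, which uses Python's standard tarfile module instead of calling tar and gzip directly. The function name, the directory layout, and the "*.a2m" pattern are assumptions for illustration.

```python
import tarfile
from pathlib import Path


def bundle_alignments(alignment_dir, archive_path, pattern="*.a2m"):
    """Write every matching alignment under alignment_dir into one .tar.gz.

    Returns the number of files archived.
    """
    alignment_dir = Path(alignment_dir)
    files = sorted(alignment_dir.rglob(pattern))
    # "w:gz" writes a gzip-compressed tar archive in a single pass.
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in files:
            # Store paths relative to the library root, not absolute paths.
            tar.add(path, arcname=path.relative_to(alignment_dir).as_posix())
    return len(files)
```

Run from cron, the script would write to a temporary name and rename the result into the FTP directory, so that a download never sees a half-written archive.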
We will soon have other such large libraries (such as predictions
for the whole human genome).
It would be good for the mechanism to be easily modified to
incorporate different libraries.
TeXshade
Compare different protein family diversity measures
Predicting protein-protein and protein-nucleic-acid interaction sites
Open source projects
Open source library for Dirichlet mixture functions
Regularizer library for BioJava
We have a moderately large library of useful C++ classes in /projects/compbio/ultimate.
It might be useful to redesign some of these and release them as
part of BioJava. I think that the Alphabet and Regularizer types
are particularly useful, though there may need to be some redesign
of Alphabet to be compatible with existing BioJava design.
Add to Telegraph open-source dynamic programming project
Yet another structure-structure aligner
Use of residue pairs in protein sequence-sequence and
sequence-structure alignments
Protein Science (2000), 9: 1576-1588.
Jongsun Jung and Byungkook Lee
It might be interesting to incorporate information from a posterior
decoding matrix (which is generally better than a sequence-sequence
alignment) into a Dali-like structure-structure comparison.
Multiple alignment of protein structures
Protein legos
Build a threader
Hubbard plots
Web interface for SAM-T99 library
Here are some possible topics (with rough size in lectures for each), though naturally not all these topics could be covered in one course:
A somewhat simpler approach has been proposed based on the observations of Michael Tanner for similar problems in decoding complex codes. Instead of doing optimum decoding, you are often better off using a better code (that is, a better model of your data) and using a non-optimal decoding scheme. The iterative scheme that Tanner has used is quite simple to implement and can be easily translated into the sort of information we get from an RNA secondary structure model (like an SCFG).
I don't know whether the Drosophila genome is fully annotated for cytological location (if it is, this would be simple query construction). If it is not (the more likely case), then location prediction would be needed (of course, location information would need to be marked to indicate the degree of certainty). There has been a fair amount of literature on predicting the cytological location of proteins, but a lot remains to be done, both in determining what features to consider and in determining what classification algorithms to use.
This looks like a good application for the protein sequence analysis that Bioinformatics I tries to teach. There are at least 54 PDB chains identified as ATPase, and I'm not sure which of them are of interest. I suspect that bovine mitochondrial f1-atpase (FSSP reps: 1skyB [alpha subunit], 1skyE [beta subunit]) is one of the ones of interest. These are both 3-domain proteins, with SCOP domains 1.69.1.1.3, 2.46.1.1.3, and 3.31.1.10.5. The PDB file 1bmf has a different quaternary structure for the proteins.
Other FSSP representatives (and SCOP domains) of chains that have "atpase" in their names are 1a91 (6.2.1.1.7), 1aw0 (4.48.16.1.2), 1b8xA (1.47.1.1.19, 3.42.1.5.19), 1ba1 (3.50.1.1.1), 1bmfG (1.20.1.1.1), and 1cz4A (2.49.2.3.3, 4.27.1.1.3). I don't know if any of these are relevant, though 1bmfG clearly interacts with the alpha and beta subunits (indeed, it is part of gamma-f-atpase).
See also some earlier work on v-atpase in
/projects/compbio/experiments/protein-predict/bowman/v-atpase
Note particularly the importance of using reverse-sequence null models
because of the long amphipathic helices in these proteins.
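As a rough illustration of the idea, a reverse-sequence null model can be described as scoring a sequence against its own reversal under the same model, so that score contributions that do not depend on residue order cancel. The sketch below assumes that formulation; both function names are hypothetical, and the composition-only scorer is a toy stand-in, not an HMM.

```python
import math
from typing import Callable


def reverse_null_score(seq: str, log_prob: Callable[[str], float]) -> float:
    """Log-odds of seq versus its reversal, both scored by the same model."""
    return log_prob(seq) - log_prob(seq[::-1])


def composition_log_prob(seq: str, hydrophobic: str = "AILMFVW",
                         p_hydrophobic: float = 0.7) -> float:
    """Toy model (assumption): rewards hydrophobic content, ignores order."""
    n_h = sum(c in hydrophobic for c in seq)
    n_other = len(seq) - n_h
    return n_h * math.log(p_hydrophobic) + n_other * math.log(1.0 - p_hydrophobic)
```

Because the reversed sequence has the same composition, a model that responds only to composition gets a reverse-null score of zero; only order-dependent signal survives the subtraction.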
I have one program (phytree) which is my reimplementation of a
somewhat crude phylogenetic tree method developed by Kimmen Sjolander.
Despite its rather crude trees, it has been valuable in helping select
templates for protein structure prediction, when there are several
templates from the same superfamily available.
This program could stand quite a bit of work (for example, eliminating
the crufty readseq input routines and using zlib to read compressed
a2m alignments). It might also be pedagogically valuable to try
implementing some of the standard phylogenetic tree algorithms (such
as neighbor joining).
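For orientation, a minimal sketch of neighbor joining follows. It takes a pairwise distance table and returns an unrooted tree as a Newick string. The function name and the input format (a dict keyed by frozenset pairs) are assumptions made for this example and have nothing to do with phytree's actual interfaces; the sketch recomputes row sums on every pass, so it is not tuned for large alignments.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbor joining; return a Newick string.

    names: list of at least three distinct leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    if len(names) < 3:
        raise ValueError("need at least three taxa")
    nodes = list(names)
    d = {frozenset(p): float(dist[frozenset(p)]) for p in combinations(nodes, 2)}

    def get(a, b):
        return d[frozenset((a, b))]

    while len(nodes) > 3:
        n = len(nodes)
        # r[x] is the total distance from x to every other active node.
        r = {x: sum(get(x, y) for y in nodes if y != x) for x in nodes}
        # Join the pair minimizing Q(a, b) = (n - 2) d(a, b) - r[a] - r[b].
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * get(*p) - r[p[0]] - r[p[1]])
        len_a = 0.5 * get(a, b) + (r[a] - r[b]) / (2 * (n - 2))
        len_b = get(a, b) - len_a
        # An internal node is labelled by the Newick text of its subtree.
        new = f"({a}:{len_a:.3f},{b}:{len_b:.3f})"
        for k in nodes:
            if k not in (a, b):
                d[frozenset((new, k))] = 0.5 * (get(a, k) + get(b, k) - get(a, b))
        nodes = [k for k in nodes if k not in (a, b)] + [new]

    # Three nodes remain: connect them at a single central node.
    x, y, z = nodes
    len_x = 0.5 * (get(x, y) + get(x, z) - get(y, z))
    len_y = 0.5 * (get(x, y) + get(y, z) - get(x, z))
    len_z = 0.5 * (get(x, z) + get(y, z) - get(x, y))
    return f"({x}:{len_x:.3f},{y}:{len_y:.3f},{z}:{len_z:.3f});"
```

Note that neighbor joining can produce negative branch lengths when the distances are far from additive; the sketch reports them as computed rather than clamping them to zero.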
A more ambitious student could try to implement some of the more
modern ideas on phylogenetic trees. The
Symposium on Competing Technologies for Phylogenetics (SCOPH)
(April 19-21, 2001 Universite de Montreal, Montreal, Quebec) lists the
following:
The meeting is fairly cheap ($50 for students), so if you can get to
Montreal just before RECOMB 2001 and are interested in phylogenetics, it looks like the place
to be.
Ion channel (nanopore)
Evolutionary (phylogenetic) trees
Olivier Gascuel, Distance based methods
Tandy Warnow, Fast converging methods
David Bryant, Quartet and combinatorial methods
Andreas Dress, Networks and split decomposition
Kevin Nixon, Parsimony
David Sankoff, Gene order parsimony
David Swofford, Maximum Likelihood
John Huelsenbeck, Bayesian methods
Mike Steel, General stochastic models; Hadamard conjugation
Tom Hagedorn, Phylogenetic invariants
Joe Felsenstein, Validation methods
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building