CMPS 243 Winter 2002 -- Bioinformatics I

Project List

(Last Update: 11:21 PST, 30 December 2001 )
This project list has been updated from last year's list, but there may still be additions as I remember other projects that I had forgotten to include.

Approximately half the grade for this course will be based on the successful completion of a substantial project. Students are expected to turn in

Fri Jan 11: half-page proposal
Students should select a project within the first two weeks of the class, and make a very short proposal outlining what they want to do.
Fri Jan 25: detailed research plan
By week 4, students should have researched previous work on the topic of their project, done some preliminary assessment of the difficulty of the project, determined how much they can get done in a quarter, made a list of milestones, scheduled deadlines for various parts of the project, and perhaps started preliminary coding. The research plan should be roughly the equivalent of the introduction and prior work sections of a research paper, with an outline (with deadlines) for the rest of the paper.
Weekly drafts of final report, each Friday at 4pm
Students are required to turn in a weekly report starting Jan 18. Perhaps the best way to keep the amount of writing to a minimum is for the weekly report to be drafts of the final project report, with change bars in the margin indicating any changes from the previous week's report. The research plan due Jan 25 is one of these progress reports.
Final report, Tuesday, 18 March 2002, 8am.
The final report is primarily what will be evaluated, but the timeliness and usefulness of the weekly reports will also be considered. It is expected that as many as half the projects will be suitable for conference or journal publication (perhaps with an extra quarter's work).
Poster Session, Tuesday 19 March 2002, 8 a.m.--11 a.m.
In addition to the written report, we will have a poster session during the scheduled exam time for the course.

Each student is expected to have a 30-minute weekly meeting with the instructor. These meetings will be used to discuss the project, aid in debugging, and help with homework.

Sufficiently large projects may be tackled by a group of students, but past experience has favored individual projects. Projects can be supervised by more senior graduate students or faculty other than the instructor---these are often the most successful projects. External supervision is in addition to, not in place of the weekly meetings with the instructor.

This is a list of potential projects for students in Bioinformatics I. There are, of course, many other potential projects, and each student will be required to submit a detailed proposal for the project that he or she intends to do.

The best projects are usually ones that students are most motivated to do, so don't consider yourself constrained by this list---if you have a project idea you would like to pursue, propose it!


Project list--Winter 2002

This list contains project ideas for this year. Some have been adapted from previous years' lists, others are new. The main focus of the course this quarter will be on proteins---particularly protein structure prediction. If you have projects that are primarily dealing with nucleic acids, I recommend keeping them until Spring quarter in 244.

Fold recognition

Our fold-recognition methods are based on a number of different components: finding homologs of the target, building a multiple alignment, predicting local structure for the target, scoring templates sequences (and local structures) with the target HMM, scoring the target sequence with template HMMs, selecting the best templates, aligning the target with the templates, and selecting the best alignments. We want to improve several aspects of this process and provide a web interface to the whole thing.

We need to create a new fold-recognition server to replace the SAM-T99 server. There are many improvements that we have made in the past couple of years that have not been put into a server yet, and many ideas that have not yet been implemented.

In order to make the server easy to extend, we are planning to use a Makefile (using gnu's version of make) to control the computation. This makes it very easy to add new features to web site, separating the flow of control from the web interface. Rachel Karchin will be creating the web Makefile and web scripts, based on one of the Makefiles in /projects/compbio/experiments/protein-predict/ and on the SAM-T99 web site.

Improved homolog search

SAM-T2K needs to be updated to SAM-T02. For example, the last iteration of the search procedure should use a calibrated model. This means adding a calibration step, and adjusting the thresholds for the earlier steps so that the errors in the uncalibrated models don't make the early steps include too many sequences. (One could calibrate at each iteration, but this seems unnecessarily expensive.)

Improved search/multiple alignment by parameter adjustment

Fold-recognition tests should be done to tune some of the parameters---particularly the threshold of the final iteration and the transition regularizers needed. The current fold-recognition test will take about 40-80 CPU days to build all the models and do the test, but there are some details still to work out in doing the scripting. The model building is the slow part---testing a set of models takes only about 36 CPU hours.

A more modest version of parameter tweaking would not change the underlying multiple alignments, but just the weighting of the amino-acid track and the predicted secondary structure track.

The fold-recognition tests I'm currently using are in ~karplus/pcef/fssp-test-01, and are mainly summarized in the Makefile, though there are descriptions of it in some of my recent talks.

Improved search/multiple alignment using psiblast

The prefilter we have for SAM-T2k uses wu-blastp, and it would probably be a good idea to provide a prefilter for NCBI blastp as well. With NCBI blast, we could also speed up the extraction of the FASTA sequence files, which is woefully inefficient in the current implementation. (We would also get rid of the sequence counting step, since blast reports how many sequences it searches.)

It might be interesting to change the pre-filter method to generate psi-blast profiles from the HMMs, and use psi-blast as a prefilter at each iteration. This should allow us to detect slightly more remote homologs than our current pre-filtering strategy, without enormous increase in cost.

One warning: people think of psi-blast as very fast, but only the blast part is fast. When psiblast has to realign many sequences to a profile, it gets quite slow also. It may be that we only need to use the BLAST part of psi-blast, if we generate psiblast profiles from our HMMs. (Our HMM aligning is no faster than psi-blast's but it does seem to produce better alignments.)

Improved multiple alignment using T-coffee

The best multiple alignment program currently is T-coffee (at least based on the rather poor balibase alignment benchmark). We may get some advantage from using T-coffee to produce multiple alignments of the homologous sequences instead of buildmodel. T-coffee has a problem with large numbers of sequences (it is O(n^3)), so we may have to do a quick-and-dirty alignment with an HMM, thin with uniqseq, select a small enough subset of sequences, align the subset with T-coffee, convert back to the alignment format we use, then extend the alignment to all sequences using an HMM.

Improve secondary structure prediction

Our secondary structure predictor is one of the world's best, but David Jones's PsiPred is doing slightly better in the EVA evaluations. The difference is not significant according to the EVA criteria, but it has been fairly consistent---it may be due to their training on DSSP and our training on STRIDE.

I have several ideas for improving the secondary structure predictor, some of which are trivial and some of which are difficult. Some of the difficult ones are already being implemented by grad students (in their spare time), but a collaboration to get the methods finished and tested would useful even on these. Here are a few of the project ideas:

Predicting Local Structure

Several people have observed that the current secondary structure classification is not a very complete description of the local structure of proteins. There are definitely common conformations of short segments of protein backbones that are not well-expressed in the popular 3-state classification of residues. Some of these are highly predictable (see the ISITES library, for example).

The goal of this project is to come up with other structural features and methods for predicting them. We are looking for a one-dimensional representation here (some feature for each residue) that can be used with HMMs. This project can vary enormously in scope, from quick checks of a single proposed feature set to Ph.D. dissertations. In fact, Rachel Karchin has proposed a substantial investigation of local structure features for her Ph.D. thesis (see /projects/compbio/papers/karchin/thesis-proposal.ps).

Here are some possible criteria for choosing features:

Here are some ideas for possibly useful structural features to predict:

Parameter tweaking for improved use of local structure prediction

In CASP4, it looked like a lot of our improvement between the SAM-T99 automatic server and the SAM-T2K hand results came from the new 2-track HMMs that use both amino acid and secondary structure. We need to test different weightings of the two tracks for fold recognition and for alignment. These tests can be done more quickly than other parameter adjustments, since the underlying HMMs do not need to be rebuilt.

Multi-track HMMS for template models

The 2-track HMMs can currently only be used with target models, not template models, since we need an accurate secondary structure for the sequences and probability vectors for the model. If we could create reasonable probability vectors for the templates (perhaps by using context-sensitive substitution matrix on the real secondary structure string), we could use 2-track HMMs with template models to score the predicted secondary structure of the target.

The idea here is to look at what the probability is for the secondary structure predictor predicting a H, E, or L as the most probable letter in a specific context (such as mid-strand, beginning of strand, end of strand, ...). The appropriate probabilities can then be used in building the secondary structure HMM. The creation of the substitution matrix should probably be automated so that (1) the definitions of the contexts is easily changed and (2) new secondary structure predictors can have substitution matrices calculated for them quickly. The substitution matrices would actually be quite useful in designing and debugging secondary structure predictors, so should be a stand-alone tool.

Another way to create 2-track HMMs for template models is to use probability vectors in the sequence and labels in the states (instead of vectors in the states and labels in the sequence). In this way we could use the predictions of secondary structure for the target sequence with the known secondary structure sequence for both the target and the template models. This method would require some (fairly modest) changes to the inner loop of SAM's hmmscore program, and a number of I/O changes.

Calibrating HMMs

We have discovered that some assumptions made about the way statistical significance (the E-values) were computed in SAM are incorrect, and we have created a -calibrate option to hmmscore to calibrate the HMMs properly. Calibration consists of two parts: (1) generating and scoring sequences and (2) fitting the distribution of their scores. The quality of the calibration depends on the sequences chosen and on the quality of the fitting.

I have created a generator for amino-acid sequences that I believe creates better "random" protein-like sequences than standard methods. This generator is now built into hmmscore.

We could still use a generator for paired sequences (amino acid and secondary structure) for calibrating multi-track HMMs. There are several approaches for generating such pairs, and I don't know which will be best:

  1. Generate AA sequence as currently, and generate a 2ry sequence using the same approach of choosing a composition distribution (another dirichlet mixture) , and doing independent draws from it. This is what hmmscore does, but the secondary structure strings generated are terrrible decoys. The problem is that secondary structure letters usually come in runs---they are not even approximately independent from one position to the next. As a result the calibration from this process is worse than doing nothing.
  2. Generate AA sequence as currently, then use a neural net or other 2ry structure predictor to produce a secondary structure string. This would provide good correlation between the AA string and the 2ry string, but I would expect the 2ry strings to be unusually high in "loop" letters, since the random AA sequence is not likely to have the characteristics of structured proteins.
  3. Generate AA sequence as currently, but use a "generalized" HMM to generate the secondary structure string. Each state generates a segment, rather than a single residue. The states could be pure labeled states (just E, just H, or just L) and a length distribution would be kept for each state (either explicitly or as a parameterized distribution). I think that a small HMM of this type could be created that would match the existing data quite well. It might be hard to get overall length right, but probably easier than with residue-at-a-time HMMs. There would be no connection to the AA generator, except through the sequence length---it would probably be easier to generate the secondary structure string from the generalized HMM then generate an AA sequence to match the length. I have segment length histograms and a "segment composition" Dirichlet mixture, so this would be fairly straight-forward to implement for the EBGHTL alphabet.

    We would probably need a set of tools for dealing with these segment-based HMMs, so that we could train them on real 2ry strings and build a generator based on them. There may also be some applications of such HMMs to secondary-structure prediction, post-processing the output of residue-at-a-time predictors.

    A simpler version of this approach would generate the segment labels as almost independent draws (prohibiting duplicating the same letter twice in a row), then use the segment length distributions to generate the whole strings from the segment labels.

    This approach should work-well for 2-track HMMs (since the amino acid track and the secondary-structure track have only about 0.1 bits of mutual information when considered a column at a time), but not for multi-track HMMs with multiple variants of the secondary structure alphabet (since these will be highly correlated).

  4. Yet another approach is to generate the AA string and the secondary structure string simultaneously from an HMM, such as the HMM used by Spencer Tu to predict 2ry structure. It should be easy using Spencer's code to write a sequence generator that generates a random path through the HMM with random emissions from each state. The two outputs could be sent to a pair of FASTA files, to match the input expectations of the programs that handle multi-track data. I suspect that this generator would produce the most realistic decoys, but the sequence lengths might come out too short. It should be easy to check that hypothesis---just generate a 1000 or so sequences and make a histogram of their lengths.

In addition to generating the sequences (ideally entirely with C code, so that the sequence generator can be easily incorporated into SAM), we have to fit the parameters of the E-value computation to the observed distribution of scores. I'll cover in class the derivation of the E-value computation we use for the reverse-sequence null, and show how to fit lambda very easily. We can generalize the fitting, introducing an extra variable as an exponent on the scores. Fitting this two-parameter model requires some iteration, but is still fast. The two-parameter model we are fitting has no theoretical justification. it would be nice to come up with a heavy-tailed distribution that fit the data well and had some justification.

Our current best results are from moment-matching to fit the distribution of database sequences, but it would be nice to use a maximum-likelihood (or maximum posterior probability) approach instead. This would probably require a different family of heavy-tailed distributions.

Improve template HMMs by using structural alignments as seeds

An idea that we tried an rejected a couple years ago was to build SAM-Txx models starting with a structural alignment as a seed. The ideas was that structural alignments are better at pulling out the important residue conservation patterns than sequence alignments. This did not work as well as we expected. I think that the problem was that we were using fold-level multiple alignments, so the sequence signal was too diluted to be useful. We might be able to use superfamily-level structural multiple alignments more effectively. We could select the superfamilies either by using SCOP or by using some combination of the FSSP Z-score and %identity.

We have found that HMMs built from structural alignments do a better job of aligning remote homologs (not included in the structural alignment) than HMMs built from SAM-T99 alignments.

Certainly any method that selects between alignments should include alignments build from structurally aligned templates.

Labeling template hits by SCOP family and FSSP representative

It is useful when examining template matches to cluster similar templates together. There are four standard databases that are useful for doing this: SCOP, FSSP, Dali Domain Dictionary, and CATH. It would be useful to annotate each hit in the template library with the appropriate identifiers from the databases.

Note: this annotation is straightforward, but not completely trivial. The databases generally contain DOMAINS, not whole chains. Therefore, to avoid incorrect labeling, one has to examine the alignment between the target and the template and determine which domains of the template are matched by the target. The SCOP database may be the best to use for this purpose, though we will have some templates in the library that are not yet in SCOP.

Improve template selection by dividing into domains

Multi-domain targets often cause problems for fold-recognition techniques. One problem is that a particular domain matches fairly strongly and the other domains are either ignored (no sufficiently strong scores seen to compete with the good domain) or are mistakenly matched to domains that are sometimes paired with the strongly matching domain.

To avoid this problem, we could try to split the target up into separate domains, scoring each one separately. The hard part comes in figuring out where to put domain boundaries. Some possible methods include

One could try several different (possibly even overlapping) domain divisions, then select the one that gets the best matches to domain templates.

Improve template selection by combining information from multiple templates

Timothy Bailey and William Grundy have created a method for combining correlated p-values to get improved fold recognition. They applied their method to improve a method based on pairwise alignment (a rather weak initial method). It would be interesting to see if their method would improve fold recognition when applied to a more powerful method (such as SAM-T2K or SAM-T02). I suspect that it would improve the performance, since the extra information that several models for the same superfamily score well has been used in our human-intervention methods, but not so far in the automatic ones. A crude test of this was done last year in 243, and it seemed to help, but the test needs to be redone with better controls, and the full template library calibrated. The software for maintaining the template library as new templates are added also needs to be developed.

One problem with the product-of-p-values method is that all the templates in a fold family are treated equivalently. In practice there may be several templates from one subfamily and only one from another. Ideally, a scheme should be developed that can weight each template appropriately.

Improve template selection with functional information

In doing structure prediction for CASP, we often look at the functional description of the protein to choose between possible templates. It would be good to build some automatic methods for this. One promising preliminary approach is the SAWTED tool, which is part of the 3D-PSSM fold-recognition server.

The algorithm is quite simple: keywords and text are extracted from the best-matching Swissprot entries of the target and templates, and dot-products of their frequencies of words (excluding non-discriminating words) are used as a similarity measure. Actually, one usually normalizes to a "cosine" similarity measure:
a.b / (sqrt(a.a) sqrt(b.b))

The project would be either to re-implement and test the SAWTED algorithm here or to develop a similarly simple algorithm and test it. A possible variation: use the SAWTED score as a kernel function for an SVM multi-class classifier. The fold prediction method should be evaluated by itself and as a complement to an existing method (such as SAM-T99). (Possible Ph.D. project: developing a more sophisticated way to handle functional information.)

Some preliminary results were obtained last year and this method looks promising. A possible starting point for a scoring function is to take the E-value from the 2-track t2k-1-.3-ebghtl fold recognition results and divide by the "cosine" similarity measure raised to some power.

We need to look into more sophisticated representations than just a "bag of words". Possible improvements include stemming and multi-word phrases. The version last year only extracted information from the single best-matching Swissprot entry, when we probably should look at the top several Swissprot entries.

Improved template selection and alignment with HMM-HMM alignment

We have had moderate success with HMM-based structure prediction, but we could probably do better with it. One thing that we have not experimented much with is to use HMM-HMM alignment. It might be good to try doing local alignment with matching state i of one HMM to state j of the other costing -log sum_a P(a_i)P(a_j) (the co-emission probability). Figuring out what to charge for insertions and deletions is more difficult, as the different HMMs have different transition costs (we could simply ignore the costs from one of the HMMs, for example). Local alignments could be tested either for alignment quality (using Melissa Cline's alignment tests) or for fold-recognition capability (using the test set that we have used for testing SAM-T2K).

Another approach to aligning two sequences is to take the component-wise product of the posterior decoding matrices for sequence A against HMM B and sequence B against HMM A, and doing posterior decoding of the resulting matrix.

Improved target/template alignment

We have not yet determined the best way to to align target with template. There is a pairwise alignment test set documented in ~cline/for_kk/aligntest.html and ~cline/for_kk/comparison.html. These tests have not been applied to multi-track HMMs yet, but we are VERY interested in knowing whether the multi-track HMMs lead to better alignments (and not just better fold recognition).

Selecting good target/template alignments

So far, the only automatic ways we have of selecting alignments are to look at the E-values reported by the HMMs that created the alignments or to select a general method (such as aligning the target to an HMM based on a structural alignment of templates) based on some alignment tests. In the hand-selection of alignments we examine the 3d structure that the alignment predicts for the target.

In particular, we look for features like the compactness of the resulting model, whether internal strands of beta sheets are missing (though surrounding strands are present), whether active-site residues are preserved, whether identical residues cluster in 3-space, whether they form stripes across beta sheets, and so forth. Some of these checks can be automated (such as the compactness test). It would be good to automate some of the checks, and test them to see if they improve the fold-recognition capabilities of the SAM-T2K method. The first checks I would attempt to automate are radius of gyration and contact order, both of which are easy to compute from a structure. We'd have to get histograms of these measures for real structures and for "good" predictions, to see what to expect.

Other 3D scoring techniques (such as threading score functions or fragment-packing score functions) could also be used to filter the top hits from the HMM scoring.

One would have to be careful with any scoring function, since the sizes of the alignments can be quite different and that can affect the scores more than the quality of the prediction does.

Viewing predictions

For viewing our predictions, we have been using rasmol as the graphics engine, and a locally produced program (saen) that shows the alignment and allows selecting residues from the alignment as arguments for rasmol commands. It also allows editing the alignment, highlighting the identical residues. This program really needs to be rewritten to be more easily ported between systems (it currently relies on the AceDB package for doing graphics, but should probalby use something like OpenGL instead). It is working on the old Alpha systems, but has not been ported to the Linux boxes yet---it would be very useful to have it rewritten and improved.

Undertaker

For the past two or three years, I have been working (on and off) on a protein-structure prediction program that uses "fragment packing", an approach introduced by K.T. Simons and David Baker of the University of Washington. The program is now working (that is, it is generating conformations and scoring them), but I have a huge list of things to try to make it actually work (it doesn't come close to doing what Baker's Rosetta does).

If several people pick projects from this list, we will probably have to start using cvs to manage the undertaker source code, or we'll never get all the pieces integrated.

Protein projects that are NOT part of our CASP effort

Creation of protein web pages

I would like to create a script for quickly building a web page of useful information and links for a specific protein. You can get an idea of what sort of web page I'm thinking of by looking at the mockup for DNA-repair proteins. When the script is completed, we should offer to make up web pages for each of the proteins being studied by biologists or chemists here at UCSC.

A variant of this project for the non-computer people: Pick a protein or protein family whose structure has not been solved yet, and use existing tools (here and on the web) to do the best job you can of predicting the structure, homologs, evolutionary relationships, active site residues, function, ... of the protein. Prepare the result either as a proposal for specific wet-lab work to verify computational hypotheses or as comprehensive web page for a lab that is studying the protein. This project will require a lot of library work and use of web tools (for example, the metaserver at bioinfo.pl), but not much original programming. This is an end-user's view of bioinformatics, rather than a tool-developer's view.

Last year, Lydia Gregoret expressed an interest in creating a web site for "all things cold shock protein related". She said

There might be links to

  • labs studying cold shock protein function
  • labs studying cold shock protein folding
  • labs studying cold shock protein structure
  • evolution of the cold shock fold
  • automated PubMed search for cold shock articles
  • automated PDB search for cold shock protein structure
  • nice images of cold shock proteins
among other things. An example site on lectins can be found at
http://www.cermav.cnrs.fr/databank/lectine/ This is not exactly what I had in mind, but a start.

The student would learn about the relationship between protein sequence, structure, and function, what kinds of information is out there and also gain experience in scientific web page design.

Lydia is no longer at UCSC, but other researchers in biology or chemistry may want web pages for their favorite protein families---ask around!

Support Vector Machines for subfamily classification

The hidden Markov models (HMMs) used by SAM have been our main tool for protein family recognition. They work pretty well at the superfamily level, but are often not discriminating enough to distinguish subfamilies.

Rachel Karchin has used support vector machines (SVMs) to build subfamily recognizers for the subfamilies of the GPCRs (G-protein-coupled receptors). She started with the methods and code developed by Tommi Jaakkola and Mark Diekhans, but has since rewritten the SVM programs in Java for easier modification and portability.

Projects here include building SVMs for classifying other families of proteins (using existing tools), trying to improve the SVMs by reducing the large feature vectors, experimenting with different ways of generating feature vectors, trying other ways to use the feature vectors besides SVMs, ... .

Classifying proteins by SVMs is an attractive problem, because we usually have a fairly small set of positive training examples. The big problem is that we often have an extremely large number of negative examples. We might be able to make faster, more accurate SVMs by doing pre-screening with the HMM used to create the feature vectors. If we pick a threshold for the HMM score that is loose enough to include all the positive training examples and about 100 false positives, and use the SVM only for classifying the sequences that pass this HMM test, then the SVM only needs to learn the difficult cases that really are near the boundary. This should result in smaller classifiers (fewer support vectors) that do as good a job or better and are much faster to train [training time is generally quadratic in the number of training examples].

One particularly useful project would be to build a library of classifiers (or a multi-family classifier) for all the SCOP superfamilies. This could be used as a fold recognizer, followed by HMM-based aligners to produce the alignments. If we could get this automated, I think it would make an excellent new structure prediction server. In version 50 of SCOP there are 934 superfamilies, ranging in size from one structure (160 such superfamilies) to 1822 (for superfamily 2.1.1--immunoglobulins). Other heavily populated superfamilies include lysozyme-like (4.2.1, 648 members), globin-like (1.1.1, 612 members), trypsin-line serine proteases (2.44.1, 510 members), NAD(P)-binding Rossmann-fold domains (3.2.1, 454 members), Thiamin diphosphate-binding fold (THDP-binding) (3.31.1, 421 members). [Note: a SCOP classifier would be of use in the fold-recognition experiments for CASP.]

New kernels for Support Vector Machines

The "Fisher score" vector introduced by Tommi Jaakola seems to be a pretty good feature vector when used with a properly scaled radial-basis kernel. It is a rather large vector, though, and many of the features are probably just adding noise. It would be interesting to try some experiments in selecting features out of the vector or weighting each feature differently. Also, the radial-basis kernel only looks at the magnitude of differences between features, not at the sign of the difference. It might be interesting to explore other kernels that could use the sign information as well.

This project would probably consist of a detailed exploration of the feature vectors for one, difficult classification problem (perhaps one of the GPCR subfamilies that Rachel Karchin has worked on). We would want to look for ways to measure the value of a particular feature, then either weight features or select a subset of them, train an SVM using the modified feature set, and compare its performance with an SVM trained on the full feature vector.

Here are some preliminary thoughts on ways to look for informative features:

Another variant on the radial-basis kernel that I want to explore (perhaps with Rachel Karchin) is replacing the "sigma^2" scaling factor with a separate scaling factor for each vector in the training set: K(i,j) = exp(-0.5 ||fi-fj||^2 / (sigma_i sigma_j)). The scaling factor could be set to the distance from the vector to the closest vector with a different label. For unlabeled (test) data, the scaling factor could be set to the maximum over the possible classes of the minimum distance from the test vector to the training vectors in the class. Having different scaling for different training vectors may help generalize the classification better, since the kernel won't drop to zero so quickly away from the data, except in the neighborhood of the boundary.

TeXshade

There is a package, TeXshade, for displaying multiple alignments in LaTeX documents. This package does not currently accept the A2M format that we use for multiple alignments. Adding that capability to TeXshade would be quite useful. We might also want to add to our existing web servers a TeXshade output option for alignments.

Compare different protein family diversity measures

There are several ways to measure the diversity of a family of proteins. For example, we can use the total sequence weight needed to get 0.5 bits/column saved using the recode3.20comp regularizer on a multiple alignment of the family. Another method is to use the ESIZE algorithm described in this paper, which relies on fitting the distribution of scores from a number of random queries.

One could investigate the relationship between effective family size on the one hand and percent sequence identity, pairwise sequence p-values, and other measures of sequence similarity. Characterize existing databases (Pfam, SCOP, etc.) in terms of the sizes (using different measures) of the families represented. The ESIZE algorithm is available as part of the FPS software, available here.

Predicting protein-protein and protein-nucleic-acid interaction sites

It may be useful, once a structure has been determined, to predict what locations on the protein might interact with other proteins. One approach is to predict for each surface residue whether it will interact with other proteins. This could be based on the identity of the residue, its chain neighbors, its neighbors in 3-space, or local structural features of protein. Huan-Xiang Zhou claims a 70% success rate on interface/non-interface predictions (not published? at UCSC Bioinformatics Symposium Dec 2000). The biggest problem with this project is getting training data. Zhou had to use crystal contacts to get enough training data, and it is not clear to me that crystal contacts are representative of biologically interesting protein-protein contacts.

Postdoc Yael Mandel-Gutfreund is particularly interested in predicting protein-nucleic-acid interations and may be willing to help supervise projects in this area.

Open source projects

There are many open-source projects in bioinformatics: bioperl, biojava, biopython, bioxml, biocorba, ensembl, GAME XML, DAS, E-CELL, FAKtory, Telegraph, ... , but UCSC has not been part of these. You can find a partial list under the BOSC 2000 conference site. If you want to contribute something useful to one of these projects--pick the project, identify the need you plan to fill, and write a proposal for the class.

There are some specific open-source projects that would be interesting to me:

Open source library for Dirichlet mixture functions

We have done a lot of pioneering work with the use of mixtures of Dirichlet distributions, but the ideas have been slow to catch on, because many bioinformatics people are scared of the mathematics. An open-source library of useful routines would go a long way toward easing the incorporation of Dirichlet mixtures in programs. We have implemented Dirichlet mixtures in C (for SAM) and C++ (for the "ultimate" library), but neither of these implementations is really modular enough to easily incorporated in other people's code.

I have started an open-source C library (with a random generator for Dirichlet mixtures), and would like to see it expanded into a more complete Dirichlet mixture library, while maintaining a very modular, minimalist approach to the code.

There have been requests for source code for our sequence-weighting techniques (which rely on Dirichlet mixture computations). It would be very useful to extract these sequence weighting techniques from predict-2nd, estimate-dist, or the SAM software suite, and incorporate them into a open-source project. The biggest difference between our different implementations is the different implementations of the multiple-alignment data type. If you started with an open-source project that had an adequate representation, it could be quite straight-forward to port the code.

Regularizer library for BioJava

We have a moderately large library of useful C++ classes in /projects/compbio/ultimate. It might be useful to redesign some of these and release them as part of BioJava. I think that the Alphabet and Regularizer types are particularly useful, though there may need to be some redesign of Alphabet to be compatible with existing BioJava design.

Add to Telegraph open-source dynamic programming project

There is an effort to make an open-source general-purpose dynamic programming package (called Telegraph). Ewan Birney, the author of Dynamite and Genewise, has described Telegraph as "Dynamite done right". Unfortunately, Telegraph is currently far from completed, and exists mainly as PERL code, which is too slow to be of any real use. Creating PERL-callable C modules for Telegraph could make it into an important and useful tool. This is an excellent way to improve your understanding of dynamic programming algorithms.

Yet another structure-structure aligner

It seems that every group wants its own structure-structure aligner---no one is ever happy with the existing ones. The "standard" method now seems to be to extract the secondary structure elements and align them with double dynamic programming, then refine the alignment to minimize RMSD or some function of the difference in the distance maps. Some methods are optimized for making very rapid similarity decisions, others are optimized for high quality alignments. Some use rigid superposition, others try to take into account small amounts of flexibility (say by using distance maps), though almost all structure-structure aligners have trouble handling hinged regions (whether domains or loops). Some methods prefer to align large numbers of residues (even if the correspondence is loose), others align only the most similar parts of the structure (even if this leaves many gaps).

Dali,Vast, CE, and Kenobi are worth looking at for inspiration.

Few of the structure-structure aligners use sequence information to help disambiguate the structural alignment (one exception is Jung and Lee's program Sheba).
Use of residue pairs in protein sequence-sequence and sequence-structure alignments
Protein Science (2000), 9: 1576-1588.
Jongsun Jung And Byungkook Lee
It might be interesting to incorporate information from a posterior decoding matrix (which is generally better than a sequence-sequence alignment) into a Dali-like structure-structure comparison.

For an example of where ignoring sequence can result in poor structural alignments, consider DALI's alignment of 1c1gA, 1c1gB, 1c1gC, and 1c1gD. These are all identical sequences, but 1c1gA and 1c1gB are aligned with only 11% residue identity in FSSP, because the coiled-coils are slightly kinked in the solved structure.

Multiple alignment of protein structures

Most of the multiple alignments of protein structures have been created as "master-slave" alignments. It might be interesting to create multiple alignments of superfamilies or folds by using T-coffee to combine pairwise structural alignments. T-coffee has the ability to accept any number of different alignments of the same pair, so different structural aligners could be used to build the library of pairwise alignments. One problem with this approach is that T-coffee does not seem to have a notion of unaligned residues, which is clearly necessary when aligning structures. It may be necessary to re-implement T-coffee with a better underlying model of alignments (allowing unaligned regions) in order to use the approach to get good multiple structure alignments. Another approach is to use a genetic algorithm to adjust the multiple alignment, as is done for pairwise alignments by Kenobi.

Note: the Dali Domain Dictionary uses T-coffee with DALI structural alignments to create multiple alignments, so there may not be much new that needs to be done here.

Protein legos

The success of the Rosetta fragment packer has led to a lot of interest in finding small, reusable pieces of proteins. Some people have concentrated on pieces with continuous backbones (which I call "fragments"), others have looked for conserved chunks (such as helix-helix packing and helix-sheet packing). Zhipeng Wang (Boston University) has called these repeated 3D chunks "protein legos", and has attempted to find them automatically using the Kenobi structure-structure aligner. The idea is to chop a pair of protein structures up into pieces and form alignments of the pieces, then expand to alignments of parts of the protein. The resulting protein chunks can then be put in a library and all-against-all alignment done to cluster them. Highly populated clusters are common "legos", and their frequency can (perhaps) be used in scoring functions for fragment packers or to guide fragment packers toward popular ways to pack helices and sheets.

The project here could be to build a small "legos" library, or to try to find a use for Zhipeng Weng's library within undertaker, Rosetta, or some other fragment packer.

Build a threader

We have worked with 1D (HMM) methods for protein structure prediction, and have started working on fragment packing, but have never built a traditional threader, which tries to align sequences to fixed 3D frameworks. It would be useful to have a threading program for evaluating score functions and to round out our suite of protein-structure prediction tools. This project would be a good way to learn bout things like the frozen approximation and double dynamic programming (see, for example, David Jones's Threader programs).

The scoring functions for fragment packers and threaders have different constraints, as fragment packers have reason to believe that the local structures of the backbone are reasonable, but need to determine of the overall conformation is protein-like, while threaders know that all conformations are protein-like, but that the residues may be in highly unlikely conformations (locally or in terms of tertiary contacts).

Because the objectives of threading and fragment-packing score functions are different, it may be useful to evaluate score functions in both contexts (before throwing them out as useless).

Hubbard plots

At CASP3 in 1998, Tim Hubbard introduced a new way of looking at the quality of a prediction of protein structure, now called "Hubbard Plots". The plots for CASP3 were very useful in determining whether a particular prediction was good relative to others submitted. Similar plots, using a slightly different method, were created for CASP4, but the results are not public yet.

I would like for us to be able to generate such plots efficiently, using the c++ library of routines that we have developed for handling protein structures, transformations, and so forth.

Note: this project was done last year, but the results were not installed. It may be a very small project to clean up the code and install it, but there are certainly many more features that could be added to the program.

Web interface for SAM-T99 library

We have a large library (over 5700 entries) of multiple alignments and HMMs built using SAM-T99 and SAM-T2K. This library is currently not available to the public, though it is used by the SAM-T99 web page for searches. You can browse the library (with no useful assistance) currently at http://www.soe.ucsc.edu/~karplus/sam-alignment-library/

The project is to provide a web interface that allows retrieving individual multiple alignments and HMMs, by giving PDB sequence ids, keywords, or protein sequences. When we do not have exactly the requested sequence, similar sequences should be offered (perhaps using FSSP's TABLE2 to suggest sequences). The search by sequence should probably use some version of blastp, though an option for doing the more expensive search by scoring the sequence against all models should be provided.

It would be good to offer the alignments in the same formats that the SAM-T99 web page offers (a2m, prettyalign, html, sequence logo) and even a2m_with_dots. In the HTML format, we should show what sequences are structurally aligned by FSSP and give DALI Z-scores, We should also link any PDB sequences to their entries in PDB, FSSP, DSSP, SCOP, ... . We should give DSSP and STRIDE secondary structure in the alignment part of the HTML file.

The interface should be easy to use, with easy maintenance as new multiple alignments are added. There should probably be a queue of old alignments to rebuild, to take advantage of updates to the sequence databases. When an alignment more than x months old is retrieved by a user, it should be queued for rebuilding. Instead of maintaining a consistent FSSP database, we might want to update TABLE2 and TABLE3 weekly, but only update the FSSP files themselves when an alignment is rebuilt.

It would also be a good idea to provide some way to download the entire set of alignments using FTP. This would probably require a script that uses tar to bundle all the existing alignments into a single file and gzips it. Whether this script is called on demand or is run as a cron job on a regular basis depends on how frequent ftp access for the whole set are expected to be. There should probably also be FTP access to individual .a2m files.

We will soon have other such large libraries (such as predictions for the whole human genome). It would be good for the mechanism to be easily modified to incorporate different libraries.

Non-protein projects

Tutorials

If you have very strong writing skills and would rather write English than programs, you can write a few tutorials as your class project. The BME 100 (undergraduate bioinformatics class) still lacks an adequate textbook.

Here are some possible topics (with rough size in lectures for each), though naturally not all these topics could be covered in one course:

  1. 1 Review of fundamental dogma of biology, DNA bases, RNA bases, Amino acids
  2. 2 DNA sequence databases on the web: genomic, unfinished genomic, EST, clustered EST. Specific databases include Genbank, NCBI's NR, dbEST, and (Unigene, STACK, EGAD, or other clustered EST database) Error rates in different DNA databases due to sequencing error
  3. 2 Protein sequence and structure databases on the web: Swiss-Prot, TrEMBL, NCBI's NR, PDB, FSSP, SCOP, CATH Visualization tools: RASMOL or CHIME
  4. 3 Pairwise alignment: substitution matrices and dynamic programming---both Needleman-Wunsch and Smith-Waterman algorithms
  5. 3 Probabilistic models: Bayesian statistics, E-value, null models This will cover what E-values mean, but not the theory for computing them.
  6. 2 Fast search: BLAST algorithm and interpreting BLAST results. A cursory overview of the BLAST algorithm (not enough to re-implement it), with more thorough coverage of the uses of different options to BLAST. Visualization tool: NCBI's pictorial BLAST results.
  7. 4 Multiple sequence alignment: clustal, HMMs, regularizers Information and relative entropy Visualization tools: belvu (or equivalent) and sequence logos
  8. 4 Protein structure prediction: HMMs, threading, 2ry structure prediction, physics-like ab initio prediction
  9. 3 DNA chips: how they work, design of oligos, extracting data from images
  10. 3 mRNA expression analysis: clustering of profiles
  11. 4 Genefinding: HMM structure, signals and content sensors, using EST and homology data, using cross-genome similarity
  12. 4 RNA structure prediction and alignment: energy minimization (Zucker), correlated mutation, stochastic context-free grammars
  13. 6 Phylogeny: parsimony, max likelihood, evolutionary models
  14. 2 DNA assembly
  15. 2 X-ray crystallography---how are models created
  16. 3 NMR spectroscopy---how are models created
  17. 2 2D gel proteomics
  18. 2 Mass spec proteomics
  19. 3 Finding binding sites in DNA regulatory regions

RNA alignment

At UCSC, we have used stochastic context-free grammars (SCFGs)to do multiple alignments of large RNA molecules (particularly the ribosomal RNA molecules). Unfortunately, the full dynamic programming on SCFGs is very slow. Michael Brown came up with one heuristic solution, using constraints derived from an HMM alignment to bound the dynamic programming. This approach worked well, but required a lot of work to implement.

A somewhat simpler approach has been proposed based on the observations of Michael Tanner for similar problems in decoding complex codes. Instead of doing optimum decoding, you are often better off using a better code (that is, a better model of your data) and using a non-optimal decoding scheme. The iterative scheme that Tanner has used is quite simple to implement and can be easily translated into the sort of information we get from an RNA secondary structure model (like an SCFG).



Below this line are LAST year's potential projects. Some of them may be still available, but check with Prof. Karplus.

Drosophila indexing

Bill Sullivan (Department of Molecular, Cell and Developmental Biology) suggests that it would be useful "to have a simple read of the classes of genes by cytological location--for example, what are the locations of all the serine/threonine kinases, or motor proteins, ..."

I don't know whether the Drosophila genome is fully annotated for cytological location (if it is, this would be simple query construction). If it is not (the more likely case), then location prediction would be needed (of course, location information would need to be marked to indicate the degree of certainty). There has been a fair amount of literature on predicting the cytological location of proteins, but there is still a lot that can be done, but in determining what features to consider and in determining what classification algorithms to use.

ATPases

Lincoln Taiz has suggested "something I think needs doing is to carry out an analysis of the various minor subunits of the F- and V-ATPases. (There are about 10 subunits altogether). Crystal structures are still lacking for the holoenzymes. Although functionally analogous, many of the subunits seem to be only distantly related by sequence analysis. Presumably their structures are conserved. Can students detect these structural features?"

This looks like a good application for the protein sequence analysis that Bioinformatics I tries to teach. There are at least 54 PDB chains identified as ATPase, and I'm not sure which of them are of interest. I suspect that bovine mitochondrial f1-atpase (FSSP reps: 1skyB [alpha subunit], 1skyE [beta subunit]) is one of the ones of interest. These are both 3-domain proteins, with SCOP domains 1.69.1.1.3, 2.46.1.1.3, and 3.31.1.10.5. The PDB file 1bmf has a different quaternary structure for the proteins.

Other FSSP representatives (and SCOP domains) of chains that have "atpase" in their names are 1a91 (6.2.1.1.7), 1aw0 (4.48.16.1.2), 1b8xA (1.47.1.1.19, 3.42.1.5.19), 1ba1 (3.50.1.1.1), 1bmfG (1.20.1.1.1), and 1cz4A (2.49.2.3.3, 4.27.1.1.3). I don't know if any of these are relevant, though 1bmfG clearly interacts with the alpha and beta subunits (indeed, it is part of gamma-f-atpase).

See also some earlier work on v-atpase in /projects/compbio/experiments/protein-predict/bowman/v-atpase Note particularly the importance of using reverse-sequence null modles because of the long ampipathic helices in these proteins.

Ion channel (nanopore)

Last year some people worked successfully on data analysis from the ion-channel nanopore project. There may be opportunities for more work there. Contact Mark Akeson and Stephen Winters for more info. The nanopore itself is PDB file 7AHL, and is well worth looking at. The barrel that goes through the membrane is a very pretty one, with charged groups at both ends and no charges (inside or outside) for the length of the barrel.

Evolutionary (phylogenetic) trees

It is often handy to examine a set of sequences and group them by similarity. Evolutionary trees are a popular way of expressing the relationships among sequences. We do not use evolutionary trees much in our tools---not because they aren't useful, but because it is hard to make good ones, and bad ones can be worse than useless.

I have one program (phytree) which is my reimplementation of a somewhat crude phylogenetic tree method developed by Kimmen Sjolander. Despite its rather crude trees, it has been valuable in helping select templates for protein structure prediction, when there are several templates from the same superfamily available.

This program could stand quite a bit of work (for example, eliminating the crufty readseq input routines and using zlib to read compressed a2m alignments). It might also be pedagogically valuable to try implementing some of the standard phylogenetic tree algorithms (such as neighbor joining).

A more ambitious student could try to implement some of the more modern ideas on phylogenetic trees. The Symposium on Competing Technologies for Phylogenetics (SCOPH) (April 19-21, 2001 Universite de Montreal, Montreal, Quebec) lists the following:
Olivier Gascuel, Distance based methods
Tandy Warnow, Fast converging methods
David Bryant, Quartet and combinatorial methods
Andreas Dress, Networks and split decomposition
Kevin Nixon, Parsimony
David Sankoff, Gene order parsimony
David Swofford, Maximum Likelihood
John Huelsenbeck, Bayesian methods
Mike Steel, General stochastic models; Hadamard conjugation
Tom Hagedorn, Phylogenetic invariants
Joe Felsenstein, Validation methods

The meeting is fairly cheap ($50 for students), so if you can get to Montreal just before RECOMB 2001 and are interested in phylogenetics, it looks like the place to be.



slug icon to go to Scool of Engineering home page
SoE home
sketch of Kevin Karplus by Abe
Kevin Karplus's home page
BME-slug-icon
Biomolecular Engineering Department
Karplus's lab page UCSC Bioinformatics research

Questions about page content should be directed to

Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building
Locations of visitors to pages with this footer (started 3 Nov 2008)