UCSC CMPE 280B Fall 2001 Home Page


Bioinformatics Research Seminar

This course is a weekly research seminar that assumes that students have substantial background in biology, chemistry, computer science, or statistics.

Room: Social Science 2, room 159
Time: 12-1:35 Tuesdays

The seminar will be a journal club, in which students take turns presenting papers from the literature (or their own research). I would like a title and abstract from each presenter at least a week ahead of time to put on this web page.

Tentative schedule

20 Sept 2001 Administrative details, choosing dates

25 Sept 2001 Andy Pohl Clustering Gene Expression Data, and then Annotating the Clusters

Annotating clustered expression data is typically very tedious and requires expertise. I present a way of annotating clustered data using information retrieval methods on abstracts from MEDLINE associated with descriptions for expressed sequence tags obtained from SwissProt. Using CAST clustering, the set of sequences is divided into a variable number of sets of sequences (without replacement). Using hierarchical clustering, a variable number of sets of sequences can also be produced from the dendrogram (with replacement) and used in the same way as the CAST sets. With the sets, rough annotations in the form of word rankings are produced with a simple log-odds probability ratio. The probabilities for the ratio are calculated in three ways, one of which is the popular bag-of-words method. The results of the experiments were mixed. Many of the words returned were not useful, or were already in the descriptions from SwissProt. But some of the words were helpful, and the words seemed to be taken from different levels of abstraction depending on the type of probability used in the log-odds ratio.
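As a rough illustration of the log-odds word ranking described above, here is a minimal Python sketch. It is not the presenter's implementation: the function name, the whitespace tokenization, the add-one pseudocount, and the use of a separate background corpus are all assumptions made for the example, and only a simple bag-of-words probability estimate is shown.

```python
import math
from collections import Counter


def rank_words_by_log_odds(cluster_docs, background_docs, pseudocount=1.0):
    """Rank words by log(P(word | cluster) / P(word | background)).

    Hypothetical helper: probabilities are bag-of-words relative
    frequencies, smoothed with a pseudocount (an assumption).
    """
    cluster_counts = Counter(
        word for doc in cluster_docs for word in doc.lower().split()
    )
    background_counts = Counter(
        word for doc in background_docs for word in doc.lower().split()
    )
    vocabulary = set(cluster_counts) | set(background_counts)
    cluster_total = sum(cluster_counts.values()) + pseudocount * len(vocabulary)
    background_total = (
        sum(background_counts.values()) + pseudocount * len(vocabulary)
    )

    scores = {}
    for word in cluster_counts:  # only words seen in the cluster are ranked
        p_cluster = (cluster_counts[word] + pseudocount) / cluster_total
        p_background = (background_counts[word] + pseudocount) / background_total
        scores[word] = math.log(p_cluster / p_background)
    # Highest log-odds first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Words that are common in the abstracts for one cluster but rare in the background rise to the top of the ranking; words that are common everywhere score near zero.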

2 Oct 2001 (Karplus out of town)

9 Oct 2001 Daryl Thomas

16 Oct 2001 Francisco Useche High-throughput identification, database storage and analysis of SNPs in EST sequences

Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA variation and disease-causing mutations in many organisms. Due to their abundance and slow mutation rate within generations, they are thought to be the next generation of genetic markers that can be used in a myriad of important biological, genetic, pharmacological, and medical applications. There are several strategies, both experimental and in-silico, for SNP discovery and mapping.

Experimental SNP discovery consists of a number of laborious steps that make the process complex and expensive. In-silico discovery has therefore been proposed to overcome this problem. However, in order to apply the in-silico method successfully to large data sets, the following challenges need to be addressed. First, it is necessary to build an integrated SNP pipeline that handles data processing steps smoothly from the beginning (collecting sequence information) to the end (SNP information stored in a database). Second, SNP detection tool parameters have to be optimized to satisfy the specific goals of the project. Finally, SNP data cannot be fully used until the in-silico method is validated experimentally.

This work presents the design and implementation of an in-silico SNP detection software pipeline that exploits the existence of large EST (expressed sequence tag) data sets and effectively addresses the above challenges. First, the pipeline allows for smooth data transition between its different components by implementing data interfaces that translate the data formats of the different tools in the different stages. Second, we optimized PolyBayes parameters for SNP detection in maize ESTs. Finally, we implemented a user interface that, along with the database structure created, allows the scientist to perform preliminary analysis of the data and to compute basic statistics on the SNP data prior to experimental validation.

The pipeline works with two different sequence assemblers, PHRAP and CAT (from DoubleTwist). It uses a Bayesian engine for SNP detection (PolyBayes) and selects the relevant polymorphism information, which is then uploaded into a database. We detected 2439 SNPs and 822 insertions/deletions (INDELs) with a PolyBayes probability higher than 0.99 in the public set of 68,000 maize ESTs from ZmDB (Zea mays DB).
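The selection and database-upload step can be pictured with a small Python sketch. Everything in it is hypothetical: the record fields, the table layout, and the use of an in-memory SQLite database are assumptions for illustration, and it does not parse real PolyBayes output. Only the 0.99 probability cutoff comes from the abstract.

```python
import sqlite3


def store_confident_polymorphisms(candidates, threshold=0.99):
    """Keep candidates whose probability exceeds the threshold and load
    them into an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE polymorphism ("
        "contig TEXT, position INTEGER, kind TEXT, probability REAL)"
    )
    # Strictly greater than the threshold, as in "higher than 0.99".
    kept = [c for c in candidates if c["probability"] > threshold]
    conn.executemany(
        "INSERT INTO polymorphism "
        "VALUES (:contig, :position, :kind, :probability)",
        kept,
    )
    conn.commit()
    return conn


def count_by_kind(conn):
    """Return a basic summary statistic: number of rows per polymorphism kind."""
    rows = conn.execute(
        "SELECT kind, COUNT(*) FROM polymorphism GROUP BY kind ORDER BY kind"
    )
    return dict(rows.fetchall())
```

Keeping the filtered calls in a relational table is what makes simple summaries, such as counts of SNPs versus INDELs, a single query.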

The user interface allowed us to analyze the polymorphism information in several ways right after discovery, giving us insight into the distribution and significance of the newly acquired data.

23 Oct 2001 Ryan Weber

The main application considered in this talk is distinguishing true kinases from randomly permuted kinases that share the same length and amino acid distributions as the true kinases. Numerous methods already exist for this classification task, such as HMMs, motif-matchers, and sequence comparison algorithms. We build on some of these efforts by creating a vector from the output of thousands of structurally based HMMs, created offline from all 2866 Pfam-A seed alignments using SAM-T99; these outputs must then be combined into an overall classification for the protein. We use a Support Vector Machine to classify this large ensemble Pfam-Vector, with polynomial and chi-squared kernels. The vectors of HMM E-values can be constructed completely in parallel, and the 'kernel trick', combined with a cache of previous kernel evaluations, makes the SVM computation relatively fast too. In some respects the chi-squared kernel SVM performs better than the HMMs and better than BLAST pairwise comparisons when predicting true from false kinases, but no one algorithm is best for all purposes or in all instances, so we consider the particular strengths and weaknesses of each.
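For readers unfamiliar with the chi-squared kernel and kernel caching, the following Python sketch shows one common form. It is an assumption-laden illustration, not the speaker's code: the abstract does not say which variant of the chi-squared kernel was used, so the exponential form is assumed here, features are assumed to be non-negative, and the `CachedKernel` class is a hypothetical helper.

```python
import math


def chi_squared_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel for non-negative feature vectors:
    exp(-gamma * sum((x_i - y_i)^2 / (x_i + y_i))).
    """
    distance = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # a term with both features zero contributes nothing
            distance += (a - b) ** 2 / (a + b)
    return math.exp(-gamma * distance)


class CachedKernel:
    """Memoize kernel values by index pair so each pair is computed once."""

    def __init__(self, vectors, kernel):
        self.vectors = vectors
        self.kernel = kernel
        self.cache = {}

    def __call__(self, i, j):
        # The kernel is symmetric, so (i, j) and (j, i) share one entry.
        key = (i, j) if i <= j else (j, i)
        if key not in self.cache:
            self.cache[key] = self.kernel(self.vectors[i], self.vectors[j])
        return self.cache[key]
```

An SVM trainer revisits the same pairs of training vectors many times, so looking values up by index pair avoids recomputing a sum over thousands of HMM scores.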

In addition, we consider the multi-class problem of classifying a known kinase into one of a set of families and sub-families, based on the Hanks classification hierarchy. In this experiment we compare the one-vs-one approach to one-vs-rest for multi-class classification using the Pfam-Vector, and a simple nearest-neighbor classifier. However, in this case the SAM-T99 HMMs built specially for the Hanks families and sub-families are clearly the most accurate.

30 Oct 2001 Hongyun Wang (UCSC Applied Math) Binding Zipper: The energy transduction mechanism of the F1 ATP synthase

We discuss the molecular mechanism by which the F1 ATP synthase converts ATP hydrolysis free energy into a mechanical torque and vice versa. In the hydrolysis cycle, the ATP binding consists of two major parts: (i) ATP diffusing into the catalytic site from solution (ATP docking) and (ii) the multi-step transition from weak binding to tight binding as bonds form sequentially between ATP and the catalytic site (binding transition). The force is generated at the catalytic site during the binding transition. We call this process the binding zipper. Nucleotide hydrolysis weakens the binding sufficiently and distributes the binding over two products so that the products can be released and the reaction cycle can repeat. The same mechanism may operate in other motors driven by ATP hydrolysis. The force generated in the binding transition may be stored in the enzyme and released in the subsequent reaction steps. So viewed from outside, it may appear that the force is generated in other reaction steps.

6 Nov 2001 (class skipped)

13 Nov 2001 Regine Horteur-Edjlali, presenting a journal paper: "The Bioinformatics Template Library---generic components for biocomputing" Bioinformatics 17(8):729-737 (August 2001).

Authors:
W R Pitt, M A Williams, M Steven, B Sweeney, A J Bleasby and D S Moss

Abstract:
Motivation: The efficiency of bioinformatics programmers can be greatly increased through the provision of ready-made software components that can be rapidly combined, with additional bespoke components where necessary, to create finished programs. The new standard for C++ includes an efficient and easy to use library of generic algorithms and data-structures, designed to facilitate low-level component programming. The extension of this library to include functionality that is specifically useful in compute-intensive tasks in bioinformatics and molecular modelling could provide an effective standard for the design of reusable software components within the biocomputing community.

Results: A novel application of generic programming techniques in the form of a library of C++ components called the Bioinformatics Template Library (BTL) is presented. This library will facilitate the rapid development of efficient programs by providing efficient code for many algorithms and data-structures that are commonly used in biocomputing, in a generic form that allows them to be flexibly combined with application specific object-oriented class libraries.

Availability: The BTL is available free of charge from our web site http://www.embl-ebi.ac.uk/FTP/index.html.

Contact: d.moss@mail.cryst.bbk.ac.uk m.williams@biochemistry.ucl.ac.uk

20 Nov 2001 Keith Hoffman

Evaluation of Gene-Finding Programs on Mammalian Sequences by Sanja Rogic, Alan K. Mackworth, and Francis B.F. Ouellette Genome Research Vol. 11, Issue 5, 817-832, May 2001.

Comparative analyses of early gene-finding programs were performed in the mid-90s. Since then, several new gene-finding programs have been developed, dating these earlier comparisons. Additionally, these early comparisons were limited by a lack of separation between training and test data. Rogic, Mackworth, and Ouellette (2001) performed an updated evaluation of recent gene-finding programs using a standardized dataset of human, mouse, and rat genes. They discussed the success of each program in terms of G+C content, exon length, and exon type. The usefulness of exon scores and the impact of phylogenetic specificity on those scores are also discussed. I will discuss the results of their tests for each of the seven gene-finding programs as well as their conclusions regarding the state of gene-finding.

27 Nov 2001 Yael Mandel-Gutfreund Do Alternating Binary Patterns of Polar and Non-Polar Amino Acids Direct Proteins to Fold or to Aggregate?

The sequestration of non-polar residues in the core of globular proteins is generally agreed to be a dominant force in protein folding and stability. Polar(P)/non-polar(N) sequence patterning can be represented by a "binary code", which specifies only the type of amino acid at each position and not its precise identity. We have studied the prevalence of binary patterns in beta-strands and the association between complementary patterns in antiparallel sheets in order to assess their importance for sheet formation.

In contrast to a recent study (Broome B.M. and Hecht M.H. Nature disfavors sequences of alternating polar and non-polar amino acids: implications for amyloidogenesis J Mol Biol 2000 Mar 3;296(4):961-8) that suggested that alternating polar/non-polar patterns are disfavored and are under-represented in natural proteins to avoid aggregation, we found that the most frequent patterns in beta-strands are the purely alternating patterns (PNPNP and NPNPN). Moreover, we observed a highly significant preference for association between complementary patterns, in which the hydrophobic and polar residues pair with one another. To examine Broome and Hecht's hypothesis, the occurrence of binary patterns in amyloidogenic proteins and in short fragments involved directly in amyloid formation has been investigated. Based on our results we propose that alternating patterns are important for the natural formation of beta-sheets in proteins and are not strongly associated with their self-assembly in pathological situations.
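The binary-code idea can be sketched in a few lines of Python. This is only an illustration: the abstract does not list which residues count as non-polar, so the `NON_POLAR` set below is an assumed assignment, and the function names and the default window width of five (chosen to match the PNPNP and NPNPN examples) are hypothetical.

```python
from collections import Counter

# Assumed non-polar residues (one-letter codes); every other residue is
# treated as polar. The abstract does not specify the actual assignment.
NON_POLAR = set("AFILMVWC")


def to_binary_pattern(sequence):
    """Map each residue to N (non-polar) or P (polar)."""
    return "".join("N" if aa in NON_POLAR else "P" for aa in sequence.upper())


def count_patterns(strands, width=5):
    """Count every binary pattern of the given width across all strands."""
    counts = Counter()
    for strand in strands:
        pattern = to_binary_pattern(strand)
        for start in range(len(pattern) - width + 1):
            counts[pattern[start:start + width]] += 1
    return counts
```

Counting windows this way over a set of beta-strand sequences gives the pattern frequencies that can then be compared against what would be expected by chance.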
