Gibbs: Gibbs Sampling for Aligning RNA
Discovery of common secondary structure in unaligned RNA sequences
is a challenging problem.
Alignment of multiple RNA sequences entails finding the
common structure among all the sequences.
RNA secondary structure is dominated by the formation of
helices, which are regions where two stretches of the
strand come together and form base pairs.
Searching for the optimal set of helical regions across
multiple sequences means exploring a vast search space.
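To give a feel for the size of that search space, here is a minimal Python sketch that lists candidate helices in a single sequence. It is an illustration only, not code from the Gibbs program: the function name `candidate_helices`, the thresholds `min_len` and `min_loop`, and the choice to allow G-U wobble pairs are all assumptions made for this example.

```python
# Allowed base pairs (assumption): Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def candidate_helices(seq, min_len=3, min_loop=3):
    """Return (i, j, length) for each run of stacked base pairs in seq.

    Pair k of a helix joins seq[i + k] with seq[j - k]. At least min_loop
    unpaired bases must lie between the two bases of the innermost pair.
    Only runs that cannot be extended outward are reported.
    """
    n = len(seq)
    helices = []
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            if (seq[i], seq[j]) not in PAIRS:
                continue
            # Skip starts that merely continue a run beginning at (i - 1, j + 1).
            if i > 0 and j + 1 < n and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue
            # Extend the run inward while the bases pair and the loop stays long enough.
            length = 0
            while ((j - length) - (i + length) > min_loop
                   and (seq[i + length], seq[j - length]) in PAIRS):
                length += 1
            if length >= min_len:
                helices.append((i, j, length))
    return helices


if __name__ == "__main__":
    print(candidate_helices("GGGGAAAACCCC"))
```

Even this 12-base toy hairpin yields three overlapping candidates. Real sequences yield many more, and a common structure has to choose a compatible subset in every sequence at once.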
The first problem is that you don't yet know what you are
looking for; you have to discover the structure.
Then, once you think you have a structure, you have to
see how each of the sequences fits that structure.
We attack this problem with a random search technique
known as Gibbs Sampling.
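The general pattern of Gibbs sampling can be sketched in a few lines of Python. The example below is a deliberately simplified stand-in, not the RNA method described in the paper: it samples one ungapped motif start position per sequence instead of a secondary structure, and it assumes a uniform background model. The function name `gibbs_motif_sampler` and its parameters are hypothetical. What it shares with the real method is the loop: hold one sequence out, estimate model parameters from the others, then draw a new hidden state for the held-out sequence at random in proportion to its probability.

```python
import random

ALPHABET = "ACGU"  # sequences are assumed to contain only these letters


def gibbs_motif_sampler(seqs, width, iters=200, pseudocount=1.0, seed=0):
    """Toy Gibbs sampler: sample one ungapped motif start per sequence."""
    rng = random.Random(seed)
    # Start from a random alignment.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iters):
        for held in range(len(seqs)):
            # Estimate per-column base counts from all sequences except the held-out one.
            counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
            for k, (s, st) in enumerate(zip(seqs, starts)):
                if k == held:
                    continue
                for c in range(width):
                    counts[c][s[st + c]] += 1
            totals = [sum(col.values()) for col in counts]
            # Weight every possible start in the held-out sequence by its
            # likelihood under the current column model.
            s = seqs[held]
            weights = []
            for st in range(len(s) - width + 1):
                w = 1.0
                for c in range(width):
                    w *= counts[c][s[st + c]] / totals[c]
                weights.append(w)
            # Sample the new start instead of taking the maximum; this
            # randomness is what lets the search escape poor local optima.
            starts[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


if __name__ == "__main__":
    example = ["ACGUGGGAUCC", "UUGGGAUACGA", "CAGGGAUUUAC"]
    print(gibbs_motif_sampler(example, width=5))
```

Because each update is a random draw, different seeds can give different alignments; in practice one would run several chains and compare the results.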
The paper and others on the RNA work performed here
at the University of California, Santa Cruz are available
in the UCSC Computational Biology group's RNA FTP directory.
Specific papers of interest:
- L. Grate, M. Herbster, R. Hughey, I.S. Mian, H. Noller, and D. Haussler.
  RNA Modeling Using Gibbs Sampling and Stochastic Context Free Grammars.
  Proceedings, 2nd International Conference on
  Intelligent Systems for Molecular Biology, 235:1501-1531, February 1994.
Abstract
A new method of discovering the common secondary structure of a family
of homologous RNA sequences using Gibbs sampling and
stochastic context-free grammars is proposed.
Given an unaligned set of sequences, a Gibbs sampling step
simultaneously estimates the secondary structure
of each sequence and a set of statistical parameters describing the
common secondary structure of the set as a whole.
These parameters describe a statistical model of the family.
After the Gibbs sampling has produced a crude statistical model for the
family, this model is translated into a stochastic context-free grammar,
which is then refined by an Expectation Maximization (EM)
procedure to produce a more complete model.
A prototype implementation of the method is tested on tRNA, pieces
of 16S rRNA and on U5 snRNA with good results.
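As a rough illustration of what a stochastic context-free grammar for RNA looks like, the Python sketch below scores a simple hairpin under a three-rule toy grammar. The grammar, its probabilities, and the function name `hairpin_log_prob` are all invented for this example and are not taken from the paper. The refinement step mentioned in the abstract would re-estimate such probabilities from the sequences with EM, which is not shown here.

```python
import math

# Hypothetical toy grammar (not the paper's):
#   S -> x S y    with probability T_PAIR   * PAIR_EMIT[(x, y)]
#   S -> x S      with probability T_SINGLE * SINGLE_EMIT[x]
#   S -> (empty)  with probability T_END
T_PAIR, T_SINGLE, T_END = 0.5, 0.4, 0.1

# Made-up emission probabilities; each table sums to 1.
PAIR_EMIT = {
    ("G", "C"): 0.25, ("C", "G"): 0.25,
    ("A", "U"): 0.20, ("U", "A"): 0.20,
    ("G", "U"): 0.05, ("U", "G"): 0.05,
}
SINGLE_EMIT = {b: 0.25 for b in "ACGU"}


def hairpin_log_prob(seq, n_pairs):
    """Log probability of one derivation: the outer n_pairs bases on each
    side are emitted as pairs, and the bases in between as an unpaired loop."""
    if 2 * n_pairs > len(seq):
        raise ValueError("n_pairs is too large for this sequence")
    logp = 0.0
    i, j = 0, len(seq) - 1
    for _ in range(n_pairs):
        p = PAIR_EMIT.get((seq[i], seq[j]), 0.0)
        if p == 0.0:
            return float("-inf")  # the grammar cannot pair these two bases
        logp += math.log(T_PAIR * p)
        i, j = i + 1, j - 1
    for base in seq[i:j + 1]:
        logp += math.log(T_SINGLE * SINGLE_EMIT[base])
    return logp + math.log(T_END)


if __name__ == "__main__":
    seq = "GGGAAAACCC"
    print(hairpin_log_prob(seq, 3), hairpin_log_prob(seq, 0))
```

Under these made-up numbers, the derivation that pairs the three outer bases scores higher than the one that leaves the whole sequence unpaired, which is how a grammar of this kind expresses a preference for a helix.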