Improbizer

Improbizer searches DNA or RNA sequences for motifs that occur more often than would be expected by chance, using a variation of the expectation maximization (EM) algorithm.

Enter your sequences in the large box below, either one sequence per line or in FASTA format. The run time varies with the length of the sequences you paste in and the advanced settings you pick. Longer sequences take more time than the same number of bases in short sequences. 100 sequences of 100 base pairs each will take about a minute.
 

Advanced Settings

Control Runs

Command Line

Advanced Settings

Changing the settings below lets you fine-tune the performance of Improbizer. Click on the hyperlink to the left of a setting to jump to a description of the setting.

Number of motifs to find:
Ignore Location:
Include Reverse Complement:
Maximum Occurrences per Sequence:
Left Align:
Initial motif size:
Restrain Expansionist Tendencies:
Number of Sequences in Initial Scan:
Background Model:
Background Data:

Number of motifs to find: The number of different motifs to look for in your sequences.

Ignore Location: Normally Improbizer considers where in a sequence a pattern is, and will gravitate towards patterns that occur near the same place in each sequence. You definitely want to consider location if your input DNA is all taken from the same place relative to transcription start or a splice site. Otherwise you may consider ignoring location.

Include Reverse Complement: When checked, Improbizer will look for the reverse complement of a motif as well as the motif itself. (Currently this is only partially implemented. Checking it disables the automatic resizing of motifs to fit the data.)
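For readers unfamiliar with the term, here is a minimal Python sketch of what the reverse complement of a DNA motif is. It is only an illustration; the function name is made up for this example and it says nothing about how Improbizer itself implements the option.

```python
# Illustrative only: reverse_complement is a hypothetical helper,
# not part of Improbizer.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement each base (A<->T, C<->G), then reverse the string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

print(reverse_complement("TTTACATCC"))  # prints GGATGTAAA
```

A motif on one strand reads as its reverse complement on the other strand, which is why a binding site can show up in either form in your input.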

Maximum Occurrences per Sequence: The maximum number of times you expect a motif to occur in a single sequence. Two is usually a good value here. Setting this to one will sharpen your results if there truly is one or fewer occurrences of the motif you're looking for in each sequence. Setting this to higher numbers will find motifs that tend to occur in clusters, at the expense of tending to find short but not necessarily relevant repeating motifs.

Left Align: Whether the program should consider the left side or the right side of the input sequences to be aligned with each other. This is only significant if the sequences are of different sizes. Alternatively you can insert "-" characters as spacers to control the alignment.
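If your sequences share a landmark that is at neither end (a splice site, say), the "-" spacers mentioned above can be added before pasting. The following Python sketch shows one way to do that, assuming you already know the index of the landmark in each sequence. The function name and the example data are invented for illustration.

```python
def align_on_anchor(seqs, anchors, spacer="-"):
    """Left-pad each sequence so that its anchor index lands in the same column.

    seqs    -- list of sequence strings
    anchors -- list of 0-based landmark positions, one per sequence
    """
    target = max(anchors)
    return [spacer * (target - a) + s for s, a in zip(seqs, anchors)]

# Hypothetical input: the landmark "GT" starts at index 2, 0 and 3.
padded = align_on_anchor(["AAGTACC", "GTTT", "CCCGTA"], [2, 0, 3])
for line in padded:
    print(line)
```

After padding, the landmark sits in the same column of every sequence, so leaving Left Align checked lines the sequences up on it.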

Initial motif size: This is the size of the initial pattern Improbizer looks for. Improbizer will expand or contract this size to a certain extent on its own based on the size of the patterns in your sequences. However, especially for weak patterns, or the second or subsequent patterns extracted from one sequence set, it can be useful to experiment with smaller or larger sizes.

Restrain Expansionist Tendencies: This allows you to control the tendency of Improbizer to expand the size of the pattern. Reducing this value will encourage longer patterns. Increasing the value will encourage shorter patterns.

Number of Sequences in Initial Scan: Improbizer works in an iterative fashion. It starts with one pattern and scans the data set for matches and near matches to that pattern. It then collects the matches and near matches and averages them together to create a new pattern, one which reflects the sequence data as well as the initial pattern. However, the initial patterns must come from somewhere. Improbizer uses all initial-motif-sized subsequences of the first sequences as initial patterns. It runs each of these subsequences over the entire data set for one iteration, and then takes the most promising-looking subsequences through repeated iterations until the pattern no longer changes to fit the data.

Though the final pattern does depend somewhat on the initial pattern, many different initial patterns will converge to the same final pattern. Because of this, and since screening the initial patterns is the slowest part of the process, by default Improbizer only scans the first twenty sequences you enter for initial patterns. Generally this is good enough. Hopefully your data set is concentrated enough that there will be one or more close matches to the pattern you're looking for among those first sequences. If you grow impatient waiting for Improbizer results, try decreasing the number of sequences in the initial scan to five or ten. If you're worried that the motifs may not be in the first members of your sequence set, either rearrange your sequence set or increase the number of sequences in the initial scan, and be prepared to wait.
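To make the scan-collect-average loop described above more concrete, here is a deliberately simplified Python sketch. It is not Improbizer's code. It assumes one best match per sequence (a "hard" assignment rather than weighted near matches), ignores location, the background model and motif resizing, and uses made-up sequences, a made-up seed weight of 0.7 and a made-up pseudocount of 0.5.

```python
from collections import Counter

BASES = "ACGT"  # sketch assumes sequences contain only these letters

def seed_profile(word, strong=0.7):
    """Turn a seed word into a profile that favors the seed's own letters."""
    weak = (1.0 - strong) / 3
    return [{b: (strong if b == ch else weak) for b in BASES} for ch in word]

def window_score(profile, window):
    """Probability of the window under the profile (product over columns)."""
    score = 1.0
    for column, ch in zip(profile, window):
        score *= column[ch]
    return score

def best_window(profile, seq):
    """Best-scoring window of motif length within one sequence."""
    k = len(profile)
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return max(windows, key=lambda w: window_score(profile, w))

def refine(profile, seqs, pseudo=0.5):
    """One iteration: collect the best match per sequence and average them."""
    hits = [best_window(profile, s) for s in seqs if len(s) >= len(profile)]
    total = len(hits) + len(BASES) * pseudo
    new_profile = []
    for i in range(len(profile)):
        counts = Counter(hit[i] for hit in hits)
        new_profile.append({b: (counts[b] + pseudo) / total for b in BASES})
    return new_profile

def consensus(profile):
    """Most probable base in each column."""
    return "".join(max(BASES, key=lambda b: column[b]) for column in profile)

# Hypothetical data; the seed "TACGTT" is a 6-base window of the second sequence.
seqs = ["GGTACGTAGG", "CCTACGTTCC", "AATACGTAAA", "TTGACGTATT"]
profile = seed_profile("TACGTT")
for _ in range(20):  # iterate until the consensus stops changing
    updated = refine(profile, seqs)
    converged = consensus(updated) == consensus(profile)
    profile = updated
    if converged:
        break
print(consensus(profile))
```

In this toy data the seed is pulled toward the version of the pattern that most sequences share, which is the behavior the text describes: the final pattern reflects the data as well as the starting point.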

Background Model: Improbizer works by finding patterns that occur more frequently than they should by chance (at higher than background levels). The simple way to estimate how frequently a particular oligonucleotide should occur by chance is to raise one quarter to the power of the number of nucleotides in the sequence. This is what you get when you select the "even" background above. However, in some animals, including C. elegans, the chance of encountering an A or T in the genome is quite different from the chance of encountering a C or a G. The Markov 0 model takes this into account by setting the probability of a nucleotide to the fraction that occurs in your input sequences rather than evenly to 0.25. The Markov 1 model extends this further by making the probability of a nucleotide depend on the previous nucleotide in the sequence. For instance, if the previous nucleotide was T, the program estimates the probability of the next nucleotide being a G by looking at the distribution of nucleotides that follow T in your sequence data. The Markov 2 model takes this even further by making the probability of a nucleotide depend on the previous two nucleotides in the sequence. Statisticians refer to the background as the null model. In my experience Markov 0 is best if you are using the same sequences for foreground and background data (see Background Data below). If you are using different background data, try Markov 1 or Markov 2.
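The three simplest background models above can be sketched in a few lines of Python. This is an illustration of the definitions in the text, not Improbizer's implementation. The function names and the AT-rich example data are invented, no pseudocounts are used, and starting the Markov 1 chain with the Markov 0 probability of the first base is an assumption of this sketch.

```python
from collections import Counter, defaultdict

BASES = "ACGT"

def prob_even(word):
    """'Even' background: every base has probability one quarter."""
    return 0.25 ** len(word)

def markov0(seqs):
    """Base frequencies observed in the data (non-ACGT characters ignored)."""
    counts = Counter(ch for s in seqs for ch in s if ch in BASES)
    total = sum(counts.values())
    return {b: counts[b] / total for b in BASES}

def markov1(seqs):
    """Frequency of each base given the base immediately before it."""
    pairs = defaultdict(Counter)
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            if prev in BASES and nxt in BASES:
                pairs[prev][nxt] += 1
    return {prev: {b: c[b] / sum(c.values()) for b in BASES}
            for prev, c in pairs.items()}

def prob_markov0(word, m0):
    p = 1.0
    for ch in word:
        p *= m0[ch]
    return p

def prob_markov1(word, m0, m1):
    p = m0[word[0]]  # assumption: first base uses the Markov 0 frequency
    for prev, nxt in zip(word, word[1:]):
        p *= m1[prev][nxt]
    return p

# Hypothetical AT-rich background data.
background = ["ATATATTA", "TTAATATA"]
m0 = markov0(background)
m1 = markov1(background)
print(prob_even("TATA"))             # 0.25 ** 4
print(prob_markov0("TATA", m0))
print(prob_markov1("TATA", m0, m1))
```

On AT-rich data like this, "TATA" is far more likely by chance under Markov 0 or Markov 1 than under the even model, so a motif finder using those backgrounds is less impressed by it.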

Background Data: This setting lets you select where to get the data used to generate the background model. By default Improbizer will generate the background model from the foreground data you paste in at the top. However, there are times when this approach will actually filter out the pattern that you are looking for, especially if the pattern is composed of a short repeat. Alternatively, you can paste in some data to use for the background model. Ideally this should be a largish data set that is similar to your foreground data set, but that does not contain lots of the particular pattern you are looking for. For instance, if you are looking for a transcription factor binding site that is associated with flowering in Arabidopsis, you might paste in as foreground data promoter regions from genes you know to be upregulated during flowering, and as background data every Arabidopsis promoter region you can find. For convenience (OK, mostly for my convenience) a few background data sets such as "worm intron 3'" are built into the program, and can be selected above rather than pasted in.

Control Runs

The Improbizer will find the number of motifs you tell it to find, no matter what data you give it to work on. To evaluate the significance of the motif you can look at the score to the left of the motif's profile:

12.1546 @ 174.53 sd 78.44 TTTACATCCGTACATTTT
	t  0.557 0.473 0.564 0.066 0.008 0.042 0.486 0.012 0.163 0.063 0.739 0.013 0.025 0.014 0.693 0.523 0.583 0.465 
	c  0.094 0.105 0.059 0.080 0.720 0.043 0.473 0.928 0.762 0.005 0.089 0.014 0.952 0.086 0.239 0.249 0.110 0.206 
	a  0.234 0.329 0.342 0.661 0.219 0.869 0.021 0.048 0.056 0.460 0.098 0.834 0.019 0.890 0.024 0.213 0.291 0.245 
	g  0.115 0.093 0.035 0.193 0.052 0.045 0.020 0.012 0.019 0.473 0.073 0.139 0.003 0.010 0.045 0.015 0.016 0.084 

That is the 12.1546 in the sample above. Unfortunately the significance of an Improbizer profile score is not straightforward. Scores need to be larger on longer sequences to be significant. If you paste in relatively few sequences the score will also need to be larger than if you paste in many sequences (50 sequences are recommended). Allowing multiple occurrences per sequence also raises the score needed for significance.
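As an aid to reading the sample profile, the short Python sketch below copies its four rows and picks the most probable base in each column. In this sample the result is the same string that is printed after the score, which suggests that string is a per-column consensus. That reading is an inference from this one example, not something the output format is documented to guarantee.

```python
# Rows copied from the sample profile above.
profile = {
    "t": [0.557, 0.473, 0.564, 0.066, 0.008, 0.042, 0.486, 0.012, 0.163,
          0.063, 0.739, 0.013, 0.025, 0.014, 0.693, 0.523, 0.583, 0.465],
    "c": [0.094, 0.105, 0.059, 0.080, 0.720, 0.043, 0.473, 0.928, 0.762,
          0.005, 0.089, 0.014, 0.952, 0.086, 0.239, 0.249, 0.110, 0.206],
    "a": [0.234, 0.329, 0.342, 0.661, 0.219, 0.869, 0.021, 0.048, 0.056,
          0.460, 0.098, 0.834, 0.019, 0.890, 0.024, 0.213, 0.291, 0.245],
    "g": [0.115, 0.093, 0.035, 0.193, 0.052, 0.045, 0.020, 0.012, 0.019,
          0.473, 0.073, 0.139, 0.003, 0.010, 0.045, 0.015, 0.016, 0.084],
}

width = len(profile["t"])
# For each column, take the base whose probability is largest.
consensus = "".join(
    max(profile, key=lambda base: profile[base][i]).upper()
    for i in range(width)
)
print(consensus)  # prints TTTACATCCGTACATTTT
```

Note that some columns are close calls (column 7 is 0.486 t against 0.473 c, column 10 is 0.473 g against 0.460 a), so the full profile carries more information than the single string.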

So how do you know if a score is significant? If you've got an advanced degree in statistics you might be able to figure it out from first principles, though it is not a simple problem to approach this way. Otherwise you can compare the score you get on your data with the score Improbizer finds when run on randomly generated data. To do this, set up everything (including pasting in your sequences) as you would for a normal Improbizer run, but press the "Start Control Run" button below instead of the "Submit" button at the top of the page. Unfortunately it takes at least as long to do a control run as a regular run. It takes longer, actually, because you should do several control runs, since the scores vary a bit from one set of random data to another. The Random Scores Table link will take you to where we've recorded the results of a series of control runs on data sets of various sizes under various settings.
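Once you have several control scores, one rough way to compare them with your real score is sketched below in Python. The control scores in the example are invented purely for illustration, and the summary chosen here (mean, standard deviation, and how many controls reach the real score) is a suggestion of this sketch rather than anything Improbizer reports.

```python
from statistics import mean, stdev

def compare_to_controls(real_score, control_scores):
    """Summarize how a real score stands against scores from control runs."""
    mu = mean(control_scores)
    sd = stdev(control_scores)  # needs at least two control runs
    return {
        "control_mean": mu,
        "control_sd": sd,
        "sds_above_mean": (real_score - mu) / sd,
        "controls_at_or_above": sum(s >= real_score for s in control_scores),
    }

# 12.1546 is the score from the sample profile; the control scores are made up.
summary = compare_to_controls(12.1546, [6.1, 5.4, 7.0, 6.5, 5.9])
for name, value in summary.items():
    print(name, round(value, 3))
```

With only a handful of control runs this is a sanity check, not a p-value: a real score that none of the controls approach is encouraging, while one that sits inside their range should be treated with suspicion.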

Random Scores Table