[Genome] reply: calculating "score" of blat search
tom
tom at cyber-dyne.com
Tue Dec 4 23:44:37 PST 2001
>when doing blat search over the ucsc website, there is a field named
>'score', which is not in the output of the local copy of blat program I
>have. How are the scores calculated for blat?
Consider taking your favorite gene or protein, pasting it into the Blat
gateway and experimenting: make various levels and kinds of substitutions
and add to a fasta stack. Upon submission, you can readily get at the
basic scoring system: score = matches - mismatches. That is, for 100 aa
in one exon, a perfect alignment gets a score of 100 whereas a one amino
acid change gives a score of 99-1=98. There is no effort to make an
estimate of statistical significance as in Blast; Blat by design focuses on
finding higher quality matches quickly.
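
As a rough illustration of the arithmetic above, here is a minimal Python sketch. It assumes an ungapped alignment given as two equal-length strings; the function name simple_score is made up for this example and is not part of Blat.

```python
def simple_score(query: str, target: str) -> int:
    """Score an ungapped alignment as matches minus mismatches.

    Hypothetical helper for illustration only; it does not reproduce
    how Blat finds or extends an alignment.
    """
    if len(query) != len(target):
        raise ValueError("this sketch only handles ungapped, equal-length alignments")
    matches = sum(1 for q, t in zip(query, target) if q == t)
    mismatches = len(query) - matches
    return matches - mismatches


# A 100-residue exon aligned perfectly, then with a single substitution.
perfect = "A" * 100
one_change = "A" * 50 + "G" + "A" * 49

print(simple_score(perfect, perfect))     # 100
print(simple_score(perfect, one_change))  # 98
```

The single substitution costs two points: one match is lost and one mismatch is subtracted.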
However, it is important to keep in mind that Blat first has to find an
alignment within the genome. It is only relative to this alignment that
score = matches - mismatches can be calculated.
Thus it gets more complicated for insertions and deletions, especially if
these or point changes throw off the ability of Blat to make the alignment
to begin with. In an extreme case, a single base change in a minimal
length exon could cause Blat not to find the exon at all. Hence when it
came time to subtract the mismatch from the matches, there wouldn't be any
matches. This would give a misleadingly low score for a good sequence that
did in fact have this exon within the genome.
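
The missed-exon case can be made concrete with a small sketch. It assumes the score is simply summed over whichever blocks were actually aligned; the exon lengths and the name score_from_blocks are invented for the example.

```python
def score_from_blocks(blocks):
    """Sum matches - mismatches over the blocks that were aligned.

    Each block is a (matches, mismatches) pair, or None when the
    aligner never found that part of the query. Hypothetical model,
    for illustration only.
    """
    found = [b for b in blocks if b is not None]
    return sum(matches - mismatches for matches, mismatches in found)


# A 230-base query made of three exons: 100, 30 and 100 bases.
# The short middle exon carries one base change.
all_found = [(100, 0), (29, 1), (100, 0)]  # middle exon still aligned
exon_missed = [(100, 0), None, (100, 0)]   # middle exon not found at all

print(score_from_blocks(all_found))    # 228
print(score_from_blocks(exon_missed))  # 200
```

Under these assumptions one base change lowers the score by 30 rather than 2, because the whole short exon drops out of the alignment.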
Thus it is a good idea to consider the score along with query size and
percent identity, and also to view the alignment itself within the details
page. No single definition of score can adequately capture the nuances of
all situations that arise. In practice it works quite well to sort Blat
output by score; often the lower scores can be discarded.
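
A sketch of that workflow, assuming each hit has already been reduced to its match and mismatch counts plus the query size. The Hit record, the rank_hits function and the cutoff of 50 are all hypothetical choices for this example, not anything Blat defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one alignment; fields chosen for this sketch."""
    name: str
    matches: int
    mismatches: int
    query_size: int

    @property
    def score(self) -> int:
        return self.matches - self.mismatches

    @property
    def percent_identity(self) -> float:
        aligned = self.matches + self.mismatches
        return 100.0 * self.matches / aligned if aligned else 0.0


def rank_hits(hits, min_score):
    """Drop hits scoring below min_score and sort the rest, best first."""
    kept = [h for h in hits if h.score >= min_score]
    return sorted(kept, key=lambda h: h.score, reverse=True)


hits = [
    Hit("hitA", matches=99, mismatches=1, query_size=100),
    Hit("hitB", matches=30, mismatches=2, query_size=100),
    Hit("hitC", matches=100, mismatches=0, query_size=100),
]

for h in rank_hits(hits, min_score=50):
    # Report score next to query size and identity, not on its own.
    print(h.name, h.score, h.query_size, round(h.percent_identity, 1))
```

Printing the query size and percent identity next to each score keeps a partial alignment from being mistaken for a poor one.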
Blat uses other types of scores internally in the process of building its
alignments. These reflect gap lengths and differ according to the query type
being processed. Blat, as used to create certain genomewide tracks, is
somewhat different from the online version for short queries, which is of
necessity less computationally intensive. Details of internal scoring schemes
of Blat and its overall n-mer alignment seed strategy will be described
shortly in a paper by WJ Kent.