![]() | |||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||||||||||||||||||
The first release to the pubic was July 7, 2000, on http://genome.ucsc.edu, and included both the assembly of the May 24 freeze and the substantially larger assembly of the June 15 freeze. All subsequent assemblies have been available at that site as well, although older assemblies are now archived to backup tape to save space.The purpose of this report is to describe the algorithm used by GigAssembler. A description of the assembly itself, and the discoveries that have been made using the assembled working draft sequence, is given in the paper on the working draft genome by the International Sequencing Consortium and related papers [The International Human Genome Sequencing Consortium 2001, Cheung 2001, Tilford 2001, Bentley 2001, Montgomery 2001, Bruls 2001, Riethman 2001, Wolfsberg 2001, Yu 2001, The International SNP Map Working Group 2001, Futreal 2001, Nestler 2001, Tupler 2001, Fahrer 2001, Li 2001, Bock 2001, Pollard 2001, Murray 2001, Clayton 2001]. There are many algorithms that assemble reads from subclones of a single BAC, PAC, cosmid or other clone, or from a whole-genome shotgun of a relatively small and not strongly repetitive genome [Huang 1996, Huang 1999, Green unpublished, Sutton 1995, Bonfield 1995]. The sequencing centers primarily used PHRAP (Green, Hillier, unpublished) to assemble shotgun reads of clones into initial sequence contigs. Myers has pioneered an alternative approach to the assembly of larger genomes, working directly from paired whole-genome shotgun reads [Anson 1999]. This method, embodied in the Celera assembler, was successful for the D. melanogaster genome [Myers 2000] and has been applied to the human genome as well. There has not yet been an opportunity to compare the results of the Celera assembly to GigAssembler's assembly. Most successful assembly algorithms, whether for individual clones or for whole genomes, have been based on greedy methods, and all have certain features in common. They begin by looking for sequence overlaps among the fragments to be assembled, and build up sequence contigs by making the best overlaps first. All must have some heuristics to avoid being confused by repetitive sequence. Some, like CAP [Huang 1996, Huang 1999], have heuristics to detect chimeric fragments in their input, i.e. fragments composed of sequence from two or more nonadjacent places in the genome. Some, like the Celera assembler, CAP3 [Huang 1996] and GigAssembler, build scaffolds that consist of several ordered and oriented sequence contigs separated by sequence gaps. One distinguishing feature of GigAssembler is the variety of information it uses in building scaffolds. Each sequence gap in a GigAssembler scaffold can be bridged by a plasmid end pair, BAC end pair, mRNA, or EST, or be derived directly from ordering information present in a public database accession. The Bellman-Ford algorithm is used to check distance constraints on each scaffold [Cormen 1990, Bonfield 1995]. GigAssembler operates at a very large scale, assembling 2.7 billion bases of the human genome from 4.3 billion bases of input fragments, 1.1 billion bases of EST sequence, two gigabases of paired plasmid reads, and 0.4 billion bases of BAC end sequences.
Assembly Process OverviewThe assembly proceeds according to the following major steps:
The steps will be described in more detail below. Decontaminating and Repeat Masking the SequenceFor the most part the sequencing centers themselves remove bacterial, vector, and other contaminants from the clones before they are deposited in the international public databases. Greg Schuler at NCBI does an additional decontamination step which helps assure that even the older sequenced clones are screened for newly identified contaminants such as bacterial transposons and phages. Checks for rodent contamination and sequences that are a mixture of multiple clones are also made. At the end of this we receive three files containing the sequence for roughly 30,000 clones in roughly 400,000 fragments with known contaminants removed. We also receive a list of dubious clones which are excluded from the assembly. To ease further processing we break up the three large files into a separate file for each clone. We run RepeatMasker (http://ftp.genome.Washington.edu/RM/RepeatMasker.html) using the -q (quick) setting on each clone. The repeat masking is done on our computer cluster and takes about 12 hours. Note that RepeatMasker creates a large number of temporary files in the same directory as the sequence. For acceptable cluster performance therefore we copy the file to be masked to a drive local to the node of the cluster performing the computation, mask it there, and then copy the resulting .out file back to the network drive. Alignment of mRNA, ESTs, BAC Ends and Paired ReadsAlignments of mRNA, ESTs, BAC Ends, and paired plasmid (queries) versus the clone fragments (target) are the raw material for constructing bridges to order and orient the sequence contigs. We developed a program, psLayout, for this purpose. We hope to describe the algorithms used in psLayout in detail in another paper. In brief, psLayout follows the following steps:
PsLayout reports all alignments above a certain minimal quality between a collection of query sequences and a collection of database sequences. PsLayout distinguishes bases contained in repetitive DNA from other bases. All bases in the genomic sequences that RepeatMasker classifies as having 85% or more base identity with a repeating element are considered repetitive, and the remaining bases are considered unique. Unlike many other matching protocols, the repetitive bases are not masked out, but are kept and matches involving them are tallied separately in the alignment scores. A run of matching repetitive bases provides little or no evidence of a genuine overlap between a query and a database sequence, but a run of mismatching repetitive bases can in some circumstances provide strong evidence against a genuine overlap. As an additional measure to help minimize the confounding effect of repeats, especially low copy number repeats which may not be masked by RepeatMasker, the alignments are also passed through a near best in genome filter. In the human genome there exist many regions which have been duplicated over the course of evolutionary time. Most of these duplications are old enough that they have diverged from each other in more than 1% of their bases. In the draft sequence there are overlapping fragments from different BAC clones that cover the same region of the genome. These overlapping fragments only differ from each other where sequencing errors have occurred, typically in substantially less than 1% of the bases since the fragments are assembled from 4x or higher shotgun coverage, and the reads they are based on tend to have been created on modern and well-maintained equipment. On the other hand, the EST and BAC end sequences are single reads from a large variety of sources, and many were done using older more error prone sequencing procedures. A 5% sequencing error rate is not uncommon in these sequences. To help distinguish between true matches and spurious matches due to psuedogenes and other near identically repeated regions we only retain those matches that are near best in genome. Aligments are merged and sorted so that all alignments sharing a common query sequence appear together. A score based largely on percentage base identity is given to each alignment. For each region of the query the best scoring alignment is recorded. Alignments are discarded if they have 1% or greater divergence than the best scoring alignment. For example if an EST aligned to clone A with 96.5% base identity, to clone B with 96% base identity and to clone C with 95% base identity, the alignment to clones A and B would be kept and the alignment to clone C would be discarded. As genomic coverage approaches 100%, the probability of the true match being among those matches that are near best in the genome gets closer to one. Creating the Directory StructureThe GigAssembler program itself operates on fingerprint clone contig at a time. Currently the contigs vary from less then 100,000 bases to more than 60,000,000 bases. GigAssembler reads 13 input files in addition to however many clone sequence files are in the contig, and produces 11 output files for each contig. The input and output files are described in appendix A. It is impractical to specify these all individually in a command line. Instead the command line simply specifies the directory to work in. A collection of a half dozen programs prepare the input from various sources including the Washington Univerity map, the alignments described above, and information provided by NCBI. The result of this is a directory structure with one directory per chromosome and one subdirectory per fingerprint clone contig. The GigAssembler program can then be run in parallel, each contig being assembled by a separate CPU. The GigAssembler ProgramHere we briefly outline the key steps in GigAssembler program itself.
----------------------------------
|||||||||||||||||||||||||
-----------------------------------
where vertical bars indicate matching bases from the region that is considered to align. This region could include some mismatches and indels, but for simplicity, these are not shown. Here the alignment goes to one end of each sequence. However in actual fragment data, because of the presence of low quality data at the ends of many fragments, this is often not the case, even for fragments that should be joined. To classify the unmatched end regions, it is useful at this point to introduce a little terminology, as illustrated in the following diagram of a match.
extension tail
------------------------------------
|||||||||||||||||||||||||
--------------------------------------
tail extension
The non-aligning parts of these two sequences are at the ends and can be divided into "tails" and extensions" as labeled in the picture above. The longer unaligned end on one side or another of an alignment is the extension while the shorter is the tail. To build a raft, GigAssembler assigns a score to each aligning pair. The alignments then processed using the following procedure with the best scoring alignments processed first.
A -----------------
B -------------------
C-------------
D ++++++?????????-------------
The parts of D marked +++ are known to be consistent with the raft because of the C D alignment. The parts marked ---- are ok, because they extend the raft. However the parts marked ???? must also agree with the raft so far. This is checked by looking for alignments between D and the other members of the raft (alignments between D and B, and between D and A). If this checks out D would be added to the raft. The scoring function is crucial here. It is not unusual for the data to conflict. It is important that especially the first joins be based on the strongest matches. The current scoring function strongly favors overlaps that are unique, weakly favors overlaps that are repeat masked, strongly discourages sequence mismatches and inserts within the aligning blocks, and moderately discourages tails. Alignments below a certain threshold of the scoring function are not used to build rafts. The actual C code for the scoring function is included in appendix B.
To build a barge, first the overlap between each clone is calculated by looking at the alignments that were used (but not the ones that were rejected) by the raft-building stage. The clone overlap is just the sum of all the fragment overlaps. Note that the clone overlap gives us relative position but not relative orientation of the two clones since the orientation of fragments within a clone is not necessarily consistent.
A --------------------
-------------------- B
3. Assign "default coordinates" to fragments. This is quite simple. The default coordinate of a fragment is just the barge offset of the clone it is in plus its start position in (database accession) clone coordinates. Default coordinates are then assigned to a raft based on the average of the coordinates of the fragments making up the raft.
AAAAAAAAAAAAAAAAAAAA
a1a1a1a1 a2a2a2a2a2
BBBBBBBBBBBBBBBBBBBBB
b1b1b1b1b1b1 b2b2b2
CCCCCCCCCCCCCCCCCCCC
c1c1c1 c2c2c2c2
As-->--Bs-->--Ae-->--Cs-->--Be-->--Ce
/-->---b2c1----->------|
/ |
/ /->-c2->-\ |
/ / \|
As-->--Bs-->--Ae-->--Cs-->--Be-->--Ce
\ /
\ /
\--->---a1b1a2---->----/
Find the longest, most finished fragment that passes though each section of the raft. Put the best fragment for the first part of the raft into the sequence path. Find an alignment between the best fragment for the first part of the raft and the best fragment for the second part of the raft. Search for a "crossover point" in the alignment where it would be reasonable to switch the sequence path to the next fragment. This crossover point is ideally 200 bases from the end of the larger more finished fragment, but may be adjusted depending on the exact alignment. Repeat steps c and d to extend the sequence path until the end of the raft is reached. Notes:The raft ordering graph built in Steps 4 and 5 specifies a partial order on the midpoints of the rafts from a fingerprint clone contig. Additional constraints on this ordering are given by the distance ranges allowed for each bridge between rafts. The entire collection of partial order and distance constraints is represented by a conjunction of difference constraints, as defined, e.g., in [Cormen 1990], pages 540-543. Each distance constraint has the form x - y <= B, where x and y are variables representing the midpoints of two rafts, and B is a constant, representing a bound on the number of base pairs that separate x and y. If it can be inferred from the partial order that x comes before (5' of) y, then we can specify difference constraints that represent both an upper bound B and a lower bound b on the distance between x and y. The upper bound is expressed as y - x <= B. The lower bound is expressed with the distance constraint x - y <= -b. However, if the order of x and y are unknown, then only an upper bound B on the distance that separates them can be specified as part of a conjunction of distance constraints. This is specified by the two constraints x - y <= B and y - x <= B. A lower bound would require a disjunction of two distance constraints. Finally, if is known that x comes before y but no distance bounds are known, then this fact can be represented by the distance constraint x - y <= 0. The conjunction of distance constraints associated with a raft ordering graph has a feasible solution if and only if there is a placement of the rafts such that their midpoints are linearly ordered in a manner consistent with the partial order in the raft ordering graph, and the distances between the midpoints of bridged pairs of rafts are in the allowed distance ranges. The Bellman-Ford algorithm for single source shortest paths can be used to determine if a conjunction of distance constraints has a feasible solution in time proportional to the number of constraints (bridges) times the number of variables (rafts), as shown in [Cormen 1990]. A similar approach was used for a physical mapping problem by Thayer et al. [Thayer 1999]. The feasibility problem for conjunctions of difference constraints is a special case of the linear programming problem, and can also be solved by linear programming algorithms, but these can be slower. In the GigAssembler algorithm, bridges between rafts are added incrementally in a greedy fashion, and the Bellman-Ford procedure is only used to test feasibility of a new bridge. Because a lower bound cannot be enforced on the distance between the midpoints of two unordered rafts in a conjunction of difference constraints, the solution of the conjunction of difference constraints may give a positioning of the midpoints of the rafts that puts some pairs of rafts too close, possibly overlapping them in a way that contradicts the sequence data. That is why Step 6 is necessary. It would be more elegant to combine disjunctions and conjunctions of distance constraints in specifying the raft layout problem, but unfortunately such a combination leads to an NP-hard feasibility problem [Papadimitriou 1982]. Thus our final layout step employs a simple heuristic approach, rather than an integrated optimization of all constraints. Assessment of the Accuracy of the Orientation and OrderWe created two test sets to assess the accuracy of the Oct. 5 freeze working draft as assembled by GigAssembler. The first, called FinishedContigs, is a collection of 24 clone contigs with a total of 145 clones taken from chromosomes 7, 12, 14, 17, 20, 21, and 22 for which we have finished sequence spanning the entire clone contig. The number of clones per clone contig varies from 4 to 13. We obtained a draft version for 88 of these clones by looking for a previous version of the finished clone in GenBank. The second test set, called Scrambled22, was generated by Ray Wheeler at Neomorphic, Inc. by taking the twelve finished sequence contigs from chromosome 22 and randomly choosing a tiling path of 233 "synthetic" BACs covering them. The sequence of each synthetic BAC was then "draftified" by the introduction of gaps, indels and substitutions in a way that the statistics on the resulting fragments reasonably matched the statistics from fragments of real draft BACs. Finally, the fragments were given random order and orientation. We ran the GigAssembler algorithm on all of the clone contigs from both of these test data sets and compared its predicted order and orientation for the fragments to the true order and orientation of the fragments, as can be derived from the finished sequence. We measured the orientation agreement as the fraction of fragments that were oriented correctly. The average orientation agreement for the FinishedContigs test set was 0.90, and varied from 0.50 (near random guessing) to 1.0 (perfect) among the 23 clone contigs. Performance degraded as the number of fragments per draft clone increased and the size of the fragments decreased. On the 12 contigs of the Scrambled22 test set, the average orientation agreement was about 0.87 and varied from 0.71 to 1.0. To measure the accuracy of the predicted order of the fragments, we counted the number of inversions in the order of the starting positions of the fragments. An inversion occurs when the fragment following fragment A in the predicted order in fact should come before fragment A. For example, if the correct order of the fragments is 1,2,3,4,5,6,7 and the predicted order is 1,5,2,4,7,4,6 then there are 2 inversions, at fragments A=5 and A=7. We measure order agreement as the fraction of fragments in the predicted order where inversions do not occur (excluding the last fragment in the predicted order, which cannot have an inversion). In the above example, the order agreement is 4/6. The average order agreement for the Finished Contigs test set was 0.85, and varied from 0.50 to 1.0 among the 23 clone contigs. On the 12 contigs of the Scrambled22 test set, the average order agreement was 0.83 and varied from 0.74 to 0.93. Conclusions GigAssembler could be improved in several ways. A mechanism to detect misassemblies or chimerism in the initial sequence contigs (fragments) that serve as its inputs could be quite helpful, perhaps along the lines of the mechanism developed for CAP2 [Huang 1996]. Improved use of clone end information might lead to better barge construction in Step 2 of the algorithm as well. Both of these improvements would reduce the rate of artifactual duplication in the draft assembly, which occurs when valid overlaps are not detected and thus the same region of the genome is assembled in two different parts of the assembly. It was estimated that 3% of the Oct 7 working draft genome sequence represented artifactual duplication, some due to lack of complete assembly by GigAssembler, and some due to missed overlaps between fingerprint clone contigs [The International Human Genome Sequencing Consortium 2001]. More use could be made of PHRAP quality scores for individual bases during the assembly as well. Finally, it would be helpful to have some methods to impose an upper bound on the total assembled size of a clone, and to eliminate fragments from the assembly entirely when it appears that they don't belong with the clone, or have other serious problems. The assembly of the working draft of the human genome, while still imperfect, has permitted significant research to go forward, rather than wait years for the finishing of the sequence. In particular, having an assembly has allowed the construction of genome wide gene prediction sets, and the side-by-side comparison of different kinds of genome annotation, including chromosomal band locations, STS positions for genetic, radiation hybrid, YAC-STS and cytogenetic maps, GC content, density of various repeat families, CpG islands, ESTs, mRNAs, SNPs, and both known and predicted genes [The International Human Genome Sequencing Consortium 2001]. These comparisons are possible through the on-line genome browsers at UCSC ( http://genome.ucsc.edu/goldenPath/septTracks.html) and ENSEMBL (http://www.ensembl.org), both of which use the GigAssembler assembly. Some of the discoveries that have been made using these and other tools are described in [The International Human Genome Sequencing Consortium 2001, Cheung 2001, Tilford 2001, Bentley 2001, Montgomery 2001, Bruls 2001, Riethman 2001, Wolfsberg 2001, Yu 2001, The International SNP Map Working Group 2001, Futreal 2001, Nestler 2001, Tupler 2001, Fahrer 2001, Li 2001, Bock 2001, Pollard 2001, Murray 2001, Clayton 2001]. However, the web pages at http://genome.ucsc.edu are currently averaging more than 20,000 hits per day, hence we suspect that the bulk of the new discoveries that this work has enabled have yet to be reported.Acknowledgements D.H. acknowledges support from DOE Grant F603-99ERG2849, Dean Patrick Mantey, and Chancellor M.R. C. Greenwood for the 100 node computational cluster. Thanks to Alan Zahler for his advice and encouragement. We thank Scot Kennedy, Aaron Tomb, Patrick Gavin, Nick Littlestone and Paul Tatarsky for help testing the sequence and for technical support, and Ann Pace and Maria Guarino for administrative support. We thank Bob Waterston, Eric Lander, Francis Collins, Ewan Birney, Greg Schuler, John Sulston, and the rest of the genome analysis group for their advice and additional support. References Altschul, S.F. et al. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25(17): 3389-402. Anson, E., and Gene Myers. 1999. Algorithms for Whole Genome Shotgun Sequencing. Proc. RECOMB '99, Lyon, France. Bentley, D.R. et al. 2001. The physical maps for sequencing human chromosomes 1, 6, 9, 10, 13, 20 and X. Nature, to appear. Bock, J.B., Matern, H.T., Peden, A.A. & Scheller, R.H. 2001. A genomic perspective on membrane compartment organization. Nature, to appear. Bonfield, J.K., Smith, K.F., and R. Staden. 1995. A new DNA sequence assembly program. Nucleic Acids Research 23(24): 4992-4999. Bruls, T. et al. 2001. A physical map of human chromosome 14. Nature, to appear. Cheung, V.G. et al. 2001. From Chromosomal Aberrations to Genes: Linking the Cytogenetic and Sequence Maps of the Human Genome, Nature, to appear. Clayton, J.D., Kyriacou, C.P. & Reppert, S,M. 2001. Keeping time with the human genome. Nature, to appear. Cormen, T.H., Leiserson, C.E., and R.L. Rivest. 1990. Introduction to Algorithms. MIT Press, Cambridge, Mass. Dunham, I. et al. 1999. The DNA sequence of human chromosome 22. Nature 402: 489-495. Fahrer, A.M., Bazan, J.F., Papathanasiou, P. & Goodnow, C.C. 2001. A genomic view of immunology. Nature, to appear. Futreal, P.A. et al. 2001. Cancer and genomics. Nature, to appear. Green, P. Unpublished. PHRAP-sequence-assembly program. (http://www.genome.washington.edu/UWGC/analysistools/phrap.htm) Hattori, M. et al. 2000. The DNA sequence of human chromosome 21. Nature 405: 311-319. Huang, W., and A. Madan. 1999. CAP3: A DNA Sequence Assembly Program," Genome Research 9(9): 868-877. Huang, X. 1996. An Improved Sequence Assembly Program. Genomics 33: 21-31. The International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature, to appear. The International Human Genome Mapping Consortium. 2001. A physical map of the human genome. Nature, to appear. The International SNP Map Working Group. 2001. A map of human genome sequence variation containing 1.4 million single nucleotide polymorphisms. Nature, to appear. Ji, Y., Eichler, E.E., Schwartz, S, and R.D. Nicholls. 2000. Structure of chromosomal duplicons and their role in mediating human genomic disorders. Genome Research 10: 597-610. Li, W. H., Gu, Z., Wang, H. & Nekrutnko, A. 2001. Evolutionary analyses of the human genome. Nature, to appear. Montgomery, K.T. et al. 2001. A high-resolution map of human chromosome 12. Nature, to appear. Mullikin, J. Unpublished. SASHA. Sanger Center. Murray, A.W. & Marks, D. 2001. Can sequencing shed light on cycling? Nature, to appear. Myers, E.W. et al. 2000. A Whole-Genome Assembly of Drosophila. Science 287(9): 868-877. Nestler, E.J. & Landsman, E. 2001. Learning about addiction from the human draft genome. Nature, to appear. Papadimitriou, C.H, and K. Steiglitz. 1982. Combinatorial Optimization. Prentice-Hall, Inc., Englewood Cliffs, New Jersy. Pollard, T.D. 2001. The human genome, the cytoskeleton and motility. Nature, to appear. Riethman, H.C. et al. 2001. Integration of telomeric sequences with the draft human genome sequence. Nature, to appear. Sutton, G.G., White, O., Adams, M.D., and A.R. Kerlavage. 1995. TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects. Genome Science & Technology 1(1): 9-19. Thayer, E.C., Olson, M.V., and R.M. Karp. 1999. Error Checking and Graphical Representation of Multiple-Complete-Digest (MCD) Restriction-Fragment Maps. Genome Research 9(1): 79-90. Tilford, C.A. et al. 2001. A physical map of the human Y chromosome. Nature, to appear. Tupler, R., Perini, G. & Green, M.R. 2001. Gene expression and the human genome. Nature, to appear. Wolfsberg, T.G., Schuler, G.D. & McEntyre, J. 2001. Guide to the draft genome. Nature, to appear. Yu, A. et al. 2001. Comparison of human genetic and sequence-based physical maps. Nature, to appear. Zhao, S. et al. 2000. Human BAC ends quality assessment and sequence analyses. Genomics 63: 321-332.
Appendix A - GigAssembler Inputs and OutputsInputsThe following input files are expected to be in the contig directory before running GigAssembler.
OutputsFollowing are the outputs of GigAssembler. Many of the output file names include a numerical suffix which corresponds to the GigAssembler version number, currently 93, in order to facilitate comparisons between versions. Since the version number changes rapidly it is represented as "NN" in the file names below.
Appendix B - Alignment Scoring FunctionsScoring for Clone Fragment Overlap in Raft Building:int fragOverlapScore(struct psl *alignment)
/* Return score from roughly 500 to -500 for fragment/fragment
overlap based on alignment. */
{
int milliBad; /* Misalignments in rough parts per thousand. */
int startTail; /* Size of starting tail. */
int endTail; /* Size of ending tail. */
/* Calculate basic score. */
milliBad = calcMilliBad(alignment);
findTails(alignment, &startTail, &endTail);
score = -30*milliBad - (startTail+endTail)/2 +
round(25*log(alignment->match+1) + log(alignment->repMatch+1));
/* Add some penalties for not having many unique matches or being
part of a small fragment. */
if (!enclosingPsl(alignment))
{
if (alignment->match < 20)
score -= (20-alignment->match)*25
if (alignment->qSize < 5000)
score -= (5000-alignment->qSize)/40;
}
return score;
}
mRNA Scoring Functionint scoreMrnaPsl(struct psl *ali, boolean isEst)
/* Return score for one mRNA oriented psl. */
{
int milliBad;
int score;
milliBad = calcMilliBad(ali, TRUE);
score = 25*log(1+ali->match) + log(1+ali->repMatch) - 10*milliBad + 10;
if (ali->match <= 10)
score -= (10-ali->match)*25;
if (isEst)
score -= 25;
else
score += 25;
return score;
}
BAC End Pair/Plasmid Pair Scoring Functionint scorePairedPsl(struct psl *ali, int mapDistance, boolean isBac)
/* Return score for one BAC end oriented psl. */
{
int milliBad;
int score;
int startTail, endTail;
milliBad = calcMilliBad(ali, FALSE);
findTails(ali, &startTail, &endTail);
score = 25*log(1+ali->match)
+ log(1+ali->repMatch) - 20*milliBad - 4*(startTail+endTail);
if (ali->match <= 20)
score -= (20-ali->match)*25;
if (mapDistance > 200000)
score -= (mapDistance-200000)/1500;
if (isBac)
score -= 50;
return score;
}
Functions used by other scoring functions:int calcMilliBad(struct psl *ali, boolean isMrna)
/* Return misaligning ratio in roughly parts per thousand. */
{
int qAliSize, tAliSize, aliSize;
int milliBad;
int sizeDif;
int insertFactor;
qAliSize = ali->qEnd - ali->qStart;
tAliSize = ali->tEnd - ali->tStart;
aliSize = min(qAliSize, tAliSize);
if (aliSize <= 0)
return 0;
sizeDif = qAliSize - tAliSize;
if (sizeDif < 0)
{
if (isMrna)
sizeDif = 0;
else
sizeDif = -sizeDif;
}
insertFactor = ali->qNumInsert;
if (!isMrna)
insertFactor += ali->tNumInsert;
milliBad =
(1000 * (ali->misMatch + insertFactor + round(3*log(1+sizeDif)))) / aliSize;
return milliBad;
}
boolean enclosingPsl(struct psl *ali)
/* Return TRUE if one fragment is entirely enclosed by another. */
{
if (ali->tStart <= 300 && ali->tEnd + 300 >= ali->tSize)
return TRUE;
if (ali->qStart <= 300 && ali->qEnd + 300 >= ali->qSize)
return TRUE;
return FALSE;
}
void findTails(struct psl *ali, int *retStartTail, int *retEndTail)
/* Find the length of "tails" (rather than extensions) implied by ali. */
{
int orientation = ali->orientation;
int qFloppyStart, qFloppyEnd;
int tFloppyStart, tFloppyEnd;
if (orientation > 0)
{
qFloppyStart = ali->qStart;
qFloppyEnd = ali->qSize - ali->qEnd;
}
else
{
qFloppyStart = ali->qSize - ali->qEnd;
qFloppyEnd = ali->qStart;
}
tFloppyStart = ali->tStart;
tFloppyEnd = ali->tSize - ali->tEnd;
*retStartTail = min(qFloppyStart, tFloppyStart);
*retEndTail = min(qFloppyEnd, tFloppyEnd);
}
| | ||||||||||||||||||||||||||||||