Home   -   Genome Browser   -   Blat Search   -   FAQ   -   User Guide  
 

GigAssembler: An Algorithm for the Initial Assembly
of the Human Genome Working Draft

 

W. James Kent, David Haussler

 

UCSC-CRL-00-17

December 27, 2000

 

 

Department of Biology,

University of California at Santa Cruz

Santa Cruz, CA 95064 USA

 

Howard Hughes Medical Institute

Department of Computer Science

University of California at Santa Cruz

Santa Cruz, CA 95064 USA

 

 

ABSTRACT

The data for the public working draft of the human genome contains roughly 400,000 sequence fragments in approximately 30,000 clones. Many of these sequence fragments overlap. A program, GigAssembler, was built to merge overlapping fragments, and to order and orient the resulting larger fragments based on mRNA, paired plasmid ends, EST, BAC end pairs, and other information. This program produced the first publicly available assembly of the human genome, an assembled working draft of the human genome containing roughly 2.7 billion base pairs and covering an estimated 88% of the genome that has been used for several recent studies of the genome. Here we describe the algorithm used by GigAssembler.

 

Introduction

On May 24, 2000, the public Human Genome Project staged the first "freeze" of all currently available sequence data, coordinated by the director, Francis Collins, Greg Schuler at the National Center for Biotechnology Information, Adam Felsenfeld at the National Human Genome Research Institute, and the sixteen primary public human sequencing centers. Public database accessions for approximately 22,000 shotgun-sequenced clones were selected for this freeze, mostly Bacterial Artificial Chromosome (BAC) clones [The International Human Genome Sequencing Consortium 2001]. The sequence contigs were extracted from these clones and cleaned up as necessary by Schuler. We will refer to these contigs as initial sequence contigs, or "fragments" for short. There were about 375,000 such fragments. The complete public human genome sequence is not projected to be available until 2003. To get a useable working draft in the short term, it was necessary to order and orient these fragments of genome sequence along the 20 unfinished autosomes and two sex chromosomes as best as the data permitted, detecting overlaps and building larger sequence contigs where possible.

A group led by Robert Waterston at the Washington University Genome Sequencing Center (WUGSC) created a map of the clones from the May 24 freeze, based on the genome-wide physical map they had developed of approximately 300,000 clones, primarily using fingerprint overlaps, but also employing information from radiation hybrid, genetic, YAC-STS, and cytogenetic maps, as well as BAC end matches [The International Human Genome Sequencing Consortium 2001]. This map consisted of some 1700 fingerprint clone contigs, each with an approximate chromosomal location, plus a few additional contigs that could not be reliably placed on a chromosome. The end points of the individual clones, as well as their overlaps and relative order along the chromosome, were only very roughly determined in these contigs. Thus the problem of clone order needed to be solved along with the problem of fragment order and orientation. Fragments from different clones within a contig often showed long sequence overlaps, giving strong evidence of clone order, but not giving an entirely unambiguous signal due to the occasional presence of near exact duplicated regions [Ji 2000].

These two problems were somewhat intertwined: clone order information from the fingerprint map was needed during the assembly of the fragments within a fingerprint clone contig, and this assembly led to refined clone order.

Further evidence of fragment order and orientation was obtained from sequence matches between the fragments and mRNA or EST sequences. These matches help to order and orient the fragments that contain the exons of a gene, even if these exons are separated by quite long introns, improving the usefulness of the working draft for the study of genes. A greater number of matches could be found between the fragments from the shotgun-sequenced clones from the freeze and the paired ends of approximately 500,000 BAC clones that were only end-sequenced [Zhao 2000]. These matches also provide order and orientation information for the fragments, but can be misleading because more than 15% of the BAC end sequences are mispaired [Zhao 2000].

A greedy [Cormen 1990] algorithm, called GigAssembler was developed to use the fragment, map, mRNA, EST and BAC end data to assemble the genome sequence of the May 24 freeze. The resulting assembly, produced in mid June, consisted of 2,182,660,273 base pairs covering about 70% of the genome. Since that time significant amounts of new human sequence has been added to the public databases, and new freezes have been declared periodically. The GigAssembler algorithm has developed further as well during that time. Starting with the September freeze, matches from the paired end sequences of plasmid clones were added to improve the ordering and orientation. These were taken from approximately one million genome-wide random reads from the Genome Center at the Whitehead Institute for Biomedical Research (WIBR) in the assembly of the September 5 freeze, and about one million further reads from the Sanger Centre and WUGSC in the October 7 freeze, made available through the SNP consortium, http://snp.cshl.org. In addition, ordering and orientation information for the fragments that is contained in some of the public database accessions was extracted and used by GigAssembler in later assemblies, along with information from assembled contigs of finished sequence ("NT contigs"). A history of the assemblies is given in Table 1.

 

Table 1: Working Drafts Produced by GigAssembler

Freeze

Input (GB)

Output (GB)

% complete

May 24, 2000

3.3

2.2

70

June 15, 2000

3.6

2.5

82

July 17, 2000

4

2.7

87

Sept. 5, 2000

4.1

2.7

87

Oct. 7, 2000

4.2

2.7

88

The first release to the pubic was July 7, 2000, on http://genome.ucsc.edu, and included both the assembly of the May 24 freeze and the substantially larger assembly of the June 15 freeze. All subsequent assemblies have been available at that site as well, although older assemblies are now archived to backup tape to save space.

The purpose of this report is to describe the algorithm used by GigAssembler. A description of the assembly itself, and the discoveries that have been made using the assembled working draft sequence, is given in the paper on the working draft genome by the International Sequencing Consortium and related papers [The International Human Genome Sequencing Consortium 2001, Cheung 2001, Tilford 2001, Bentley 2001, Montgomery 2001, Bruls 2001, Riethman 2001, Wolfsberg 2001, Yu 2001, The International SNP Map Working Group 2001, Futreal 2001, Nestler 2001, Tupler 2001, Fahrer 2001, Li 2001, Bock 2001, Pollard 2001, Murray 2001, Clayton 2001].

There are many algorithms that assemble reads from subclones of a single BAC, PAC, cosmid or other clone, or from a whole-genome shotgun of a relatively small and not strongly repetitive genome [Huang 1996, Huang 1999, Green unpublished, Sutton 1995, Bonfield 1995]. The sequencing centers primarily used PHRAP (Green, Hillier, unpublished) to assemble shotgun reads of clones into initial sequence contigs. Myers has pioneered an alternative approach to the assembly of larger genomes, working directly from paired whole-genome shotgun reads [Anson 1999]. This method, embodied in the Celera assembler, was successful for the D. melanogaster genome [Myers 2000] and has been applied to the human genome as well. There has not yet been an opportunity to compare the results of the Celera assembly to GigAssembler's assembly.

Most successful assembly algorithms, whether for individual clones or for whole genomes, have been based on greedy methods, and all have certain features in common. They begin by looking for sequence overlaps among the fragments to be assembled, and build up sequence contigs by making the best overlaps first. All must have some heuristics to avoid being confused by repetitive sequence. Some, like CAP [Huang 1996, Huang 1999], have heuristics to detect chimeric fragments in their input, i.e. fragments composed of sequence from two or more nonadjacent places in the genome. Some, like the Celera assembler, CAP3 [Huang 1996] and GigAssembler, build scaffolds that consist of several ordered and oriented sequence contigs separated by sequence gaps. One distinguishing feature of GigAssembler is the variety of information it uses in building scaffolds. Each sequence gap in a GigAssembler scaffold can be bridged by a plasmid end pair, BAC end pair, mRNA, or EST, or be derived directly from ordering information present in a public database accession. The Bellman-Ford algorithm is used to check distance constraints on each scaffold [Cormen 1990, Bonfield 1995]. GigAssembler operates at a very large scale, assembling 2.7 billion bases of the human genome from 4.3 billion bases of input fragments, 1.1 billion bases of EST sequence, two gigabases of paired plasmid reads, and 0.4 billion bases of BAC end sequences.

 

Assembly Process Overview

The assembly proceeds according to the following major steps:

  1. Decontaminating and repeat masking the sequence.
  2. Alignment of mRNA, EST, BAC end, and paired plasmid reads against genomic fragments. On a cluster of one hundred 800 MhZ Pentium III CPUs running Linux this takes about three days.
  3. Creating an input directory structure with using Washington University map and other data. This step takes about an hour on a single computer.
  4. For each fingerprint clone contig, aligning the fragments within that contig against each other. This takes about three hours on the cluster.
  5. Using the GigAssembler program within each fingerprint clone contig to merge overlapping fragments and to order and orient the resulting sequence contigs into scaffolds. This takes about two hours on the cluster.
  6. Combining the contig assemblies into full chromosome assemblies. This takes about twenty minutes on one computer.

The steps will be described in more detail below.

Decontaminating and Repeat Masking the Sequence

For the most part the sequencing centers themselves remove bacterial, vector, and other contaminants from the clones before they are deposited in the international public databases. Greg Schuler at NCBI does an additional decontamination step which helps assure that even the older sequenced clones are screened for newly identified contaminants such as bacterial transposons and phages. Checks for rodent contamination and sequences that are a mixture of multiple clones are also made. At the end of this we receive three files containing the sequence for roughly 30,000 clones in roughly 400,000 fragments with known contaminants removed. We also receive a list of dubious clones which are excluded from the assembly.

To ease further processing we break up the three large files into a separate file for each clone. We run RepeatMasker (http://ftp.genome.Washington.edu/RM/RepeatMasker.html) using the -q (quick) setting on each clone. The repeat masking is done on our computer cluster and takes about 12 hours. Note that RepeatMasker creates a large number of temporary files in the same directory as the sequence. For acceptable cluster performance therefore we copy the file to be masked to a drive local to the node of the cluster performing the computation, mask it there, and then copy the resulting .out file back to the network drive.

Alignment of mRNA, ESTs, BAC Ends and Paired Reads

Alignments of mRNA, ESTs, BAC Ends, and paired plasmid (queries) versus the clone fragments (target) are the raw material for constructing bridges to order and orient the sequence contigs. We developed a program, psLayout, for this purpose. We hope to describe the algorithms used in psLayout in detail in another paper. In brief, psLayout follows the following steps:

  1. Build up an index of all of the 10-mers in a 30 megabase subset of the target sequence. Exclude simple repeating elements from this index.
  2. Break up the query sequence into overlapping 500 base regions.
  3. Look up the 10-mers that occur in the 500 base region in the index and identify clusters of matching 10-mers in the target sequence which represent likely areas of homology.
  4. Do a detailed alignment between the 500 base region and the sections of the target sequence identified in step 3. In the case of mRNAs and ESTs this alignment is done in a fashion that tolerates introns.
  5. Stitch together the alignments from overlapping 500 base regions using a dynamic programming algorithm.

PsLayout reports all alignments above a certain minimal quality between a collection of query sequences and a collection of database sequences. PsLayout distinguishes bases contained in repetitive DNA from other bases. All bases in the genomic sequences that RepeatMasker classifies as having 85% or more base identity with a repeating element are considered repetitive, and the remaining bases are considered unique. Unlike many other matching protocols, the repetitive bases are not masked out, but are kept and matches involving them are tallied separately in the alignment scores. A run of matching repetitive bases provides little or no evidence of a genuine overlap between a query and a database sequence, but a run of mismatching repetitive bases can in some circumstances provide strong evidence against a genuine overlap.

As an additional measure to help minimize the confounding effect of repeats, especially low copy number repeats which may not be masked by RepeatMasker, the alignments are also passed through a near best in genome filter. In the human genome there exist many regions which have been duplicated over the course of evolutionary time. Most of these duplications are old enough that they have diverged from each other in more than 1% of their bases. In the draft sequence there are overlapping fragments from different BAC clones that cover the same region of the genome. These overlapping fragments only differ from each other where sequencing errors have occurred, typically in substantially less than 1% of the bases since the fragments are assembled from 4x or higher shotgun coverage, and the reads they are based on tend to have been created on modern and well-maintained equipment. On the other hand, the EST and BAC end sequences are single reads from a large variety of sources, and many were done using older more error prone sequencing procedures. A 5% sequencing error rate is not uncommon in these sequences. To help distinguish between true matches and spurious matches due to psuedogenes and other near identically repeated regions we only retain those matches that are near best in genome. Aligments are merged and sorted so that all alignments sharing a common query sequence appear together. A score based largely on percentage base identity is given to each alignment. For each region of the query the best scoring alignment is recorded. Alignments are discarded if they have 1% or greater divergence than the best scoring alignment. For example if an EST aligned to clone A with 96.5% base identity, to clone B with 96% base identity and to clone C with 95% base identity, the alignment to clones A and B would be kept and the alignment to clone C would be discarded. As genomic coverage approaches 100%, the probability of the true match being among those matches that are near best in the genome gets closer to one.

Creating the Directory Structure

The GigAssembler program itself operates on fingerprint clone contig at a time. Currently the contigs vary from less then 100,000 bases to more than 60,000,000 bases. GigAssembler reads 13 input files in addition to however many clone sequence files are in the contig, and produces 11 output files for each contig. The input and output files are described in appendix A. It is impractical to specify these all individually in a command line. Instead the command line simply specifies the directory to work in. A collection of a half dozen programs prepare the input from various sources including the Washington Univerity map, the alignments described above, and information provided by NCBI. The result of this is a directory structure with one directory per chromosome and one subdirectory per fingerprint clone contig. The GigAssembler program can then be run in parallel, each contig being assembled by a separate CPU.

The GigAssembler Program

Here we briefly outline the key steps in GigAssembler program itself.

  1. Build merged sequence contigs (called "rafts" for short) from overlapping initial sequence contigs ("fragments"). A fragment is either a sequence contigs from the accession of a draft clone, or constructed contig of finished sequence (A.K.A. an "NT contig") prepared at NCBI by Greg Schuler. In a perfect world the alignment joining two fragments would look like so:
      ----------------------------------
               |||||||||||||||||||||||||
               -----------------------------------

where vertical bars indicate matching bases from the region that is considered to align. This region could include some mismatches and indels, but for simplicity, these are not shown. Here the alignment goes to one end of each sequence. However in actual fragment data, because of the presence of low quality data at the ends of many fragments, this is often not the case, even for fragments that should be joined. To classify the unmatched end regions, it is useful at this point to introduce a little terminology, as illustrated in the following diagram of a match.

      extension                         tail

      ------------------------------------
               |||||||||||||||||||||||||
            --------------------------------------

          tail                           extension

 

The non-aligning parts of these two sequences are at the ends and can be divided into "tails" and extensions" as labeled in the picture above. The longer unaligned end on one side or another of an alignment is the extension while the shorter is the tail.

To build a raft, GigAssembler assigns a score to each aligning pair. The alignments then processed using the following procedure with the best scoring alignments processed first.

  1. If neither fragment in the alignment is in a raft make a raft out of the two.
  2. If one fragment is in a raft but not the other, the other fragment is added to the raft if it does not conflict with what is already in the raft. Consider the raft of the fragments A, B, C, and an alignment involving C and D:
  3.                  A -----------------
                          B -------------------
                        C-------------
                              D ++++++?????????-------------

     

    The parts of D marked +++ are known to be consistent with the raft because of the C D alignment. The parts marked ---- are ok, because they extend the raft. However the parts marked ???? must also agree with the raft so far. This is checked by looking for alignments between D and the other members of the raft (alignments between D and B, and between D and A). If this checks out D would be added to the raft.

  4. If one fragment is in a raft and the other in another raft, the two rafts are merged, using a procedure like that described for case (b).

The scoring function is crucial here. It is not unusual for the data to conflict. It is important that especially the first joins be based on the strongest matches. The current scoring function strongly favors overlaps that are unique, weakly favors overlaps that are repeat masked, strongly discourages sequence mismatches and inserts within the aligning blocks, and moderately discourages tails. Alignments below a certain threshold of the scoring function are not used to build rafts. The actual C code for the scoring function is included in appendix B.

  1. Build sequenced clone contigs (called "barges" for short) from overlapping clones. This procedure is somewhat similar to raft building. Barges are called barges because they are similar to rafts, but are built from larger pieces.

To build a barge, first the overlap between each clone is calculated by looking at the alignments that were used (but not the ones that were rejected) by the raft-building stage. The clone overlap is just the sum of all the fragment overlaps. Note that the clone overlap gives us relative position but not relative orientation of the two clones since the orientation of fragments within a clone is not necessarily consistent.

The overlaps are used to order the clones in the following manner.
      1. Clones that are completely enclosed by another clone are put aside.
      2. A clone is selected and the most overlapping other clone is joined with it to initialize an ordered list of clones:
      3.  
                             A --------------------
                                        -------------------- B
      4. Given an ordered list of clones ABCD, while there are still clones in the fingerprint clone contig that significantly overlap clones in this list, the clone X that overlaps as much as possible with another element on the list is selected and inserted into the list as follows:
        1. If X overlaps A but not B, the order becomes XABCD
        2. If a clone overlaps with the first part of B but not the last part of A, it is taken as evidence of a repeat or a misassembled or chimeric fragment and excluded from the ordered list.
        3. Otherwise if X overlaps A and B, and overlap(A,B) < overlap(A,X) and overlap(A,B) < overlap(B,X) then the order becomes AXBCD
        4. Otherwise steps i and ii are repeated shifting the list so that C is considered in place of B and B in place of A.
        5. If there are still clones left in the fingerprint clone contig that have not been added to the ordered list after the iterations of the above steps cease, then a new barge is started to accommodate the remaining clones.
      1. The order of the clones in each barge is compared to the fingerprint map, and if the barge looks backwards the order of its clones is reversed..
      2. Barge coordinates are given in the following manner:
        1. The first clone is given an offset of zero.
        2. For the nth clone, Offset(n+1) = Offset(n) + Size(n) - Overlap(n,n+1)
      1. Clones that are completely enclosed are given the coordinate:
Offset(inner) = Offset(outer) + (Size(outer) - Size(inner)) / 2

3. Assign "default coordinates" to fragments. This is quite simple. The default coordinate of a fragment is just the barge offset of the clone it is in plus its start position in (database accession) clone coordinates. Default coordinates are then assigned to a raft based on the average of the coordinates of the fragments making up the raft.

  1. Build a "raft ordering" graph. This is a directed graph with two types of nodes: rafts, and clone end points. An edge from one node to another implies the first node comes before the second. Associated with each edge is also a range of distances allowed between the centers of the objects represented by the two nodes.
  2. As an example consider the clones A, B, C containing the fragments a1, a2, b1,b2, and c1,c2 laid out as so:
           AAAAAAAAAAAAAAAAAAAA
           a1a1a1a1  a2a2a2a2a2
                 BBBBBBBBBBBBBBBBBBBBB
                 b1b1b1b1b1b1   b2b2b2
                                   CCCCCCCCCCCCCCCCCCCC
                                   c1c1c1      c2c2c2c2
     
    The rafts are thus a1b1a2, b2c1 and c2. The initial graph would just contain the ends of the clones. Representing the start and end of clone A as As and Ae this looks like:
           As-->--Bs-->--Ae-->--Cs-->--Be-->--Ce
    Adding the rafts gives you:
                      /-->---b2c1----->------|
                     /                       |
                    /             /->-c2->-\ |
                   /             /          \|
          As-->--Bs-->--Ae-->--Cs-->--Be-->--Ce
           \                          /
            \                        /
             \--->---a1b1a2---->----/
  3. Add information from mRNAs, ESTs, paired plasmid reads, BAC end pairs, and ordering information on the fragments that was submitted by the sequencing center. This information is used to connect rafts in the ordering graph in a three step process - building a "bridge" out of alignments of other data with genomic fragments, scoring the bridge, and then adding bridges one at a time, best scoring first to the ordering graph. A bridge defines order and orientation of the fragments, as well as an allowable range of distances between them. The score function for bridges is the sum of two factors. The first factor is based on the type of the information. mRNA information is given the highest weight, then paired plasmid reads, information provided by the sequencing centers, ESTs, and BAC end matches in that order. The second factor is based on the strength of the underlying alignment, and is very similar to the score used for building rafts. Bridges that would conflict with the graph as constructed so far are rejected. Conflicts are detected using Bellman-Ford algorithm as described in the note below.
  4. A similar bridge graph is built between barges, but it is a simple rather than a directed graph.
  5. Walk the bridge graph to get an ordering of rafts. Each bridge is walked in the order of the default coordinates assigned in step 3 subject to the constraint that if a raft has predecessors, all the predecessors must be walked before the raft is walked.
  6. A sequence path through each raft is built as follows:
  7. Find the longest, most finished fragment that passes though each section of the raft. Put the best fragment for the first part of the raft into the sequence path. Find an alignment between the best fragment for the first part of the raft and the best fragment for the second part of the raft. Search for a "crossover point" in the alignment where it would be reasonable to switch the sequence path to the next fragment. This crossover point is ideally 200 bases from the end of the larger more finished fragment, but may be adjusted depending on the exact alignment. Repeat steps c and d to extend the sequence path until the end of the raft is reached.

  8. Build the final sequence for the fingerprint clone contig by inserting the appropriate number of N's between raft sequence paths. Currently 100 N's are inserted between rafts that are part of the same barge, 50,000 N's between barges that are bridged, and 100,000 N's between unbridged barges.

Notes:

The raft ordering graph built in Steps 4 and 5 specifies a partial order on the midpoints of the rafts from a fingerprint clone contig. Additional constraints on this ordering are given by the distance ranges allowed for each bridge between rafts. The entire collection of partial order and distance constraints is represented by a conjunction of difference constraints, as defined, e.g., in [Cormen 1990], pages 540-543. Each distance constraint has the form x - y <= B, where x and y are variables representing the midpoints of two rafts, and B is a constant, representing a bound on the number of base pairs that separate x and y. If it can be inferred from the partial order that x comes before (5' of) y, then we can specify difference constraints that represent both an upper bound B and a lower bound b on the distance between x and y. The upper bound is expressed as y - x <= B. The lower bound is expressed with the distance constraint x - y <= -b. However, if the order of x and y are unknown, then only an upper bound B on the distance that separates them can be specified as part of a conjunction of distance constraints. This is specified by the two constraints x - y <= B and y - x <= B. A lower bound would require a disjunction of two distance constraints. Finally, if is known that x comes before y but no distance bounds are known, then this fact can be represented by the distance constraint x - y <= 0.

The conjunction of distance constraints associated with a raft ordering graph has a feasible solution if and only if there is a placement of the rafts such that their midpoints are linearly ordered in a manner consistent with the partial order in the raft ordering graph, and the distances between the midpoints of bridged pairs of rafts are in the allowed distance ranges. The Bellman-Ford algorithm for single source shortest paths can be used to determine if a conjunction of distance constraints has a feasible solution in time proportional to the number of constraints (bridges) times the number of variables (rafts), as shown in [Cormen 1990]. A similar approach was used for a physical mapping problem by Thayer et al. [Thayer 1999]. The feasibility problem for conjunctions of difference constraints is a special case of the linear programming problem, and can also be solved by linear programming algorithms, but these can be slower.

In the GigAssembler algorithm, bridges between rafts are added incrementally in a greedy fashion, and the Bellman-Ford procedure is only used to test feasibility of a new bridge. Because a lower bound cannot be enforced on the distance between the midpoints of two unordered rafts in a conjunction of difference constraints, the solution of the conjunction of difference constraints may give a positioning of the midpoints of the rafts that puts some pairs of rafts too close, possibly overlapping them in a way that contradicts the sequence data. That is why Step 6 is necessary. It would be more elegant to combine disjunctions and conjunctions of distance constraints in specifying the raft layout problem, but unfortunately such a combination leads to an NP-hard feasibility problem [Papadimitriou 1982]. Thus our final layout step employs a simple heuristic approach, rather than an integrated optimization of all constraints.

Assessment of the Accuracy of the Orientation and Order

We created two test sets to assess the accuracy of the Oct. 5 freeze working draft as assembled by GigAssembler. The first, called FinishedContigs, is a collection of 24 clone contigs with a total of 145 clones taken from chromosomes 7, 12, 14, 17, 20, 21, and 22 for which we have finished sequence spanning the entire clone contig. The number of clones per clone contig varies from 4 to 13. We obtained a draft version for 88 of these clones by looking for a previous version of the finished clone in GenBank. The second test set, called Scrambled22, was generated by Ray Wheeler at Neomorphic, Inc. by taking the twelve finished sequence contigs from chromosome 22 and randomly choosing a tiling path of 233 "synthetic" BACs covering them. The sequence of each synthetic BAC was then "draftified" by the introduction of gaps, indels and substitutions in a way that the statistics on the resulting fragments reasonably matched the statistics from fragments of real draft BACs. Finally, the fragments were given random order and orientation.

We ran the GigAssembler algorithm on all of the clone contigs from both of these test data sets and compared its predicted order and orientation for the fragments to the true order and orientation of the fragments, as can be derived from the finished sequence. We measured the orientation agreement as the fraction of fragments that were oriented correctly. The average orientation agreement for the FinishedContigs test set was 0.90, and varied from 0.50 (near random guessing) to 1.0 (perfect) among the 23 clone contigs. Performance degraded as the number of fragments per draft clone increased and the size of the fragments decreased. On the 12 contigs of the Scrambled22 test set, the average orientation agreement was about 0.87 and varied from 0.71 to 1.0.

To measure the accuracy of the predicted order of the fragments, we counted the number of inversions in the order of the starting positions of the fragments. An inversion occurs when the fragment following fragment A in the predicted order in fact should come before fragment A. For example, if the correct order of the fragments is 1,2,3,4,5,6,7 and the predicted order is 1,5,2,4,7,4,6 then there are 2 inversions, at fragments A=5 and A=7. We measure order agreement as the fraction of fragments in the predicted order where inversions do not occur (excluding the last fragment in the predicted order, which cannot have an inversion). In the above example, the order agreement is 4/6. The average order agreement for the Finished Contigs test set was 0.85, and varied from 0.50 to 1.0 among the 23 clone contigs. On the 12 contigs of the Scrambled22 test set, the average order agreement was 0.83 and varied from 0.74 to 0.93.

Conclusions

GigAssembler could be improved in several ways. A mechanism to detect misassemblies or chimerism in the initial sequence contigs (fragments) that serve as its inputs could be quite helpful, perhaps along the lines of the mechanism developed for CAP2 [Huang 1996]. Improved use of clone end information might lead to better barge construction in Step 2 of the algorithm as well. Both of these improvements would reduce the rate of artifactual duplication in the draft assembly, which occurs when valid overlaps are not detected and thus the same region of the genome is assembled in two different parts of the assembly. It was estimated that 3% of the Oct 7 working draft genome sequence represented artifactual duplication, some due to lack of complete assembly by GigAssembler, and some due to missed overlaps between fingerprint clone contigs [The International Human Genome Sequencing Consortium 2001]. More use could be made of PHRAP quality scores for individual bases during the assembly as well. Finally, it would be helpful to have some methods to impose an upper bound on the total assembled size of a clone, and to eliminate fragments from the assembly entirely when it appears that they don't belong with the clone, or have other serious problems.

The assembly of the working draft of the human genome, while still imperfect, has permitted significant research to go forward, rather than wait years for the finishing of the sequence. In particular, having an assembly has allowed the construction of genome wide gene prediction sets, and the side-by-side comparison of different kinds of genome annotation, including chromosomal band locations, STS positions for genetic, radiation hybrid, YAC-STS and cytogenetic maps, GC content, density of various repeat families, CpG islands, ESTs, mRNAs, SNPs, and both known and predicted genes [The International Human Genome Sequencing Consortium 2001]. These comparisons are possible through the on-line genome browsers at UCSC (http://genome.ucsc.edu/goldenPath/septTracks.html) and ENSEMBL (http://www.ensembl.org), both of which use the GigAssembler assembly. Some of the discoveries that have been made using these and other tools are described in [The International Human Genome Sequencing Consortium 2001, Cheung 2001, Tilford 2001, Bentley 2001, Montgomery 2001, Bruls 2001, Riethman 2001, Wolfsberg 2001, Yu 2001, The International SNP Map Working Group 2001, Futreal 2001, Nestler 2001, Tupler 2001, Fahrer 2001, Li 2001, Bock 2001, Pollard 2001, Murray 2001, Clayton 2001]. However, the web pages at http://genome.ucsc.edu are currently averaging more than 20,000 hits per day, hence we suspect that the bulk of the new discoveries that this work has enabled have yet to be reported.

Acknowledgements

D.H. acknowledges support from DOE Grant F603-99ERG2849, Dean Patrick Mantey, and Chancellor M.R. C. Greenwood for the 100 node computational cluster. Thanks to Alan Zahler for his advice and encouragement. We thank Scot Kennedy, Aaron Tomb, Patrick Gavin, Nick Littlestone and Paul Tatarsky for help testing the sequence and for technical support, and Ann Pace and Maria Guarino for administrative support. We thank Bob Waterston, Eric Lander, Francis Collins, Ewan Birney, Greg Schuler, John Sulston, and the rest of the genome analysis group for their advice and additional support.

References

Altschul, S.F. et al. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25(17): 3389-402.

Anson, E., and Gene Myers. 1999. Algorithms for Whole Genome Shotgun Sequencing. Proc. RECOMB '99, Lyon, France.

Bentley, D.R. et al. 2001. The physical maps for sequencing human chromosomes 1, 6, 9, 10, 13, 20 and X. Nature, to appear.

Bock, J.B., Matern, H.T., Peden, A.A. & Scheller, R.H. 2001. A genomic perspective on membrane compartment organization. Nature, to appear.

Bonfield, J.K., Smith, K.F., and R. Staden. 1995. A new DNA sequence assembly program. Nucleic Acids Research 23(24): 4992-4999.

Bruls, T. et al. 2001. A physical map of human chromosome 14. Nature, to appear.

Cheung, V.G. et al. 2001. From Chromosomal Aberrations to Genes: Linking the Cytogenetic and Sequence Maps of the Human Genome, Nature, to appear.

Clayton, J.D., Kyriacou, C.P. & Reppert, S,M. 2001. Keeping time with the human genome. Nature, to appear.

Cormen, T.H., Leiserson, C.E., and R.L. Rivest. 1990. Introduction to Algorithms. MIT Press, Cambridge, Mass.

Dunham, I. et al. 1999. The DNA sequence of human chromosome 22. Nature 402: 489-495.

Fahrer, A.M., Bazan, J.F., Papathanasiou, P. & Goodnow, C.C. 2001. A genomic view of immunology. Nature, to appear.

Futreal, P.A. et al. 2001. Cancer and genomics. Nature, to appear.

Green, P. Unpublished. PHRAP-sequence-assembly program. (http://www.genome.washington.edu/UWGC/analysistools/phrap.htm)

Hattori, M. et al. 2000. The DNA sequence of human chromosome 21. Nature 405: 311-319.

Huang, W., and A. Madan. 1999. CAP3: A DNA Sequence Assembly Program," Genome Research 9(9): 868-877.

Huang, X. 1996. An Improved Sequence Assembly Program. Genomics 33: 21-31.

The International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature, to appear.

The International Human Genome Mapping Consortium. 2001. A physical map of the human genome. Nature, to appear.

The International SNP Map Working Group. 2001. A map of human genome sequence variation containing 1.4 million single nucleotide polymorphisms. Nature, to appear.

Ji, Y., Eichler, E.E., Schwartz, S, and R.D. Nicholls. 2000. Structure of chromosomal duplicons and their role in mediating human genomic disorders. Genome Research 10: 597-610.

Li, W. H., Gu, Z., Wang, H. & Nekrutnko, A. 2001. Evolutionary analyses of the human genome. Nature, to appear.

Montgomery, K.T. et al. 2001. A high-resolution map of human chromosome 12. Nature, to appear.

Mullikin, J. Unpublished. SASHA. Sanger Center.

Murray, A.W. & Marks, D. 2001. Can sequencing shed light on cycling? Nature, to appear.

Myers, E.W. et al. 2000. A Whole-Genome Assembly of Drosophila. Science 287(9): 868-877.

Nestler, E.J. & Landsman, E. 2001. Learning about addiction from the human draft genome. Nature, to appear.

Papadimitriou, C.H, and K. Steiglitz. 1982. Combinatorial Optimization. Prentice-Hall, Inc., Englewood Cliffs, New Jersy.

Pollard, T.D. 2001. The human genome, the cytoskeleton and motility. Nature, to appear.

Riethman, H.C. et al. 2001. Integration of telomeric sequences with the draft human genome sequence. Nature, to appear.

Sutton, G.G., White, O., Adams, M.D., and A.R. Kerlavage. 1995. TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects. Genome Science & Technology 1(1): 9-19.

Thayer, E.C., Olson, M.V., and R.M. Karp. 1999. Error Checking and Graphical Representation of Multiple-Complete-Digest (MCD) Restriction-Fragment Maps. Genome Research 9(1): 79-90.

Tilford, C.A. et al. 2001. A physical map of the human Y chromosome. Nature, to appear.

Tupler, R., Perini, G. & Green, M.R. 2001. Gene expression and the human genome. Nature, to appear.

Wolfsberg, T.G., Schuler, G.D. & McEntyre, J. 2001. Guide to the draft genome. Nature, to appear.

Yu, A. et al. 2001. Comparison of human genetic and sequence-based physical maps. Nature, to appear.

Zhao, S. et al. 2000. Human BAC ends quality assessment and sequence analyses. Genomics 63: 321-332.

 

Appendix A - GigAssembler Inputs and Outputs

Inputs

The following input files are expected to be in the contig directory before running GigAssembler.

  1. geno.lst - A list of all of the FASTA format files, one file per line, that contain the unassembled sequence for the contigs. The program expects each clone to be in a separate FASTA file, and the file name to be of the format: directory/accession.fa. Within the file the separate sequence fragments making up a clone need to be separated by FASTA headers of form:
    >accession.fragment
    or
    >accession.version_fragment
    For instance the geno.lst file might contain a line: /usr/local/genomes/human/AC00001.fa.
    The file AC000001.fa should have records such as AC000001.8_1, AC000001.8_2, etc. By convention the version number is the Genbank version number for the sequenced clone, in this case 8.
  2. info - this describes where the clones are positioned within the contig according to the map. The first line contains the name of the contig, and a second word which is either 'random,' 'ordered,' or 'placed.' If the second word is 'placed' then GigAssembler does not try to refine the ordering of clones, but builds barges directly from the information in the rest of the file. (This is currently used for the Y chromosome.) Subsequent lines in the info file contain three space separated fields: clone accession, start position in kilobases, and phase number. The phase numbers are 0-3. 0 means unsequenced, 1 means draft in unordered pieces, 2 means draft in ordered pieces, and 3 means finished.
  3. nt - this file is optional. If it is present it describes the finished clones which should be joined together into larger contigs (aka NT_ contigs using the NCBI terminology). This file is tab-delimited with five fields: NT contig name, clone accession, clone start position in NT contig, clone orientation (always + currently) in NT contig, and clone size. All lines refering to the same NT contig are grouped together with a blank line separating NT contigs.
  4. splitFin - this describes finished sequence files that had to be split to remove contaminants, largely because of bacterial transposons. The file is tab-delimited with four fields: fragment name, clone name, fragment start, fragment end.
  5. fragChains - this describes clone fragments that should be linked together according to information provided by the sequencing centers. The file is tab delimited with six fields: fragment name, fragment orientation (always + currently), name of next fragment, minimum distance between center of this fragment and center of next fragment, maximum distance between centers, chain certainty score (higher is better, currently always set to 50).
  6. self.psl - psLayout alignments of clones in the fingerprint clone contig against each other.
  7. mrna.psl - psLayout near best in genome alignments of full length mRNA vs. clones in the fingerprint clone contig.
  8. bacEnd.psl - psLayout near best in genome alignments of all BAC end reads vs. clones in the fingerprint clone contig.
  9. bacEndPairs - file describing which BAC ends should be paired. The format is tab-separated with three fields: read1, read2, BAC name.
  10. est.psl - psLayout near best in genome alignments of all ESTs vs. clones in the fingerprint clone contig.
  11. estPairs - file describing 3'/5' pairs of ESTs. The format is tab-separated with three fields: 5' read, 3' read, clone name.
  12. pairedReads.psl - psLayout near best in genome alignments of plasmid pair reads vs. clones in the fingerprint clone contig.
  13. readPairs - file describes how paired reads should be put together. The format is tab-separated with five fields: forward read, reverse read, plasmid name, minimum distance between reads, maximum distance between reads.

Outputs

Following are the outputs of GigAssembler. Many of the output file names include a numerical suffix which corresponds to the GigAssembler version number, currently 93, in order to facilitate comparisons between versions. Since the version number changes rapidly it is represented as "NN" in the file names below.

  1. The assembled sequence is stored in a FASTA format file named contig.fa, where 'contig' is the name of the contig taken from the 'info' input file above.
  2. gold.NN - this describes how the sequence fragments are combined to form the assembly. The format is tab delimited. There are two types of records. Sequence records have eight fields: contig name, start position (starting with 1) of fragment in assembled contig, end position (including this base) of fragment in assembled contig, record index, sequence type (always 'O'), fragment name, start position within fragment, end position within fragment, orientation of fragment relative to assembly (+ or -). Gap records also have eight fields, the first four of which are the same as sequence records. The last four fields are: type (always 'N'), length in bases, type (either 'fragment' or 'clone'), and bridging information (currently either 'yes' or 'no').
  3. barge.NN - this describes the order of the clones within the contig after rearrangement of the map based on sequence overlap. The first four lines of this file are a header which describes the version number and the fields. Then there is a line for each clone containing six space delimited fields: start base position, clone name, clone size, overlap with next clone, name of clone which overlaps most with the current clone, the number of bases in the maximal overlap. (When the maximal overlapping clone is not a neighbor it may indicate a problematic assembly.) A gap line is inserted when there is no overlap between adjacent clones. This line has the format "----- type gap -----" where type can be either 'open' or 'bridged'. Bridged gaps are traversed by a BAC end pair or by an mRNA alignment.
  4. ocp.NN - this describes the sequence overlap between clones. It is tab-delimited with the following fields: accession1, accession2, overlap in bases.
  5. gl.NN - this describes the placement of all input sequence fragments (not just the ones actually included in the assembled output). It also describes how the fragments are linked together. It is a tab-separated file with five fields: fragment name, start position in assembly, end position in assembly, orientation, scaffold ID, raft ID. The last two fields may be empty. The raft ID field is of the format R plus a number. All fragments with the same raft ID are in the same raft (that is part of a mutually overlapping set of sequences. The scaffold ID field is G followed by a number. Fragments that are part of rafts that are connected by mRNA or other ordering information share the same scaffold ID.
  6. raft.NN - This describes the rafts, that is the sets of overlapping clone fragments. The first two lines are a header which includes the total number of rafts with more than one fragment, and the number of fragments contained in these rafts. Next is a set of lines for each raft with blank lines separating rafts. An example of the first line of a raft is:
    79 raft 97373 bases 29 frags 8741 default pos
    where '79' is the unique id number of the raft, '97373' is the estimated size of the raft (the actual size may vary slightly), '29' is the number of clone fragments in the raft, and '8741' is the approximate position of the raft in the fingerprint clone contig. Following this is one line for each fragment containing four space separated fields: start position, orientation, fragment name, fragment size.
  7. graph.NN - This describes how rafts are linked together by mRNA and other information. The rafts form nodes in a directed graph which contains minimum and maximum distances allowed on each node. The first two lines are a header describing the file. Then there is a group of lines for each node. The first line for a node contains the ID of the node, which is either the start or end of a clone or a raft ID as defined in the raft.NN file. Subsequent lines for the same node are indented and represent the various edges going out of a node. Here's an example of such a line:
    180 min 188786, max 344209, AQ570819&AQ570641(bacEndPair),M76446(mRNA),
    180 is the ID of the destination node, which is constrained to be between 188786 and 344209 bases (measured from center to center) from the current node. These constraints are based on the BAC end pair with the accessions AQ570817 and AQ570641 and also on the mRNA with accession M76446.
  8. contaminated.NN - clones which the program suspects, due to inconsistencies in the patterns of overlap with other clones, may be due to sequencing projects containing chimeric clones or mixtures of multiple clones are described here. This is a tab delimited file with two fields: accession and type of contamination suspected. This file is only produced when needed.
  9. fragMap.NN - this describes the 'default' location of each sequence fragment in the assembly. It's tab-delimited with the following three fields: fragment name, default position, size. The default locations are taken from the order of clones in the fingerprint clone contig, and the order of fragments within the clones. If the program is not able to find any other linking information to order the fragments, it will revert to this default order.
  10. ooGreedy.log - this contains diagnostic output useful mostly to W.J. Kent. ('ooGreedy' is an older name for the GigAssembler program).
  11. ooGreedy.NN.gl - This file is the same as gl.NN, except it only contains the first three fields.

Appendix B - Alignment Scoring Functions

Scoring for Clone Fragment Overlap in Raft Building:

int fragOverlapScore(struct psl *alignment)
/* Return score from roughly 500 to -500 for fragment/fragment
  overlap based on alignment. */
{
int milliBad;   /* Misalignments in rough parts per thousand. */
int startTail;  /* Size of starting tail. */
int endTail;    /* Size of ending tail. */

/* Calculate basic score. */
milliBad = calcMilliBad(alignment);
findTails(alignment, &startTail, &endTail);
score = -30*milliBad - (startTail+endTail)/2 +
        round(25*log(alignment->match+1) + log(alignment->repMatch+1));

/* Add some penalties for not having many unique matches or being 
  part of a small fragment. */
if (!enclosingPsl(alignment))
    {
    if (alignment->match < 20)
        score -= (20-alignment->match)*25
    if (alignment->qSize < 5000)
        score -=  (5000-alignment->qSize)/40;
    }
return score;
}

mRNA Scoring Function

int scoreMrnaPsl(struct psl *ali, boolean isEst)
/* Return score for one mRNA oriented psl. */
{
int milliBad;
int score;

milliBad = calcMilliBad(ali, TRUE);
score = 25*log(1+ali->match) + log(1+ali->repMatch) - 10*milliBad + 10;
if (ali->match <= 10)
    score -= (10-ali->match)*25;
if (isEst)
    score -= 25;
else
    score += 25;
return score;
}

BAC End Pair/Plasmid Pair Scoring Function

int scorePairedPsl(struct psl *ali, int mapDistance, boolean isBac)
/* Return score for one BAC end oriented psl. */
{
int milliBad;
int score;
int startTail, endTail;

milliBad = calcMilliBad(ali, FALSE);
findTails(ali, &startTail, &endTail);
score = 25*log(1+ali->match) 
    + log(1+ali->repMatch) - 20*milliBad - 4*(startTail+endTail);
if (ali->match <= 20)
    score -= (20-ali->match)*25;
if (mapDistance > 200000)
    score -= (mapDistance-200000)/1500;
if (isBac)
    score -= 50;
return score;
}

 

Functions used by other scoring functions:

int calcMilliBad(struct psl *ali, boolean isMrna)
/* Return misaligning ratio in roughly parts per thousand. */
{
int qAliSize, tAliSize, aliSize;
int milliBad;
int sizeDif;
int insertFactor;

qAliSize = ali->qEnd - ali->qStart;
tAliSize = ali->tEnd - ali->tStart;
aliSize = min(qAliSize, tAliSize);
if (aliSize <= 0)
    return 0;
sizeDif = qAliSize - tAliSize;
if (sizeDif < 0)
    {
    if (isMrna)
        sizeDif = 0;
    else
        sizeDif = -sizeDif;
    }
insertFactor = ali->qNumInsert;
if (!isMrna)
    insertFactor += ali->tNumInsert;
milliBad = 
    (1000 * (ali->misMatch + insertFactor + round(3*log(1+sizeDif)))) / aliSize;
return milliBad;
}

boolean enclosingPsl(struct psl *ali)
/* Return TRUE if one fragment is entirely enclosed by another. */
{
if (ali->tStart <= 300 && ali->tEnd + 300 >= ali->tSize)
    return TRUE;
if (ali->qStart <= 300 && ali->qEnd + 300 >= ali->qSize)
    return TRUE;
return FALSE;
}

void findTails(struct psl *ali, int *retStartTail, int *retEndTail)
/* Find the length of "tails" (rather than extensions) implied by ali. */
{
int orientation = ali->orientation;

int qFloppyStart, qFloppyEnd;
int tFloppyStart, tFloppyEnd;

if (orientation > 0)
    {
    qFloppyStart = ali->qStart;
    qFloppyEnd = ali->qSize - ali->qEnd;
    }
else
    {
    qFloppyStart = ali->qSize - ali->qEnd;
    qFloppyEnd = ali->qStart;
    }
tFloppyStart = ali->tStart;
tFloppyEnd = ali->tSize - ali->tEnd;
*retStartTail = min(qFloppyStart, tFloppyStart);
*retEndTail = min(qFloppyEnd, tFloppyEnd);
}