Genome Browser Database

The Genome Browser Database

Overview and Design Considerations
Depositing Data
Browser Extensible Data (BED) Format
Accessing Data
Table Formats

Overview and Design Considerations

The Genome Browser Database (GBD) stores a variety of information in a SQL database (MySQL) using a primarily relational approach. To speed and simplify the storage and retrieval of arrays and multiple-level data structures, GBD provides a mechanism for arrays and structures to be buried within a blob field. The database is used intensively by the draft human genome browser. It will also become a central repository of annotation data for the mouse analysis group. The database can be queried directly in SQL, via the Perl DBI interface, or through MySQL's native C interface.

There are two primary classes of objects stored in the database: objects that can be displayed on the tracks display, and other objects. A track corresponds logically to a table in the database which includes fields specifying a chromosome and a start and end location within that chromosome. Because index performance depends on the table size, very large tables such as those describing repeats or EST alignments are divided into a separate table for each chromosome. In addition to the chromosome, start, and end fields, track objects are required to have a name field, which is displayed to the left of the object itself when the track is fully unfolded. This name can also be used to associate the parts of the object stored in the track table with other parts of the same object stored in secondary tables.
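As a rough illustration of the track table layout described above, here is a minimal sketch of a table with the four required fields and a query for the items overlapping a display window. It uses Python's built-in sqlite3 module as a stand-in for MySQL. The table name exampleTrack, the index name, the rows, and the treatment of start and end as a half-open interval are all assumptions made for the example, not the actual GBD schema.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; exampleTrack is a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exampleTrack ("
    " chrom TEXT NOT NULL,"
    " chromStart INTEGER NOT NULL,"
    " chromEnd INTEGER NOT NULL,"
    " name TEXT NOT NULL)"
)
# Index on position so window queries do not scan the whole table.
conn.execute("CREATE INDEX exampleTrackPos ON exampleTrack (chrom, chromStart)")
conn.executemany(
    "INSERT INTO exampleTrack VALUES (?, ?, ?, ?)",
    [
        ("chr1", 1000, 2000, "itemA"),
        ("chr1", 5000, 6000, "itemB"),
        ("chr2", 1500, 2500, "itemC"),
    ],
)


def items_in_window(conn, chrom, win_start, win_end):
    """Return names of items overlapping [win_start, win_end) on chrom."""
    rows = conn.execute(
        "SELECT name FROM exampleTrack "
        "WHERE chrom = ? AND chromStart < ? AND chromEnd > ? "
        "ORDER BY chromStart",
        (chrom, win_end, win_start),
    )
    return [name for (name,) in rows]


visible = items_in_window(conn, "chr1", 1500, 5500)
```

The name column is what a secondary table would reference to attach further details to the same object.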

Depositing Data

GBD accepts data in four fixed formats and in an extensible format which will be described below. The fixed formats are RepeatMasker .out files, Gene Transfer Format (.gtf) files, Accession Golden Path (.agp) files, and Pattern Space Layout (.psl) files. To submit data to GBD please put the relevant files where they can be reached by FTP or HTTP, or if they compress to smaller than 100K send them as an email attachment. Next, send email to me, jim_kent@pacbell.net, including the following information:

Location of files
Whether the data should be restricted to the mouse analysis group or if it can be made public. (I very much encourage public data.)
Whether the data is meant to be displayed on the browser.

If the data is destined for the browser please include the following additional information:

A short (no more than 15 character) description of the track.
A longer (up to 60 character) description of the track.
Whether the "score" field should direct the level of gray in the display.
An HTML format description of the track, which should be between a paragraph and a page long.

Browser Extensible Data (BED) Format

Annotations that don't fit into the .out, .gtf, .agp, or .psl format should be submitted in a variation of the browser extensible data (BED) format. A BED format submission consists of a main file with fields separated by tabs and records separated by newlines. It is usually very easy to generate such a file from a relational database, or directly from an annotation program. Each line in the file must be of the same format. Lines beginning with the # character are treated as comments. A BED submission also includes a separate file which describes the fields. Optionally a BED submission can include a third file which includes an HTML description of each item. BED submissions which are meant to be displayed in the browser are required to have four initial fields, and are strongly encouraged to use up to eight additional standardized fields when relevant. The fields of BED submissions not displayed in the browser are not restricted.
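To make the layout of the main file concrete, here is a minimal reading sketch. It assumes one record per line, with the four initial fields named chrom, chromStart, chromEnd, and name as in the AutoSql sample later in this section. The function name, the handling of blank lines, and the sample data are hypothetical.

```python
import io

REQUIRED_FIELDS = ("chrom", "chromStart", "chromEnd", "name")


def read_bed_records(handle):
    """Yield one dict per record from a tab-separated main file."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        # Comments start with '#'. Skipping blank lines is an assumption.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < len(REQUIRED_FIELDS):
            raise ValueError(
                f"line {line_number}: expected at least "
                f"{len(REQUIRED_FIELDS)} fields, found {len(fields)}"
            )
        record = dict(zip(REQUIRED_FIELDS, fields))
        record["chromStart"] = int(record["chromStart"])
        record["chromEnd"] = int(record["chromEnd"])
        # Any standard or custom fields are kept as raw strings.
        record["extra"] = fields[len(REQUIRED_FIELDS):]
        yield record


# Hypothetical sample data: a comment line and two records.
sample = io.StringIO(
    "# example data\n"
    "chr1\t0\t1000\treadA\n"
    "chr2\t500\t900\treadB\t1000\t+\n"
)
records = list(read_bed_records(sample))
```

The sample mixes record lengths only to show how extra fields are carried along. A real submission must use the same format on every line.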

The file which describes fields is in AutoSql (.as) format. Here is a sample of the format which describes a type which includes all four required and eight standard fields, and defines two custom fields as well:

table primateAlignment
"Describes an alignment between the human genome and a primate single read"
   (
   string chrom;      "Human chromosome or FPC contig"
   uint   chromStart; "Start position in chromosome"
   uint   chromEnd;   "End position in chromosome"
   string name;       "Name of read - up to 15 characters"

   uint   score;      "Score from 0-1000.  1000 is best"
   char[1] strand;    "Value should be + or -"
   uint   reserved1;  "Reserved - always 0 currently"
   uint   reserved2;  "Reserved - always 0 currently"
   uint   reserved3;  "Reserved - always 0 currently"
   uint   blockCount; "Number of separate blocks (regions without gaps)"
   uint[blockCount] blockSizes;  "Comma separated list of block sizes"
   uint[blockCount] chromStarts;  "Start position of each block relative to chromStart"

   string primate;    "Genus species name of primate"
   string center;     "Name of sequencing center that made this read"
   )

In the required fields, 'chrom' must be either "chrN" where N is a particular chromosome, or "ctgN" where N is a particular FPC contig. Files submitted in contig coordinates will be converted to chromosome coordinates as they are loaded into the database. The start and end coordinates define a "half open zero based interval". That is, to describe a range covering the first 1000 bases of a sequence, the start would be 0 and the end 1000. (You can convert from this format to the one-based, closed format used by GFF and Ensembl simply by adding one to the start coordinate; the end coordinate is unchanged.)
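The coordinate convention is easy to get wrong by one, so a small sketch of the conversion may help. In a one-based, fully closed convention the first 1000 bases are numbered 1 through 1000, while the half open zero based interval for the same range is start 0, end 1000. The function names here are hypothetical.

```python
def to_one_based_closed(start, end):
    """Convert a zero-based half-open interval to one-based closed."""
    return start + 1, end


def to_zero_based_half_open(start, end):
    """Convert a one-based closed interval to zero-based half-open."""
    return start - 1, end


def interval_length(start, end):
    """Length of a zero-based half-open interval is simply end - start."""
    return end - start


first_kilobase = to_one_based_closed(0, 1000)
```

Only the start changes in either direction. The end coordinate has the same value in both conventions.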

Note that the required and standard fields must be in the above order with the above names. Feel free to use only the first few standard fields. If you use later standardized fields, please include the earlier standard fields as well, even if you have to set all of the score fields to 1000 because score is irrelevant to your object.
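The blockSizes and chromStarts fields in the sample above are comma separated lists whose entries pair up, with each block start given relative to chromStart. The following sketch turns them into absolute half-open intervals on the chromosome. The function names are hypothetical, and tolerating a trailing comma in the lists is an assumption.

```python
def parse_uint_list(text):
    """Parse a comma separated list of unsigned integers.

    Empty items are dropped so that a trailing comma is tolerated
    (an assumption about how such lists may be written).
    """
    return [int(item) for item in text.split(",") if item]


def block_intervals(chrom_start, block_sizes, chrom_starts):
    """Return absolute (start, end) pairs, one per block."""
    sizes = parse_uint_list(block_sizes)
    offsets = parse_uint_list(chrom_starts)
    if len(sizes) != len(offsets):
        raise ValueError("blockSizes and chromStarts differ in length")
    return [
        (chrom_start + offset, chrom_start + offset + size)
        for size, offset in zip(sizes, offsets)
    ]


# Two blocks of 100 and 50 bases in an item that starts at 1000.
blocks = block_intervals(1000, "100,50,", "0,400,")
```

The number of pairs returned should equal the record's blockCount field, which gives a loader a simple consistency check.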

As noted before the field description file is in AutoSql .as format. AutoSql is a code generator that generates SQL table creation statements, C structures, and C code to translate between database and C formats. You can download AutoSql from my executable directory or build it yourself from my source directory. Documentation is available in the doc subdirectory of the executable directory. You don't need to use AutoSql to make a BED submission, though people working in C in particular may find it useful. Though AutoSql can generate code that embeds composite objects in a single field, I encourage everyone as much as possible to use simple separate fields.

BED submissions for the tracks display may also include an additional file to provide specific information for each item when the user clicks on an item. This file is tab delimited with two fields. The first field is the name of the item, which should match the name field in the main tab file. The second field is HTML format text with the limitation that it can contain no tabs, quotes, or newlines. (Plain text lacking tabs, quotes, and newlines can also be placed here.)
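A submitter could check the per-item file before sending it. The sketch below is one way to do so and is not part of GBD: the function name is hypothetical, and treating both single and double quotes as forbidden is a deliberately strict reading of "quotes". A tab inside the HTML text shows up as a third field, and a newline would split the record across lines, so both are caught by the field count.

```python
def check_item_line(line, known_names):
    """Return a list of problems found in one line of the per-item file."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 2:
        return [f"expected 2 tab-separated fields, found {len(fields)}"]
    name, html = fields
    problems = []
    if name not in known_names:
        problems.append(f"name {name!r} not found in the main file")
    # Assumption: both quote characters are rejected.
    if '"' in html or "'" in html:
        problems.append("description contains a quote character")
    return problems


# Names would normally be collected from the main tab file.
names = {"readA", "readB"}
ok = check_item_line("readA\t<B>Read A</B> from chimp\n", names)
bad_quote = check_item_line('readB\t<A HREF="x.html">link</A>\n', names)
bad_tab = check_item_line("readA\tsome\ttext\n", names)
```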

Accessing Data

GBD is a MySQL database which can be accessed in SQL from the mysql program, in Perl via the DBI interface, or by MySQL's native C interface. You'll need the host name, user name, and password as well, which you will receive by email. There is a database for each 'freeze', named hg#. The April 2001 freeze is hg7, the previous freeze hg6, the one before that hg5, going back to hg3, the first freeze on which a browser was built at UCSC. These databases are read only outside of UCSC.
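As a small illustration of the naming scheme, the sketch below builds the database name for a freeze and the argument list for starting the mysql command-line client against it. The function names and the host and user values are placeholders, since the real ones arrive by email. The --password option is given without a value so that the client prompts for the password.

```python
FIRST_FREEZE = 3  # hg3 is the first freeze with a browser database


def freeze_database(freeze_number):
    """Return the database name for a freeze, e.g. 7 -> 'hg7'."""
    if freeze_number < FIRST_FREEZE:
        raise ValueError(f"no browser database before hg{FIRST_FREEZE}")
    return f"hg{freeze_number}"


def mysql_command(host, user, freeze_number):
    """Build the argument list for the mysql command-line client."""
    return [
        "mysql",
        f"--host={host}",
        f"--user={user}",
        "--password",  # no value given: the client prompts for it
        freeze_database(freeze_number),
    ]


# Placeholder host and user names, not real credentials.
command = mysql_command("db.example.org", "guest", 7)
```

The list can be passed to subprocess.run, or joined with spaces to type at a shell prompt.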

Table Formats

A description of the format of each table can be found at http://genome.cse.ucsc.edu/goldenPath/gbdDescriptions.html. This document is not always completely up to date. If a table is not described there, please look at the .as files in http://www.cse.ucsc.edu/~kent/src/unzipped/hg/lib/, which are the AutoSql sources used by the browser.

For additional information please contact jim_kent@pacbell.net.