[Genome] questions for knowngene to refseq
Fan Hsu
fanhsu at soe.ucsc.edu
Wed Nov 1 08:57:37 PST 2006
Hi MengRu,
Although the majority of UCSC Known Genes (KG) are identical to RefSeq
genes, there are some significant differences:
1. Not every RefSeq is a KG. Some RefSeqs were filtered out because
they did not pass our gene-check processing step (e.g. RefSeqs with
no start or stop codon, or with bad reading frames).
2. If there is a UniProt protein which maps well to a GenBank mRNA,
and it passes the gene-check filter and there is no equal or better
corresponding RefSeq, the mRNA/UniProt pair will be added to the
KG data set.
3. UCSC KG is updated once every few months. Our RefSeq track
is updated nightly. So the refGene table may contain some
recent RefSeq updates that came after the last KG build.
The most accurate cross-reference between KG and RefSeq can be found in
the kgXref table.
The knownToRefSeq table is constructed to support the UCSC Gene Sorter,
which is based on a canonical set of gene clusters. Each
gene cluster may consist of several overlapping KGs, with a single
representative KG (which could be a RefSeq or an mRNA). If a RefSeq
overlaps with a gene cluster region, it is added to the
knownToRefSeq table. So there will be situations in which a RefSeq
did not make it into UCSC KG and yet shows up in the
knownToRefSeq table.
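As a rough illustration of that last point, the sketch below lists RefSeq accessions that appear in knownToRefSeq but not in kgXref. It assumes tab-separated table dumps, that the RefSeq accession is the sixth column of kgXref, and that knownToRefSeq has two columns (KG ID, then RefSeq accession). The column positions, the function names, and the demo rows are assumptions made for illustration, so check them against the schema of the tables you downloaded.

```python
import io

def read_rows(handle):
    """Yield the tab-separated fields of each non-empty line."""
    for line in handle:
        line = line.rstrip("\n")
        if line:
            yield line.split("\t")

def kgxref_refseqs(handle, refseq_col=5):
    """Collect non-empty RefSeq accessions from a kgXref dump.

    refseq_col=5 is an ASSUMPTION about the column order; adjust as needed.
    """
    return {row[refseq_col] for row in read_rows(handle)
            if len(row) > refseq_col and row[refseq_col]}

def known_to_refseq_refseqs(handle):
    """Collect RefSeq accessions from a knownToRefSeq dump.

    ASSUMPTION: two columns, KG ID first and RefSeq accession second.
    """
    return {row[1] for row in read_rows(handle) if len(row) > 1 and row[1]}

def only_in_known_to_refseq(kgxref_handle, k2r_handle):
    """RefSeqs linked to a gene cluster that have no kgXref entry."""
    return known_to_refseq_refseqs(k2r_handle) - kgxref_refseqs(kgxref_handle)

# Made-up demo rows (not real UCSC data).
KGXREF_DEMO = (
    "kgA\tmrnaA\t\t\tGENE1\tNM_000001\t\tdesc\n"
    "kgB\tmrnaB\tP1\tP1_HUMAN\tGENE2\t\t\tdesc\n"
)
K2R_DEMO = "kgA\tNM_000001\nkgB\tNM_000002\n"

extra = only_in_known_to_refseq(io.StringIO(KGXREF_DEMO), io.StringIO(K2R_DEMO))
print(sorted(extra))
```

With real downloads, pass open file handles for the two table dumps instead of the in-memory demo strings.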
The paper "The UCSC Known Genes", Bioinformatics 22(9), 1036-46 (2006),
by Hsu, F., Kent, W.J., Clawson, H., Kuhn, R.M., Diekhans, M., and Haussler, D.,
describes our KG I process. We have substantially revised and improved
our KG build process, and the new process is called KG II. A description
of the KG II process (attached below for your convenience) can be found
at:
http://genome.ucsc.edu/cgi-bin/hgGene?hgg_do_kgMethod=1&hgg_type=knownGene
For KG II, there is no longer a strict one-to-one relationship between the
representative mRNA and protein. We noticed that this has created problems
in certain situations, and we are considering going back to a one-to-one
relationship for our future KG III process.
If you have any further questions about KG, feel free to ask.
Fan.
BTW, I was born and grew up in Taiwan. It is nice to hear from someone
from my homeland. :-)
Methods
This release of UCSC Known Genes was built by a new process, KG II, as
described below.
UniProt protein sequences (including alternative splicing isoforms) and mRNA
sequences from RefSeq and GenBank were aligned against the base genome using
BLAT. RefSeq alignments having a base identity level within 0.1% of the best
and at least 96% base identity with the genomic sequence were kept. GenBank
mRNA alignments having a base identity level within 0.2% of the best and at
least 97% base identity with the genomic sequence were kept. Protein
alignments having a base identity level within 0.2% of the best and at least
80% base identity with the genomic sequence were kept.
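The three filters above share one shape: an identity floor plus a "near the best" window. A minimal sketch of that rule follows. It assumes that "within 0.1% of the best" means percentage points below the best identity seen for the same query sequence, and that each alignment is reduced to a (query, identity) pair. The function name, the THRESHOLDS table layout, and the demo hits are hypothetical; this is not the code used in the KG II build.

```python
# (window below best, minimum identity), both in percent, taken from the text above.
THRESHOLDS = {
    "refseq": (0.1, 96.0),
    "genbank_mrna": (0.2, 97.0),
    "protein": (0.2, 80.0),
}

def keep_near_best(alignments, window, floor):
    """Filter (query, identity_percent) pairs.

    An alignment is kept when its identity is at least `floor` and no more
    than `window` percentage points below the best identity for that query.
    ASSUMPTION: "best" is taken per query sequence.
    """
    best = {}
    for query, identity in alignments:
        best[query] = max(best.get(query, 0.0), identity)
    return [(query, identity) for query, identity in alignments
            if identity >= floor and best[query] - identity <= window]

# Made-up alignment identities for two hypothetical RefSeq queries.
hits = [("NM_1", 99.9), ("NM_1", 99.85), ("NM_1", 99.0), ("NM_2", 95.0)]
kept = keep_near_best(hits, *THRESHOLDS["refseq"])
print(kept)
```

Here the third NM_1 hit falls outside the 0.1 window and the NM_2 hit falls below the 96% floor, so only the first two hits remain.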
Then the genomic mRNA and protein alignments were compared, and protein-mRNA
pairings were determined from their overlaps. mRNA CDS data were obtained
from RefSeq and GenBank data and supplemented by CDS structures derived from
UCSC protein-mRNA BLAT alignments. The initial set of UCSC Known Genes
candidates consists of all protein-mRNA pairs with valid mRNA CDS
structures. A gene-check program (similar to the one used for the Consensus
CDS (CCDS) project) is used to remove questionable candidates, such as those
with in-frame stop codons, missing start or stop codons, etc.
From each group of gene candidates that share the same CDS structure, the
protein-mRNA pair having the best ranking and protein-mRNA alignment score
is selected as a UCSC Known Gene. The ranking of a gene candidate depends on
its gene-check quality measures. When all else is equal, a preference is
given to RefSeq mRNAs and next to MGC mRNAs. Similarly a preference is given
to gene candidates represented by Swiss-Prot proteins. The protein-mRNA
alignment score is calculated based on protein to mRNA alignment using
TBLASTN, plus weighted sub-scores according to the date and length of the
mRNA.
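The selection step can be pictured as "group by CDS structure, keep the best pair per group". The sketch below is only an illustration under stated assumptions: the text does not say how the ranking and the alignment score are combined, so the code assumes a lexicographic comparison (ranking first, score as the tie-breaker), that a lower rank value is better, and that a higher score is better. The Candidate fields and all demo values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    mrna: str      # mRNA accession
    protein: str   # protein accession
    cds: tuple     # CDS structure, e.g. a tuple of (start, end) blocks
    rank: int      # ASSUMPTION: lower is better (from gene-check quality)
    score: float   # ASSUMPTION: higher is better (protein-mRNA alignment)

def select_known_genes(candidates):
    """Keep one candidate per distinct CDS structure.

    ASSUMPTION: candidates are compared by rank first and by score second.
    """
    best = {}
    for cand in candidates:
        current = best.get(cand.cds)
        if current is None or (cand.rank, -cand.score) < (current.rank, -current.score):
            best[cand.cds] = cand
    return list(best.values())

# Made-up candidates: the first two share a CDS structure.
cds_a = ((100, 200), (300, 400))
cds_b = ((500, 650),)
demo = [
    Candidate("mrna1", "prot1", cds_a, rank=2, score=900.0),
    Candidate("mrna2", "prot1", cds_a, rank=1, score=850.0),
    Candidate("mrna3", "prot2", cds_b, rank=3, score=700.0),
]
chosen = select_known_genes(demo)
print([c.mrna for c in chosen])
```

The RefSeq, MGC, and Swiss-Prot preferences mentioned above would be folded into the rank value before this step; how that is done is not specified in the text, so it is left out here.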
-----Original Message-----
From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu] On
Behalf Of mirrian at iis.sinica.edu.tw
Sent: Wednesday, November 01, 2006 2:40 AM
To: genome at soe.ucsc.edu; genome-mirror at soe.ucsc.edu
Subject: [Genome] questions for knowngene to refseq
Dear Sir,
I am trying to link these two kinds of data, refseq downloaded from
NCBI and known gene from UCSC. I have downloaded these two tables,
kgXref and knownToRefSeq. I found that these two tables are
different, but both contain the knowngene info and refseq info. For
example, kgXref contains 32750 records related to refseq from
NCBI homo build 36.1, while knownToRefSeq contains 33961 records
related to refseq from NCBI homo build 36.1. I'm wondering which one
is more accurate and what causes this difference.
Furthermore, according to "The UCSC Known Genes", published on
Feb. 24, 2006 (Genome analysis), if an mRNA has multiple proteins,
the best is chosen in the order PDB, Swiss-Prot, TrEMBL. On the
other hand, if one protein has multiple mRNAs, the best is chosen in favor
of the longer and newer one with fewer mismatches. Does that mean the Known
Genes DB only contains one-to-one relationships between protein
and mRNA? However, looking at the knownToRefSeq table, the
relationship between known gene and refseq is not one to one. About
22747 records show that one refseq has multiple known genes. Would
you mind telling me what causes this?
Thanks for your help, and look forward to your response.
Best Regards,
MengRu