[Genome] New compression Routine fits Human Genome on 1 CD
John Sullivan
sullivan at ireland.com
Fri Jul 14 07:13:26 PDT 2000
Hello all,
We have written a compression routine that can compress the large
.fa file by about 75%. Using this routine, we have been able to fit
the human genome on one CD.
It works by taking four bytes of DNA and encoding them as one byte,
using two bits per base: A->00, C->01, T->10, G->11. It also handles
the arbitrary runs of N's that can occur in the .fa files.
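The two-bit packing described above can be sketched roughly as follows. Only the A/C/T/G bit mapping comes from this post. The function names, the (start, length) list used for runs of N, and the assumption of uppercase input with no FASTA header or line breaks are illustrative guesses, not the actual DnaC format.

```python
# Bit codes as given in the post: A->00, C->01, T->10, G->11.
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}
BASE = "ACTG"  # index == 2-bit code


def pack(seq):
    """Pack a sequence into bytes, four bases per byte.

    Returns (packed_bytes, n_runs, n_bases). n_runs is a list of
    (start, length) pairs for runs of N (a hypothetical side table).
    """
    n_runs = []
    codes = []
    i = 0
    while i < len(seq):
        if seq[i] == "N":
            j = i
            while j < len(seq) and seq[j] == "N":
                j += 1
            n_runs.append((i, j - i))  # remember where the run was
            i = j
        else:
            codes.append(CODE[seq[i]])
            i += 1
    out = bytearray()
    for k in range(0, len(codes), 4):
        chunk = codes[k:k + 4]
        chunk += [0] * (4 - len(chunk))  # pad the final byte with zeros
        out.append(chunk[0] << 6 | chunk[1] << 4 | chunk[2] << 2 | chunk[3])
    return bytes(out), n_runs, len(codes)


def unpack(packed, n_runs, n_bases):
    """Inverse of pack(): rebuild the original sequence."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    bases = bases[:n_bases]  # drop padding from the final byte
    out = []
    pos = 0  # next unread index in bases
    cur = 0  # current position in the rebuilt sequence
    for start, length in n_runs:  # runs are stored in ascending order
        take = start - cur
        out.extend(bases[pos:pos + take])
        pos += take
        out.append("N" * length)
        cur = start + length
    out.extend(bases[pos:])
    return "".join(out)
```

For example, pack("ACNNNTG") stores the four real bases in a single byte and records one run of three N's starting at position 2; unpack() restores the original string from those three values.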
We analysed different methods of trying to compress a .fa file.
The file that we tested was ctg17711.fa. DnaC is the compression
routine we wrote.
Method          Size (bytes)  Notes
Original          29,433,393
WinZip             8,163,238
GZip               7,851,192
Bzip               7,130,597
DnaC               6,428,073
DnaC (Zipped)      6,050,537  compressed with DnaC, then with WinZip
DnaC (GZipped)     6,049,666  compressed with DnaC, then with GZip
DnaC (Bzipped)     5,969,287  compressed with DnaC, then with Bzip
My conclusions are as follows.
1. Using bzip alone reduces the size of the download for the human
genome by 12% compared to the zipped version.
2. Using DnaC reduces the size of the download for the human genome
by 20% compared with the zipped version.
3. Compressing the output of DnaC achieves another 8-10% in
compression compared to using WinZip.
4. We achieved a 200 MB reduction in the size of contigFa.zip. This
makes it fit on one CD, with over 50 MB to spare on the CD.
(We ran DnaC on every file, and then WinZipped the result.)
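The savings can be recomputed from the sizes in the table with a few lines of Python. This is only a sketch: the dictionary keys and the saving() helper are illustrative names, and the figures apply to the single test file ctg17711.fa, not to the whole genome.

```python
# Sizes in bytes, copied from the table for ctg17711.fa.
sizes = {
    "Original": 29_433_393,
    "WinZip": 8_163_238,
    "GZip": 7_851_192,
    "Bzip": 7_130_597,
    "DnaC": 6_428_073,
    "DnaC+WinZip": 6_050_537,
    "DnaC+GZip": 6_049_666,
    "DnaC+Bzip": 5_969_287,
}


def saving(method, baseline):
    """Percentage by which `method` is smaller than `baseline`."""
    return 100.0 * (1 - sizes[method] / sizes[baseline])


for name in sizes:
    print(
        f"{name:12s} vs original: {saving(name, 'Original'):5.1f}%"
        f"   vs WinZip: {saving(name, 'WinZip'):5.1f}%"
    )
```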
Our own background is as follows.
My name is John Sullivan. In 1994, I won the Irish informatics
olympiad, and in 1997, I was 16th in the ACM programming contest. I
have also won the Irish maths intervarsities, and the Irish maths
Olympiad. I have a BSc Honors in Maths and Physics from UCC.
Tim Chang has a BS in Biology with a concentration in Genetics from
the University of Massachusetts at Amherst.
If anyone is interested in a copy of the CD, we will FedEx it out for
free. We are interested in the distribution of such a CD.
Yours Sincerely
John Sullivan and Tim Chang