[Genome] New compression Routine fits Human Genome on 1 CD

John Sullivan sullivan at ireland.com
Fri Jul 14 07:13:26 PDT 2000


Hello all,

     We have written a compression routine that can compress the large 
.fa file by about 75%. Using this routine, we have been able to fit 
the human genome on one CD.

     It works by taking four bytes of DNA (one base per byte) and 
encoding them as one byte, using two bits per base: A->00, C->01, 
T->10, G->11. It also handles the arbitrary runs of N's that can 
occur in the .fa files.
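
     The packing step described above can be sketched roughly as 
follows. This is only an illustration in Python, not the DnaC source. 
The bit codes are the ones given in this post; the function names 
(pack, unpack) and the way runs of N are kept in a separate list of 
(start, length) pairs are assumptions made for the example, since the 
post does not say how DnaC stores them. Header lines, line breaks and 
lower-case bases in a real .fa file are ignored here.

```python
# Two-bit codes as given in the post: A->00, C->01, T->10, G->11.
CODE = {"A": 0, "C": 1, "T": 2, "G": 3}
BASES = "ACTG"  # index = two-bit code


def pack(seq):
    """Pack a string of A/C/G/T/N into bytes, four bases per byte.

    Returns (packed_bytes, base_count, n_runs). Storing runs of N as
    (start, length) pairs is an assumption, not DnaC's known format.
    """
    n_runs = []  # (start, length) of each run of N in the input
    codes = []   # two-bit codes of the non-N bases, in order
    i = 0
    while i < len(seq):
        if seq[i] == "N":
            j = i
            while j < len(seq) and seq[j] == "N":
                j += 1
            n_runs.append((i, j - i))
            i = j
        else:
            codes.append(CODE[seq[i]])
            i += 1

    out = bytearray()
    for k in range(0, len(codes), 4):
        chunk = codes[k:k + 4]
        byte = 0
        for c in chunk:
            byte = (byte << 2) | c
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zeros
        out.append(byte)
    return bytes(out), len(codes), n_runs


def unpack(packed, base_count, n_runs):
    """Reverse pack(): expand the bytes and put the runs of N back."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    bases = bases[:base_count]  # drop the padding in the last byte

    parts = []
    out_len = 0  # length of the sequence rebuilt so far
    taken = 0    # number of unpacked bases already used
    for start, length in n_runs:
        need = start - out_len
        parts.append("".join(bases[taken:taken + need]))
        taken += need
        parts.append("N" * length)
        out_len = start + length
    parts.append("".join(bases[taken:]))
    return "".join(parts)


example = "ACGTNNNNACG"
packed, count, runs = pack(example)
restored = unpack(packed, count, runs)
```

     In this sketch the seven real bases of the example fit in two 
bytes, and the run of four N's costs only one (start, length) entry, 
which is why long gaps of N need not hurt the compression much.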

     We compared different methods of compressing a .fa file.
The file that we tested was ctg17711.fa. DnaC is the compression 
routine we wrote. Sizes are in bytes.
Original Size 29,433,393
Winzip        8,163,238
GZip          7,851,192
Bzip          7,130,597
DnaC          6,428,073
DnaC(Zipped)  6,050,537 File was compressed with DnaC, then with winzip
DnaC(GZipped) 6,049,666 File was compressed with DnaC, then with gzip
DnaC(Bzipped) 5,969,287 File was compressed with DnaC, then with bzip
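
     The sizes in the table above can be turned into percentage 
savings with a few lines of Python. This is just a convenience for 
checking the arithmetic; the numbers are copied from the table, and 
the helper name saving() is made up for the example.

```python
# File sizes in bytes for ctg17711.fa, copied from the table in this post.
sizes = {
    "Original": 29433393,
    "Winzip": 8163238,
    "GZip": 7851192,
    "Bzip": 7130597,
    "DnaC": 6428073,
    "DnaC(Zipped)": 6050537,
    "DnaC(GZipped)": 6049666,
    "DnaC(Bzipped)": 5969287,
}


def saving(new_size, base_size):
    """Percentage by which new_size is smaller than base_size."""
    return 100.0 * (1.0 - new_size / base_size)


# Saving of each method relative to the original file and to Winzip.
for name, size in sizes.items():
    vs_original = saving(size, sizes["Original"])
    vs_winzip = saving(size, sizes["Winzip"])
    print(f"{name:14s} {vs_original:6.1f}% vs original "
          f"{vs_winzip:6.1f}% vs Winzip")
```

     For this one test file, DnaC alone comes out at roughly 78% 
smaller than the original, in line with the "about 75%" quoted above.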

My conclusions are as follows.
1. Using bzip alone reduces the size of the download for the human 
genome by 12% compared to the zipped version.
2. Using DnaC reduces the size of the download for the human genome 
by 20% compared with the zipped version.
3. Compressing the output of DnaC achieves another 8-10% in 
compression compared to using winzip.
4. We achieved a 200 MB reduction in the size of contigFa.zip. This 
makes it fit on one CD, with over 50 MB to spare on the CD.
(We used DnaC on every file, and then winzipped the result.)

Our own background is as follows. 
My name is John Sullivan. In 1994, I won the Irish informatics 
olympiad, and in 1997, I was 16th in the ACM programming contest. I 
have also won the Irish maths intervarsities, and the Irish maths 
Olympiad. I have a BSc Honors in Maths and Physics from UCC.
Tim Chang has a BS in Biology, with a concentration in Genetics, from 
the University of Massachusetts at Amherst.

If anyone is interested in a copy of the CD, we will FedEx it out for 
free. We are interested in the distribution of such a CD.

    Yours Sincerely
        John Sullivan and Tim Chang
