Alastair Fyfe
CMPS244 Project – Spring ‘01
Web Copy : http://www.cse.ucsc.edu/~afyfe/244p2.htm
Track on the Class Mirror Browser : c2h2zf
Introduction.
Protein binding to nucleic acids is an essential step in many biological processes, including gene expression, DNA duplication and DNA packing. Proteins have evolved a wide array of mechanisms for binding to DNA, each fit for a specific task. Mechanisms can be analyzed and classified by various criteria, of which two of the most important are affinity and specificity. For example, the processive DNA polymerases responsible for the bulk of DNA duplication bind single-stranded DNA with great affinity but little specificity. Other proteins, such as the large T antigen protein responsible for recognizing the origin of replication in duplication of SV40 DNA, bind relatively weakly but with great specificity. A recent survey of DNA binding proteins [2] organized the 240 DNA binding proteins whose solved structures are available in the Protein Data Bank into 8 overall classes, which are in turn divided into 54 structural families.
In this report I will concentrate on the cysteine-2 histidine-2 zinc finger (C2H2ZF) DNA-binding motif. This motif occurs very commonly in eukaryotic transcription factors and is among the best characterized of the DNA binding motifs. Its modular structure and relatively simple mode of binding have motivated the search for a "recognition code" whose discovery would enable the engineering of proteins capable of binding to any desired DNA sequence. Development of this capability is an essential prerequisite for therapeutic technologies, such as gene therapy, that depend on being able to target specific elements within the human genome.
The three main goals of the project were to:
The next two sections review the C2H2ZF motif, the first focusing on its structure and properties and the second on efforts to uncover a DNA recognition code. The fourth section summarizes some properties of the C2H2ZF motif occurrences found in the human genome and explains how those motifs were identified. The next two sections focus on the prediction of binding sites. The first of the two describes the methods used to adapt existing studies into a tool for predicting C2H2ZF binding sites, and the second describes the predictions obtained for a particular protein, ZNF151. The final two sections discuss some of the strengths and limitations of the approach used and suggest some interesting areas for additional work.
Cysteine-2, Histidine–2 Zinc Fingers
The first characterization of the C2H2ZF motif was done by Aaron Klug at MRC, Cambridge University, in 1985 [1]. The motif was found in the TFIIIA transcription factor for the 5S ribosomal RNA gene from Xenopus laevis, a protein that has the distinction of being the first known eukaryotic transcription factor. Inspection of the protein revealed the presence of 9 repeats of a 30-residue pattern and an abundance of zinc. Klug correctly predicted a structure in which the two Cys and two His residues conserved in each repeat tetrahedrally coordinate a zinc ion.
This was the first member of what has become a class of zinc-coordinated binding proteins that includes the hormone-nuclear receptor and GAL4-type families[2]. Hence the C2H2ZF motif is often referred to as the "classic" zinc finger.
The motif consists of about 30 amino acids. Its secondary structure elements are an α-helix and two anti-parallel β-strands. The two strands and the helix are held in a fixed conformation by the coordinated zinc ion and by 3 conserved hydrophobic residues [3].
The two model systems that have been studied most extensively are zif268, a three-finger transcription factor from mice involved in early stages of development, and TFIIIA, a nine-finger transcription factor essential for transcription of 5S ribosomal RNA in Xenopus. The first crystal structure of a C2H2 zinc finger was published by Pavletich and Pabo in ’91 [6], and a refined structure at 1.6 Å was published in ’96 [10] with PDB entry 1AAY. An image drawn from the 1AAY coordinates and available in the Protein-Nucleic Acid Complex Database, http://www.rtc.riken.go.jp/jouhou/3dinsight/complexdb.html, is shown below.

From the image it is apparent that residues in the α-helix are in close contact with Watson-Crick base pairs in the DNA major groove. The conventional numbering for the α-helix residues references the first residue immediately before the helix as –1.
C2H2ZF motifs possess a number of interesting characteristics:
The Search for a Recognition Code
The study of how proteins recognize specific DNA sequences has long fascinated molecular biologists. This section reviews progress in answering this question in the context of the C2H2ZF motif.
A 1976 paper by Seeman, Rosenberg and Rich [11] set out some important principles. The authors asked what pattern of hydrogen bonding would allow an amino acid inserted in the major groove of B-DNA to distinguish among the 4 possible Watson-Crick pairs. From inspection of the stereochemistry they concluded that, on the basis of hydrogen-bond discrimination alone, (a) a single hydrogen bond was inadequate for discrimination and (b) a few two-hydrogen-bond interactions were likely. These include a two-bond system between either asparagine or glutamine and the adenine side of a U(T)-A pair, and another two-bond system between arginine and the guanine in a G-C pair. Subsequent analysis has borne out these predictions. This paper led to an early appreciation of the fact that only a few of the interactions necessary for protein-DNA recognition are readily identifiable.
In 1991, Pavletich and Pabo [6] solved the structure of the three-finger zif268 protein complexed with DNA. Their crystallographic analysis provided experimental support for the predictions made by Seeman et al. The three triplets in the 5’ GCG TGG GCG 3’ sequence used in the formation of the protein-DNA complex were rich in guanines, and five of the six observed protein-DNA hydrogen bonds occurred between arginine and guanine.
In a study published in 1992, Jacobs [8] analyzed 1340 C2H2ZF motifs from 221 proteins, primarily transcription factors, obtained by scanning protein databases and the literature for the appropriate pattern. Jacobs analyzed the pattern of variation and reasoned that sites that tended to vary significantly within a multi-finger protein but were relatively conserved across similar proteins were probably involved in DNA binding, i.e. in sequence recognition. He concluded that positions –1, 3 and 6 were most likely to be involved in DNA recognition. These results agreed with analysis of the crystal structure and suggested that a protein-DNA recognition code could be found by focusing on three specific positions within the C2H2ZF motif.
A number of different studies in the early ‘90s used combinatorial chemistry techniques to try to elucidate this code. The most comprehensive was reported in two papers by Yen Choo and Aaron Klug [4][5]. These studies built variants of zif268 by randomly altering the amino acids at helix positions –1 to 8 (excluding the conserved Leu and His) of the middle finger. The ability of the variants to bind to each of the 64 possible DNA triplets was then tested by means of phage-display based selection. Their results supported the identification of residues –1, 3 and 6 as the key determinants of binding specificity, though position 2 was also found to play an important, if less direct, role. The data published in these studies form the basis of the binding-site predictions computed in this project, as explained in the section “Quantifying Binding Preferences”.
More recent studies have made it apparent that no simple C2H2ZF recognition code is likely to be found. The solution of the structure of the GLI protein and detailed comparison of the geometry of different C2H2ZF docking arrangements in known structures [12] indicate that a number of distinct arrangements of the motif exist, each with a distinct recognition code. Even for the canonical zif268 docking arrangement, factors such as linker length and both inter-finger and intra-finger correlation have been found to play a significant role. Nevertheless, while the overall recognition code, even for zif268, may not be simple, there is good reason to expect that accumulation of sufficient experimental data will yield a database from which high-quality predictions can be computed.
This section describes how occurrences of C2H2ZF motifs in the human genome were identified and summarizes some properties of those occurrences. This data is already available elsewhere: several of the existing annotations of human genome data include information about the protein family membership of gene products. For example, querying the UCSC genome browser for "zinc finger" returns information about 460 mRNA associated gene locations. However, for this project I was interested in investigating the mechanics of locating a particular protein motif and thus did not rely on the existing classifications.
Overall, the search strategy consisted of three steps. First, the protein sequences that correspond to the translation products of known genes were extracted. This collection of protein sequences was then searched with the HMMER [14] search tools using the PFAM [13] HMM for the C2H2ZF motif. This search yielded a set of protein subsequences that were likely instances of the motif. Finally, the highest scoring entries in this set were post-processed to aggregate adjoining fingers into polydactyl motifs.
HMMER is a collection of hidden Markov model (HMM) tools developed and distributed by Sean Eddy at Washington University in St. Louis. The suite includes tools for building profile HMMs from a multiple alignment, estimating the distribution function of scores calculated by an HMM and searching a database of sequences using an existing HMM. I used version 2.1.1 of the package.
The PFAM database[13] maintains a collection of hand-curated multiple alignments associated with known protein folds. For each such alignment, the database also provides a profile HMM whose states and emission probabilities are derived from the alignment. The HMM used in this project is named "zf-C2H2" with accession number PF00096. The associated alignment includes 10312 proteins, a number that attests to the widespread distribution and well-characterized nature of this domain.
The protein sequences used for the project were obtained from the October 7, 2000 version of the genome database using the annotation "genieKnownPep". These are 8243 protein sequences of known genes. Application of the zf-C2H2 HMM yielded 1825 possible motif instances in 404 genes. A motif was discarded if its score was low enough that the expected number of occurrences with that score was 1 or greater. This cutoff reduced the number of motifs to 1524.
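The cutoff rule can be sketched as follows. This is a minimal illustration rather than the script used for the project: the `Hit` record, its field names and the sample values are hypothetical, and the `evalue` field is assumed to hold the expected number of occurrences reported by the search tool for each hit.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    gene: str      # hypothetical gene identifier
    start: int     # 1-based start of the motif within the protein
    end: int       # 1-based inclusive end of the motif
    score: float   # HMM score of the hit
    evalue: float  # expected number of occurrences with at least this score


def filter_hits(hits, max_evalue=1.0):
    """Keep only hits whose expected number of occurrences is below max_evalue."""
    return [h for h in hits if h.evalue < max_evalue]


# Made-up hits for illustration only.
hits = [
    Hit("geneA", 10, 32, 35.2, 1e-8),
    Hit("geneA", 39, 61, 4.1, 2.5),   # dropped: expected count is 1 or greater
    Hit("geneB", 5, 27, 12.0, 0.9),
]
kept = filter_hits(hits)
print(len(kept), "of", len(hits), "hits kept")
```

Filtering on the expected count rather than on a fixed raw score keeps the rule independent of the size of the sequence collection being searched.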
The next step was to assemble adjacent fingers into separate polydactyl motifs. The distribution of linker lengths is shown below; linkers of length 21 or more have been aggregated.
| Linker length | Occurrences |
| 0 | 209 |
| 1 | 1 |
| 2 | 1 |
| 3 | 9 |
| 4 | 7 |
| 5 | 23 |
| 6 | 1042 |
| 7 | 21 |
| 8 | 17 |
| 9 | 16 |
| 10 | 6 |
| 11 | 4 |
| 12 | 2 |
| 13 | 1 |
| 16 | 3 |
| 17 | 2 |
| 18 | 4 |
| 19 | 3 |
| 20 | 1 |
| 21 and higher | 150 |
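The aggregation of adjacent fingers and the tally of linker lengths could be sketched along the following lines. This is an assumption-laden illustration, not the project's code: fingers are taken to be (start, end) pairs in 1-based inclusive protein coordinates, the linker length is defined as the number of residues strictly between two consecutive fingers, and the `max_linker=20` threshold for splitting polydactyl motifs is hypothetical, since the text does not state the value that was used.

```python
from collections import Counter


def linker_lengths(fingers):
    """Number of residues between consecutive fingers of one protein.

    `fingers` is a list of (start, end) pairs, 1-based and inclusive.
    """
    ordered = sorted(fingers)
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(ordered, ordered[1:])]


def group_fingers(fingers, max_linker=20):
    """Split the fingers of one protein into runs joined by short linkers."""
    groups = []
    for finger in sorted(fingers):
        # Extend the current run if the gap to its last finger is short enough.
        if groups and finger[0] - groups[-1][-1][1] - 1 <= max_linker:
            groups[-1].append(finger)
        else:
            groups.append([finger])
    return groups


# Made-up coordinates: three closely spaced fingers and one distant finger.
fingers = [(10, 32), (39, 61), (68, 90), (150, 172)]
lengths = linker_lengths(fingers)
groups = group_fingers(fingers)

# Pool every linker of length 21 or more into one bin, as in the table.
histogram = Counter(min(n, 21) for n in lengths)
print(lengths, groups, dict(histogram))
```

Under this definition a finger that begins immediately after the previous one ends has a linker length of 0.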
The common occurrence of a linker of size six is consistent with the canonical five-residue linker in zif268, though it is not yet clear whether the single-residue discrepancy is real or simply due to the way adjacent alignments were calculated for this project. A sequence logo for the 1042 linkers of size 6, drawn by the WebLogo server (http://www.bio.cam.ac.uk/seqlogo/), is shown below. Again, the composition of the linker is in very good agreement with the canonical TGEKP linker sequence that occurs in more than half [3] of the five-residue linkers in the Transcription Factors Database.

For purposes of prediction, one of the most interesting characteristics of C2H2ZF motifs is the distribution of residues at the key –1, 3 and 6 positions of the α-helix.
| Helix position | -1 | 3 | 6 |
| Ala | 18 | 102 | 89 |
| Cys | 41 | 17 | 6 |
| Asp | 81 | 84 | 28 |
| Glu | 44 | 84 | 95 |
| Phe | 15 | 25 | 6 |
| Gly | 24 | 55 | 21 |
| His | 108 | 228 | 21 |
| Ile | 6 | 17 | 94 |
| Lys | 69 | 53 | 171 |
| Leu | 41 | 38 | 72 |
| Met | 18 | 17 | 26 |
| Asn | 68 | 224 | 73 |
| Gln | 324 | 95 | 211 |
| Arg | 294 | 39 | 290 |
| Ser | 120 | 245 | 83 |
| Thr | 105 | 90 | 89 |
| Val | 31 | 42 | 119 |
| Trp | 52 | 1 | 1 |
| X(err) | 3 | 1 | 0 |
| Tyr | 62 | 60 | 28 |
| Pro | 0 | 7 | 1 |
The large number of Asn, Glu, Gln, Arg and His residues is consistent with the expected occurrence of residues capable of forming two hydrogen bonds with Watson-Crick pairs in the major groove. The common occurrence of Ser is more difficult to account for.
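A tally of this kind can be sketched as below. The sketch assumes each helix is supplied as a string whose first character is position –1; because the numbering convention has no position 0, positions –1, 3 and 6 then fall at string indices 0, 3 and 6. The three sample strings and the names `HELIX_INDEX` and `tally` are illustrative only and are not the genome-derived data summarized in the table.

```python
from collections import Counter

# Helix position -> index in a string that starts at position -1.
# There is no position 0, so positions 1, 2, 3, ... sit at indices 1, 2, 3, ...
HELIX_INDEX = {-1: 0, 3: 3, 6: 6}


def tally(helices):
    """Count the residues found at helix positions -1, 3 and 6."""
    counts = {pos: Counter() for pos in HELIX_INDEX}
    for helix in helices:
        for pos, idx in HELIX_INDEX.items():
            counts[pos][helix[idx]] += 1
    return counts


# Illustrative helix strings (one-letter amino acid codes).
helices = ["RSDHLTTHIR", "RVDALEAHRR", "DRASLASHMR"]
counts = tally(helices)
for pos in HELIX_INDEX:
    print(pos, counts[pos].most_common())
```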
Quantifying Binding Preferences
The "binding site signatures" reported by Choo and Klug [5] provide the most comprehensive published data to date on the specificity of DNA triplet binding obtained for a zif268-like C2H2ZF motif from a background of randomly selected amino acids. This section briefly reviews the methods they used and then describes how their published data were adapted for use in this project.
The phage display protocol used by Choo and Klug [4] entailed three steps. In the first step, a library of about 2.6E6 variants of wild-type zif268 was cloned and ligated into the gene encoding the protein coat of fd phage. These combinatorial variations of zif268 were designed by randomly varying the nucleotides responsible for encoding positions –1 to 8 of the middle finger of zif268. Following transcription and translation, each of the resulting fd phages displayed one of the synthetic C2H2ZF fingers on its protein coat. The second step of the protocol involved affinity purification. A library of oligodeoxynucleotides was prepared which included the appropriate sequences for binding by fingers 1 and 3 of the modified zif268 and all 64 possible combinations for the binding site of the middle finger, finger 2. The oligodeoxynucleotides were attached to magnetic beads by biotinylation and were put in contact with the variant fd phage. The third step consisted of further purification and sequencing, after which the results were summarized in a table listing 16 of the 64 possible DNA triplets and a mere 33 of the 2.6E6 possible C2H2ZF motifs.
To further strengthen their conclusions, the authors extended the analysis in an accompanying paper [5]. In this paper they created a 33x12 array whose rows consisted of the C2H2ZF variants described above and whose columns were the elements of a library of DNA triplets with one position fixed and two randomized. Thus, for example, the triplets in the first column all had G in the first position and random nucleotides in the last two. The authors measured the strength of binding association at each of the 396 entries in the array and reported the results in a figure shown, in part, below.

By analyzing the binding preference of each entry in the array, Choo and Klug obtained a "binding site signature" for each variant, displayed in the far right column. For example, the 2nd row from the top indicates that the C2H2ZF motif with arginine at position –1 and alanine at positions 3 and 6 of the α-helix displays a fairly specific preference for the triplet GTG. The authors went on to summarize the binding site signatures in a 4x3 table (not shown) that set out the major trends apparent in the 33 binding site signatures.
While very informative, the results presented in these two papers could not be used directly to provide automated and approximate prediction of binding sites. To adapt them to this purpose, the image shown above was scanned and, for each array entry, a 12x12 sample of its gray-scale pixels was selected. The locations of the sample points varied somewhat, as shown below; however, each sample remained within the corresponding array entry.

An average value was obtained from the 144 pixels sampled and used as a measurement of the strength of binding. The values of the four columns that correspond to a particular base position in the DNA triplet were converted to probabilities, which are shown below (multiplied by 1000).
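The conversion from averaged pixel values to probabilities can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the 12 values for a finger variant are assumed to be ordered as three consecutive groups of four (one group per triplet position), the values are assumed to be oriented so that a larger number means stronger binding, and the base order `GATC` within each group is an assumption chosen to agree with the GTG example discussed above. The input numbers are made up.

```python
BASES = "GATC"  # assumed column order within each triplet position


def to_probabilities(values):
    """Normalize 12 binding strengths into three 4-way probability vectors."""
    if len(values) != 12:
        raise ValueError("expected 12 values: 3 triplet positions x 4 bases")
    rows = []
    for i in range(0, 12, 4):
        group = values[i:i + 4]
        total = sum(group)
        rows.append([v / total for v in group])
    return rows


def preferred_triplet(probabilities):
    """Pick the most probable base at each of the three triplet positions."""
    return "".join(BASES[row.index(max(row))] for row in probabilities)


# Hypothetical averaged binding strengths for one finger variant.
values = [93, 42, 41, 24, 30, 25, 120, 25, 114, 30, 29, 26]
probs = to_probabilities(values)
print([[round(1000 * p) for p in row] for row in probs])
print(preferred_triplet(probs))
```

Normalizing within each group of four treats the three triplet positions as independent, which is a simplification given the inter-position effects noted earlier.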
| RSDHLTTHIR | 331 | 68 | 519 | 79 | 800 | 113 | 86 | 0 | 750 | 94 | 77 | 77 |
| RVDALEAHRR | 465 | 209 | 204 | 119 | 150 | 124 | 601 | 124 | 572 | 151 | 144 | 131 |
| DRASLASHMR | 461 | 50 | 461 | 25 | 68 | 106 | 694 | 129 | 59 | 752 | 153 | 34 |
| NRDTLTRHSK | 738 | 103 | 79 | 79 | 90 | 71 | 603 | 233 | 77 | 124 | 461 | 336 |
| QKGHLTEHRK | 289 | 152 | 280 | 277 | 411 | 194 | 207 | 186 | 160 | 289 | 276 | 273 |
| QSVHLQSHSR | 246 | 120 | 497 | 136 | 550 | 137 | 161 | 149 | 221 | 364 | 217 | 196 |
| RLDGLRTHLK | 385 | 101 | 406 | 105 | 159 | 283 | 280 | 277 | 560 | 158 | 140 | 140 |
|
TPGNLTRHGR |
552 |
164 |
135 |
147 |
213 |
396 |
198 |
190 |
167 |
161 |