Alastair Fyfe

CMPS244 Project – Spring ‘01

Web Copy : http://www.cse.ucsc.edu/~afyfe/244p2.htm

Track on the Class Mirror Browser : c2h2zf

 

C2H2 Zinc Fingers in the Human Genome: Motif Occurrences and Binding Site Prediction

 

    1. Introduction
    2. Cysteine-2 Histidine-2 Zinc Fingers
    3. The Search for a Recognition Code
    4. Finding Instances of the C2H2ZF Motif in the Human Genome
    5. Quantifying Binding Preferences
    6. Predicting Binding Sites
    7. Discussion
    8. Further Work
    9. Acknowledgements
    10. References

 

 

Introduction.

Protein binding to nucleic acids is an essential step in many biological processes including gene expression, DNA duplication and DNA packing. Proteins have evolved a wide array of mechanisms for binding to DNA, each fit for a specific task. Mechanisms can be analyzed and classified by various criteria, of which two of the most important are affinity and specificity. For example, the processive DNA polymerases responsible for the bulk of DNA duplication bind single-stranded DNA with great affinity but little specificity. Other proteins, such as the large T antigen protein responsible for recognizing the origin of replication in duplication of SV40 DNA, bind relatively weakly but with great specificity. A recent survey of DNA binding proteins [2] organized the 240 DNA binding proteins whose solved structures are available in the Protein Data Bank into 8 overall classes, which are in turn divided into 54 structural families.

In this report I will concentrate on the cysteine-2 histidine-2 zinc finger (C2H2ZF) DNA-binding motif. This motif occurs very commonly in eukaryotic transcription factors and is among the best characterized of the DNA binding motifs. Its modular structure and relatively simple mode of binding have motivated the search for a "recognition code" whose discovery would enable the engineering of proteins capable of binding to any desired DNA sequence. Development of this capability is an essential prerequisite for therapeutic technologies, such as gene therapy, that depend on being able to target specific elements within the human genome.

The three main goals of the project were to:

    1. discover instances of C2H2ZF motifs in the current draft of the human genome and analyze their characteristics,
    2. review current literature on the status of a C2H2ZF "recognition code" and try to adapt applicable results into a tool capable of predicting likely binding sites,
    3. use this tool to scan the genome and characterize the properties of the reported putative binding sites

The next two sections review the C2H2ZF motif: the first focuses on its structure and properties and the second reviews efforts to uncover a DNA recognition code. The fourth section explains how occurrences of the C2H2ZF motif in the human genome were identified and summarizes some of their properties. The following two sections focus on the prediction of binding sites: the first describes the methods used to adapt existing studies into a tool for predicting C2H2ZF binding sites and the second describes the predictions obtained for a particular protein, ZNF151. The final two sections discuss some of the strengths and limitations of the approach and suggest some interesting areas for additional work.

 

Cysteine-2, Histidine–2 Zinc Fingers

The first characterization of the C2H2ZF motif was done by Aaron Klug at the MRC Laboratory of Molecular Biology, Cambridge, in 1985 [1]. The motif was found in the TFIIIA transcription factor for the 5S ribosomal RNA gene from Xenopus laevis, a protein that has the distinction of being the first known eukaryotic transcription factor. Inspection of the protein revealed the presence of 9 repeats of a 30 residue pattern and an abundance of zinc. Klug correctly predicted a structure in which the two Cys and two His residues conserved in each repeat tetrahedrally coordinate a zinc ion.

This was the first member of what has become a class of zinc-coordinated binding proteins that includes the hormone-nuclear receptor and GAL4-type families[2]. Hence the C2H2ZF motif  is often referred to as the "classic" zinc finger.

The motif consists of about 30 amino acids. Its secondary structure elements are an α-helix and two anti-parallel β-strands. The two strands and the helix are held in a fixed conformation by the coordinated zinc ion and by 3 conserved hydrophobic residues [3].

The two model systems that have been studied most extensively are zif268, a three-finger transcription factor from mice involved in early stages of development, and TFIIIA, a nine-finger transcription factor essential for transcription of 5S ribosomal RNA in Xenopus. The first crystal structure of a C2H2 zinc finger was published by Pavletich and Pabo in 1991 [6], and a refined structure at 1.6 Å was published in 1996 [10] with PDB entry 1AAY. An image drawn from the 1AAY coordinates and available in the Protein-Nucleic Acid Complex Database, http://www.rtc.riken.go.jp/jouhou/3dinsight/complexdb.html, is shown below.

From the image it is apparent that residues in the α-helix are in close contact with Watson-Crick base pairs in the DNA major groove. The conventional numbering for the α-helix residues references the first residue immediately before the helix as –1.

C2H2ZF motifs possess a number of interesting characteristics:

  1. They are widespread: over a thousand instances of the motif have been identified, often in transcription factor proteins. They have been found in all eukaryotes studied to date and apparently do not occur in prokaryotes.
  2. They rely on a relatively simple DNA-binding structure which reads a triplet of bases from the DNA major groove. Fingers can be combined, via “linkers”, to form polydactyl motifs capable of recognizing longer sequences.
  3. They have been the subject of extensive study and are among the best understood of the DNA-binding proteins.
  4. Synthetically engineered C2H2ZF proteins have shown great success in binding targeted DNA binding sites with considerable specificity and affinity.
  5. C2H2ZF proteins have also been studied in RNA and protein-protein binding.

 

The Search for a Recognition Code

The study of how proteins recognize specific DNA sequences has long fascinated molecular biologists. This section reviews progress in answering this question in the context of the C2H2ZF motif.

A 1976 paper by Seeman, Rosenberg and Rich [11] set out some important principles. The authors asked what pattern of hydrogen bonding would allow an amino acid inserted in the major groove of B-DNA to distinguish among the 4 possible Watson-Crick pairs. From inspection of the stereochemistry they concluded that, on the basis of hydrogen-bond discrimination alone, (a) a single hydrogen bond was inadequate for discrimination and (b) a few two-hydrogen-bond interactions were likely. These include a two-bond system between either asparagine or glutamine and the adenine side of a U(T)-A pair and another two-bond system between arginine and the guanine in a GC pair. Subsequent analysis has borne out these predictions. This paper led to an early appreciation of the fact that only a few of the interactions necessary for protein-DNA recognition are readily identifiable.

In 1991, Pavletich and Pabo [6] solved the structure of the three-finger zif268 protein complexed with DNA. Their crystallographic analysis provided experimental support for the predictions made by Seeman et al. The three triplets in the 5’ GCG TGG GCG 3’ sequence used in the formation of the protein-DNA complex were rich in guanines, and five of the six observed protein-DNA hydrogen bonds occurred between arginine and guanine.

In a study published in 1992, Jacobs [8] analyzed 1340 C2H2ZF motifs obtained from 221 proteins, primarily transcription factors, found by scanning protein databases and the literature for the appropriate pattern. Jacobs analyzed the pattern of variation and reasoned that sites that tended to vary significantly within a multi-finger protein but were relatively conserved across similar proteins were probably involved in DNA binding, i.e. in sequence recognition. He concluded that positions –1, 3 and 6 were most likely to be involved in DNA recognition. These results agreed with analysis of the crystal structure and suggested that a protein-DNA recognition code could be found by focusing on three specific positions within the C2H2ZF motif.

A number of different studies in the early ‘90s used combinatorial chemistry techniques to try to elucidate this code. The most comprehensive was reported in two papers by Yen Choo and Aaron Klug [4][5]. These studies built variants of zif268 by randomly altering the amino acids at helix positions –1 to 8 (excluding the conserved Leu and His) of the middle finger. The ability of the variants to bind to each of the 64 possible DNA triplets was then tested by means of phage-display based selection. Their results supported the identification of residues –1, 3 and 6 as the key determinants of binding specificity, though position 2 was also found to play an important, if less direct, role. The data published in these studies form the basis of the binding-site predictions computed in this project, as explained in the section “Quantifying Binding Preferences”.

More recent studies have made it apparent that no simple C2H2ZF recognition code is likely to be found. The solution of the structure of the GLI protein and detailed comparison of the geometry of different C2H2ZF docking arrangements in known structures [12] indicate that a number of distinct arrangements of the motif exist, each with a distinct recognition code. Even for the canonical zif268 docking arrangement, factors such as linker length and both inter-finger and intra-finger correlation have been found to play a significant role. Nevertheless, while the overall recognition code, even for zif268, may not be simple, there is good reason to expect that accumulation of sufficient experimental data will yield a database from which high-quality predictions can be computed.

Finding instances of the C2H2ZF motif in the human genome

This section describes how occurrences of C2H2ZF motifs in the human genome were identified and summarizes some properties of those occurrences. This data is already available elsewhere: several of the existing annotations of human genome data include information about the protein family membership of gene products. For example, querying the UCSC genome browser for "zinc finger" returns information about 460 mRNA associated gene locations. However, for this project I was interested in investigating the mechanics of locating a particular protein motif and thus did not rely on the existing classifications.

Overall, the search strategy consisted of three steps. First, the protein sequences that correspond to the translation product of known genes were extracted. This collection of protein sequences was then searched with the HMMER[14] search tools using the PFAM[13] hmm for the C2H2ZF motif. This search yielded a set of protein subsequences that were likely instances of the motif. Finally, the highest scoring entries in this set were post-processed to aggregate adjoining fingers into polydactyl motifs.

HMMER is a collection of hidden Markov model (HMM) tools developed and distributed by Sean Eddy at Washington University in St. Louis. The suite includes tools for building profile HMMs from a multiple alignment, estimating the distribution function of scores calculated by an HMM and searching a database of sequences using an existing HMM. I used version 2.1.1 of the package.

The PFAM database[13] maintains a collection of hand-curated multiple alignments associated with known protein folds. For each such alignment, the database also provides a profile HMM whose states and emission probabilities are derived from the alignment. The HMM used in this project is named "zf-C2H2" with accession number PF00096. The associated alignment includes 10312 proteins, a number that attests to the widespread distribution and well-characterized nature of this domain.

The protein sequences used for the project were obtained from the October 7, 2000 version of the genome database using the annotation "genieKnownPep". These are 8243 protein sequences of known genes. Application of the zf-C2H2 HMM yielded 1825 possible motif instances in 404 genes. Motif instances whose score was low enough that the expected number of occurrences with that score was 1 or greater were discarded. This cutoff reduced the number of motifs to 1524.
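The cutoff described above amounts to keeping only hits whose expected number of chance occurrences (the E-value reported alongside each score) is below 1. The following is a minimal sketch of that filter in Python, assuming the search output has already been parsed into records. The `Hit` type, its field names and the sample values are hypothetical and are not taken from the project's data.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One candidate motif instance (hypothetical record layout)."""
    protein: str    # identifier of the protein sequence
    start: int      # first residue of the match (1-based)
    end: int        # last residue of the match (inclusive)
    score: float    # HMM score of the match
    evalue: float   # expected number of chance hits with this score


def filter_hits(hits, max_evalue=1.0):
    """Keep hits whose E-value is strictly below max_evalue."""
    return [h for h in hits if h.evalue < max_evalue]


# Made-up example values, for illustration only.
hits = [
    Hit("protA", 10, 32, 25.1, 1e-5),
    Hit("protA", 38, 60, 3.2, 0.8),
    Hit("protB", 5, 27, -4.0, 2.5),
]
kept = filter_hits(hits)
print(len(kept), "of", len(hits), "hits kept")
```

Applied to the real search output, a filter of this kind would correspond to the reduction from 1825 to 1524 motif instances reported above.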

The next step was to assemble adjacent fingers into separate polydactyl motifs. The distribution of linker lengths is shown below; linkers of length 21 or more have been aggregated.

Linker length    Occurrences
0                 209
1                   1
2                   1
3                   9
4                   7
5                  23
6                1042
7                  21
8                  17
9                  16
10                  6
11                  4
12                  2
13                  1
16                  3
17                  2
18                  4
19                  3
20                  1
21 and higher     150
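The aggregation step and the linker tally can be sketched as follows. This is only an illustration under stated assumptions: fingers are given as inclusive 1-based (start, end) coordinates within one protein, and the `max_linker` threshold that decides when two fingers belong to the same polydactyl motif is a hypothetical parameter, since the report does not state the value actually used.

```python
from collections import Counter


def linker_lengths(fingers):
    """Number of residues between each pair of consecutive fingers.

    fingers: list of (start, end) tuples, inclusive coordinates.
    """
    ordered = sorted(fingers)
    return [nxt[0] - cur[1] - 1 for cur, nxt in zip(ordered, ordered[1:])]


def group_fingers(fingers, max_linker=20):
    """Group fingers into polydactyl motifs.

    Two consecutive fingers join the same group when the linker between
    them is at most max_linker residues (hypothetical threshold).
    """
    groups = []
    for finger in sorted(fingers):
        if groups and finger[0] - groups[-1][-1][1] - 1 <= max_linker:
            groups[-1].append(finger)
        else:
            groups.append([finger])
    return groups


# Made-up coordinates: three closely spaced fingers and one distant finger.
fingers = [(10, 32), (39, 61), (68, 90), (200, 222)]
histogram = Counter(linker_lengths(fingers))
groups = group_fingers(fingers)
print(histogram)
print(groups)
```

Running `linker_lengths` over every protein and merging the resulting counters would give a table of the same form as the one above.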

 

The common occurrence of a linker of size six is consistent with the canonical five-residue linker in zif268, though it is not yet clear whether the single residue discrepancy is real or simply due to the way adjacent alignments were calculated for this project. A sequence logo, drawn by the WebLogo server (http://www.bio.cam.ac.uk/seqlogo/), for the 1042 linkers of size 6 is shown below. Again, the composition of the linker is in very good agreement with the canonical TGEKP linker sequence that occurs in more than half [3] of 5 residue linkers in the Transcription Factors Database.

 

For purposes of prediction, one of the most interesting characteristics of C2H2ZF motifs is the distribution of residues at the key –1, 3 and 6 positions of the α-helix.

 

Helix Position:     -1      3      6
Ala                 18    102     89
Cys                 41     17      6
Asp                 81     84     28
Glu                 44     84     95
Phe                 15     25      6
Gly                 24     55     21
His                108    228     21
Ile                  6     17     94
Lys                 69     53    171
Leu                 41     38     72
Met                 18     17     26
Asn                 68    224     73
Gln                324     95    211
Arg                294     39    290
Ser                120    245     83
Thr                105     90     89
Val                 31     42    119
Trp                 52      1      1
X(err)               3      1      0
Tyr                 62     60     28
Pro                  0      7      1

 

The large number of Asn, Glu, Gln, Arg and His residues is consistent with the expected occurrence of residues capable of forming two hydrogen bonds with Watson-Crick pairs in the major groove. The common occurrence of Ser is more difficult to account for.
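To put a number on "large", the counts for the five residues named above can be summed per helix position and divided by the 1524 motifs (each column of the table sums to 1524). The sketch below does this arithmetic. The counts are copied from the table, while the variable and function names are only illustrative.

```python
# Counts at helix positions -1, 3 and 6, copied from the table above.
counts = {
    "Asn": (68, 224, 73),
    "Glu": (44, 84, 95),
    "Gln": (324, 95, 211),
    "Arg": (294, 39, 290),
    "His": (108, 228, 21),
}
TOTAL_MOTIFS = 1524          # each column of the table sums to this value
POSITIONS = ("-1", "3", "6")


def share_by_position(counts, total):
    """Fraction of motifs carrying one of the listed residues, per position."""
    column_sums = [sum(column) for column in zip(*counts.values())]
    return {pos: s / total for pos, s in zip(POSITIONS, column_sums)}


shares = share_by_position(counts, TOTAL_MOTIFS)
for pos, fraction in shares.items():
    print(f"position {pos}: {fraction:.1%}")
```

The five residues together account for roughly 55%, 44% and 45% of the motifs at positions -1, 3 and 6 respectively, although the table also shows that they are not evenly spread: His is rare at position 6 and Arg is rare at position 3.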

 

Quantifying Binding Preferences

The "binding site signatures" reported by Choo and Klug [5] provide the most comprehensive published data to date on the specificity of DNA triplet binding obtained for a zif268-like C2H2ZF motif from a background of randomly selected amino acids. This section briefly reviews the methods they used and then describes how their published data were adapted for use in this project.

The phage display protocol used by Choo and Klug [4] entailed three steps. In the first step, a library of about 2.6E6 variants of wild-type zif268 was cloned and ligated into the gene encoding the protein coat of fd phage. These combinatorial variations of zif268 were designed by randomly varying the nucleotides responsible for encoding positions –1 to 8 of the middle finger of zif268. Following transcription and translation, the resulting fd phages displayed one of the synthetic C2H2ZF fingers on their protein coat. The second step of the protocol involved affinity purification. A library of oligodeoxynucleotides was prepared which included the appropriate sequences for binding by fingers 1 and 3 of the modified zif268 and all 64 possible triplets for the binding site of the middle finger (finger two). The oligodeoxynucleotides were attached to magnetic beads by biotinylation and were put in contact with the variant fd phage. After further purification and sequencing, the results were summarized in a table listing 16 of the 64 possible DNA triplets and a mere 33 of the 2.6E6 possible C2H2ZF motifs.

To further strengthen their conclusions, the authors extended the analysis in an accompanying paper [5]. In this paper they created a 33x12 array whose rows consisted of the C2H2ZF variants described above and whose columns were the elements of a library of DNA triplets with one position fixed and two randomized. Thus, for example, the oligonucleotides in the first column all had G in the first position and random nucleotides in the last two. The authors measured the strength of binding at each of the 396 entries in the array and reported the results in a figure shown, in part, below.

 

 

By analyzing the binding preference of each entry in the array, Choo and Klug obtained the "binding site signatures" displayed in the far right column. For example, the 2nd row from the top indicates that the C2H2ZF motif with arginine at position –1 and alanine at positions 3 and 6 of the α-helix displays a fairly specific preference for the triplet GTG. The authors went on to summarize the binding site signatures in a 4x3 table that set out the major trends apparent in the 33 binding site signatures (not shown).
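The helix numbering used in this example can be made concrete with a short sketch. It assumes that each variant is written as a ten-letter string starting at helix position -1 and that the numbering skips 0, so that position -1 is the first letter and position n (n >= 1) is letter n counting from zero. This reading is consistent with the example above: the string RVDALEAHRR has Arg at -1 and Ala at 3 and 6, with the conserved Leu and His at positions 4 and 7. The function names are illustrative only.

```python
def helix_index(position):
    """Map a helix position (-1, 1, 2, ...) to a 0-based string index.

    Assumes the numbering has no position 0, so -1 maps to index 0
    and every position n >= 1 maps to index n.
    """
    if position == -1:
        return 0
    if position >= 1:
        return position
    raise ValueError(f"unsupported helix position: {position}")


def key_residues(helix, positions=(-1, 3, 6)):
    """Return the residues at the given helix positions as a string."""
    return "".join(helix[helix_index(p)] for p in positions)


print(key_residues("RVDALEAHRR"))  # Arg, Ala, Ala
```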

 

While very informative, the results presented in these two papers could not be used directly to provide automated, approximate prediction of binding sites. To adapt them to this purpose, the image shown above was scanned and a 12x12 pixel sample of the gray-scale pixel values displayed for each array entry was selected. The locations of the sample points varied somewhat, as shown below; however, each sample remained within the corresponding array entry.

 

 

An average value was obtained from the 144 pixels sampled and used as a measurement of the strength of binding. The values of the four columns that correspond to a particular base position in the DNA triplet were converted to probabilities, which are shown below (multiplied by 1000).
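The averaging and normalization just described can be sketched as follows. The sketch assumes that each array entry has already been reduced to a signal in which a larger value means stronger binding (for example, an inverted gray level), which the report does not spell out. The function names and sample numbers are hypothetical. Truncating to an integer is also an assumption: it would explain why each group of four values in the table sums to slightly less than 1000, but the rounding actually used is not stated.

```python
def mean_intensity(patch):
    """Average a rectangular patch (list of rows) of gray-scale values."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


def to_probabilities(signals, scale=1000):
    """Normalize each consecutive group of four signals to sum to ~scale.

    signals: 12 binding signals for one variant, four per base position.
    Returns integers (probability * scale, truncated).
    """
    if len(signals) % 4 != 0:
        raise ValueError("expected a multiple of four signals")
    result = []
    for i in range(0, len(signals), 4):
        group = signals[i:i + 4]
        total = sum(group)
        if total == 0:
            raise ValueError("cannot normalize a group with zero signal")
        result.extend(int(scale * s / total) for s in group)
    return result


# Made-up signals for one variant (three base positions, four columns each).
signals = [8, 2, 6, 4, 1, 1, 1, 1, 3, 3, 3, 1]
probabilities = to_probabilities(signals)
print(probabilities)
```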

 

Variant       Base position 1        Base position 2        Base position 3
RSDHLTTHIR    331   68  519   79     800  113   86    0     750   94   77   77
RVDALEAHRR    465  209  204  119     150  124  601  124     572  151  144  131
DRASLASHMR    461   50  461   25      68  106  694  129      59  752  153   34
NRDTLTRHSK    738  103   79   79      90   71  603  233      77  124  461  336
QKGHLTEHRK    289  152  280  277     411  194  207  186     160  289  276  273
QSVHLQSHSR    246  120  497  136     550  137  161  149     221  364  217  196
RLDGLRTHLK    385  101  406  105     159  283  280  277     560  158  140  140

TPGNLTRHGR

552

164

135

147

213

396

198

190

167

161