© 2000 Nature America Inc. • http://structbio.nature.com

progress

Protein NMR spectroscopy in structural genomics

Gaetano T. Montelione1, Deyou Zheng1, Yuanpeng J. Huang1, Kristin C. Gunsalus1 and Thomas Szyperski2

Protein NMR spectroscopy provides an important complement to X-ray crystallography for structural genomics, both for determining three-dimensional protein structures and in characterizing their biochemical and biophysical functions.

Structural genomics involves the determination, analysis, and dissemination of the three-dimensional structures of all protein and RNA molecules in nature, providing new opportunities at the interface of structural biology, functional genomics, and bioinformatics. This very ambitious goal requires both large-scale structure determination and amplification of these data by high-throughput modeling. It is generally recognized that X-ray crystallography using synchrotron radiation, and multiwavelength anomalous dispersion (MAD) methods1 for determining the phase information required for crystallographic analysis, will play a central role in genomic-scale structural analysis (see the articles by Stevens and colleagues, and Lamzin and Perrakis). Solution state NMR will also have a complementary role in post-genomic analysis, particularly considering that (i) many protein targets do not provide crystals suitable for crystallographic analysis; (ii) some 15–20% of new protein structures are determined by NMR methods; and (iii) sequence-specific resonance assignments provide the basis for various kinds of functional characterization.

Strengths and weaknesses of NMR in structural genomics

Several features of solution-state NMR make it particularly suitable for structure-function analysis and structural genomics. Structural analysis by NMR does not require protein crystals.
Most (∼75%) of the NMR structures in the Protein Data Bank (PDB) do not have corresponding crystal structures, and many of these simply do not provide diffraction quality crystals. Moreover, NMR studies can be carried out in aqueous solution under conditions quite similar to the physiological conditions under which the protein normally functions. This feature allows comparisons to be made between subtly different solution conditions that may modulate structure-function relationships. For example, pH titration data can be used to determine pKa values of specific ionizable groups in the protein and to characterize the corresponding structure-function relationships. While most crystal structures are determined under physiologically relevant conditions, in many cases somewhat exotic solution conditions are required for crystallization.

The accuracy of protein structures determined by NMR is very dependent on the extent and quality of data that can be obtained. The highest quality NMR structures have accuracies comparable to 2.0–2.5 Å X-ray crystal structures2. Although atomic positions in high-resolution crystal structures are more precisely determined than in the corresponding NMR structures, the crystallization process may select for a subset of conformers present under solution conditions. For example, while high-quality NMR structures typically exhibit root mean square (r.m.s.) deviations of backbone and heavy atoms (excluding those of surface side chains) of 0.3–0.6 Å and 0.5–0.8 Å, respectively, analysis of a set of high-resolution X-ray crystal structures of bovine pancreatic trypsin inhibitor determined in different crystal forms3 indicates similar variations of 0.2–0.6 Å in backbone atom positions due to preferential selection of distinct low energy conformers in the crystallization process.

NMR has special value in structural genomics efforts for rapidly characterizing the ‘foldedness’ of specific protein or RNA constructs.
The dispersion and lineshapes of resonances measured in 1D 1H-NMR and 2D 15N-1H or 13C-1H correlation spectra provide ‘foldedness’ criteria with which to define constructs and solution conditions that provide folded protein samples (Fig. 1). As the required isotopic enrichment with 15N is relatively inexpensive, and the 2D 15N-1H correlation spectra can be recorded in tens of minutes with conventional NMR systems, it is quite feasible to use such data as a ‘foldedness’ screen in a high throughput sample preparation pipeline. Moreover, there may be correlations between such ‘foldedness’ criteria and crystallizability, so that data from a high throughput NMR screen might directly support efforts to generate samples for crystallographic analysis.

Fig. 1 Comparison of 15N-1H correlation spectra for disordered and well-folded proteins. a, Spectrum of Drosophila melanogaster Par 1 C-terminal domain, a domain construct that is predominantly disordered under the conditions of these measurements (K.G. and G.T.M., unpublished results). b, Spectrum of Thermus thermophilus variant of COG272 protein, a target with well-defined three-dimensional structure in aqueous solution (B. Dixon, S. Anderson, and G.T.M., unpublished results).

1Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers University, Piscataway, New Jersey 08854-5638, USA. 2Department of Chemistry, State University of New York at Buffalo, Buffalo, New York 14260, USA.
Correspondence should be addressed to G.T.M. email: guy@cabm.rutgers.edu
nature structural biology • structural genomics supplement • november 2000

Box 1 Protein structure determination by NMR

The determination of an NMR solution structure may be dissected into six major parts.
(i) At the outset of the NMR study, a suitable sample, usually ∼500 µL of a 1 mM protein solution, is prepared. If the molecular weight of a protein exceeds ∼10 kDa, enrichment with 13C and 15N isotopes is required in order to resolve spectral overlap in 1H-NMR spectroscopy. Due to the availability of high-yield over-expression systems, stable isotope labeling has become routine.
(ii) Subsequently, this sample is used to record a set of multidimensional NMR experiments, typically at temperatures around 30 °C, which provide, after suitable data processing, the NMR spectra.
(iii) These allow determination of (nearly) complete sequential NMR assignments (the measurement of resonance frequencies (chemical shifts) of the NMR-active spins in the protein).
(iv) The resulting conformation-dependent dispersion of the chemical shifts is a prerequisite for deriving experimental constraints from various NMR experiments (such as NOE, scalar coupling, and dipolar coupling data) for the NMR structure calculation. The circular arrows between steps (iv) and (v) indicate that the analysis of structural constraints and the calculation of NMR structures is generally pursued in an iterative fashion.
(v) Iterations involving structure calculations and identification of new constraints are carried out until the overwhelming majority of experimentally derived constraints is in agreement with a bundle of protein conformations representing the NMR solution structure. Conformational variations in the bundle of structures reflect the precision of the NMR structure determination.
(vi) Finally, the NMR structure can be refined using conformational energy force fields, which in essence reflect our current knowledge about conformational preferences of proteins.

Protein backbone chemical shift assignments are obtained at the initial stage of a structure determination (see Box 1), and can often be generated in a fully automated fashion4. These data provide experimental determination of locations of secondary structural elements5,6, which is more reliable than that provided by secondary structure prediction algorithms. This knowledge is tremendously enabling for fold prediction algorithms. Such fold predictions form the basis for functional predictions7 and can also be used for prioritizing targets for further experimental structure analysis.

NMR also provides a powerful tool for downstream characterization of structure-function relationships, a critical component of the process of structure-based functional genomics8. Chemical shift perturbation provides an important tool for validating proposed biochemical functions, screening for small molecule ligands, mapping ligand binding epitopes, and drug development9. Moreover, it is generally appreciated that the thermodynamics and mechanisms of molecular function depend on changes in internal dynamics, which can be characterized using nuclear relaxation measurements10.

Although significant progress has been made in determining resonance assignments and low resolution structures of larger systems11,12, standard methods for high resolution structure analysis by NMR are limited to proteins with molecular weights less than 25–30 kDa. The size distribution of ORFs in some genomes is shown in Fig. 2. Even though many of these ORFs code for oligomeric proteins, proteins that are folded only in the presence of binding partners, or integral membrane proteins, we estimate that at least 25% of yeast ORFs will be suitable for NMR structure determination with current methodologies. In higher eukaryotic genomes, this fraction of small ORFs is somewhat lower. Nonetheless, there are thousands of full-length ORF targets that will be suitable for NMR structure determination.

Fig. 2 Distribution of predicted open reading frame (ORF) lengths in the genomes of Escherichia coli (blue), Saccharomyces cerevisiae (red), Caenorhabditis elegans (yellow), and Drosophila melanogaster (green). Assuming monomeric structures, the length cut-off for routine NMR studies is ∼300 amino acids (dotted vertical line).

Large multidomain proteins are generally not suitable for NMR analysis. However, these can also exhibit interdomain flexibility, which can complicate or prevent crystallization. Fortunately, many of these larger proteins are composed of structural domains13–15, with an average size of ∼175 amino acids. Indeed, much of the structural information available for such larger proteins comes from X-ray and NMR studies of isolated domains. In this regard, both experimental and theoretical methods for parsing large multidomain proteins into autonomously folding domain segments are critical to the general aims of structural genomics.

NMR is particularly valuable in structural genomics for analyzing protein structures that are outside the scope of crystallographic studies. Included in the classes of proteins that do not form crystals suitable for crystallographic analysis are those that are partially unfolded in the absence of binding partners, as well as some membrane-associated proteins that can be studied in micelle environments using solution-state NMR. Solid state NMR methods can also provide structural information for some integral membrane proteins that may not be accessible by crystallographic methods.

NMR spectroscopy is relatively insensitive, which severely limits experimental design. Typically, samples at ∼1 mM protein concentration are required, preventing studies of proteins with very low solubilities. Because of constraints on pulse sequence design arising from these sensitivity limitations, several different NMR spectra recorded over a four to six week period are necessary to obtain the information needed for a high-quality structure determination. These long data collection periods, in turn, put significant constraints on sample stability. Although multiple samples can be used in the structure determination process, each one must be stable for days to weeks with respect to precipitation, aggregation, and other forms of degradation. Manual analysis of these multiple NMR data sets is laborious and requires significant expertise. Another important limitation of NMR analysis is that the density of constraints is sometimes inadequate for accurate structural analysis. In particular, general methods for cross validation analogous to a free R-factor, a statistical measurement used in crystallographic studies to evaluate how well a structural model fits the diffraction data, are not yet available.

Recent technological advances

The reduction of the data collection time required for a structure determination is a major challenge for NMR-based structural genomics. Technological advances enhancing sensitivity, such as the construction of new high-field magnets, are of keen interest. The sensitivity of the acquired NMR data depends critically on the performance of the NMR probe, a sophisticated electronic device used to detect NMR signals. In the near future, the introduction of cryogenic probes is expected to have a significant impact. Radiofrequency (RF) coils constitute the heart of these probes, and their sensitivity is limited by the thermal noise associated with the coil’s temperature. Cryogenic probes utilize RF-coils cooled to ~25 K, and the resulting sensitivity enhancement reduces instrument time requirements by factors that range from 4 to 16.
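The quoted range of time savings can be rationalized with a short calculation. Signal averaging improves the signal-to-noise ratio only as the square root of the number of scans, so at a fixed target S/N a probe with a g-fold higher sensitivity needs roughly g² fewer scans. The sketch below is illustrative only: the function names are hypothetical, and it assumes that instrument time is purely sensitivity-limited (fixed overheads and resolution-limited sampling are ignored).

```python
import math


def time_reduction_factor(sensitivity_gain: float) -> float:
    """Factor by which acquisition time shrinks at constant target S/N.

    Assumes S/N grows as sqrt(number of scans), so a g-fold gain in
    per-scan sensitivity allows g**2 fewer scans.
    """
    if sensitivity_gain <= 0:
        raise ValueError("sensitivity_gain must be positive")
    return sensitivity_gain ** 2


def sensitivity_gain_for(time_factor: float) -> float:
    """Inverse relation: per-scan sensitivity gain implied by a time saving."""
    if time_factor <= 0:
        raise ValueError("time_factor must be positive")
    return math.sqrt(time_factor)


# Sensitivity gains implied by the 4- to 16-fold time savings quoted in the text.
for factor in (4.0, 16.0):
    print(f"{factor:.0f}x less time <-> {sensitivity_gain_for(factor):.0f}x sensitivity")
```

Under this assumption, a 4- to 16-fold reduction in instrument time corresponds to a two- to four-fold gain in probe sensitivity.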
Another key advance involves partial deuteration12, providing samples that can be studied with improved signal-to-noise ratios that result from their sharper linewidths and longer transverse relaxation times. The combination of partial deuteration and cryogenic probes can provide a factor of 10 or more reduction in the requisite data collection times. These technologies provide the basis for high throughput NMR, and are particularly valuable for samples exhibiting limited stabilities and/or low solubilities. A novel spectroscopic concept named TROSY (transverse relaxation optimized spectroscopy), based on selection of slowly relaxing NMR transitions, also can provide significant sensitivity enhancement for large proteins11,16,17 and may become a prerequisite to extend structural genomics by NMR into the 30–50 kDa molecular weight range.

NMR structure determinations rely on the nearly complete assignment of chemical shifts, which are obtained using multidimensional 13C,15N,1H-triple resonance NMR methods (for recent technical reviews see refs 12, 17, and 18). However, a complete set of these experiments often requires far more instrument time than the minimum dictated by signal-to-noise (S/N) requirements. A particular challenge for structural genomics is the development of NMR experiments that allow matching of instrument time investments to the minimum time required for measuring the chemical shift data. For many samples, most of the instrument time is needed not to detect signal, but to ensure appropriate resolution and/or information content of the spectra. In particular, lower bounds for the measurement time of three- and four-dimensional experiments are often determined by digital resolution requirements in the indirect dimensions rather than S/N requirements.
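The sampling-limited lower bound described above can be estimated with simple arithmetic: every combination of indirect-dimension time points must be recorded at least once, whatever the sensitivity. The following sketch is a rough illustration under stated assumptions, not a description of any particular pulse sequence. The function name and all numerical values are hypothetical, and two transients per complex point in each indirect dimension are assumed for quadrature detection.

```python
from typing import Sequence


def min_measurement_time_s(
    complex_points: Sequence[int],
    scans_per_fid: int,
    recycle_s: float,
) -> float:
    """Lower bound on the duration of a multidimensional experiment.

    complex_points: number of complex points in each indirect dimension.
    scans_per_fid:  transients per increment (e.g. the minimal phase cycle).
    recycle_s:      time per transient in seconds (acquisition plus delay).
    """
    n_fids = 1
    for n in complex_points:
        # Assumed: real and imaginary parts are recorded separately,
        # giving two FIDs per complex point in each indirect dimension.
        n_fids *= 2 * n
    return n_fids * scans_per_fid * recycle_s


# Hypothetical 3D experiment: 64 x 32 complex points, 8 scans, 1.2 s per scan.
seconds = min_measurement_time_s([64, 32], scans_per_fid=8, recycle_s=1.2)
print(f"3D lower bound: {seconds / 3600:.1f} h")
```

Because the increments multiply, each additional indirect dimension raises this floor by a large factor even when a single scan per increment would already give adequate S/N.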
Reduced dimensionality experiments19,20, with simultaneous frequency labeling of more than one atom type in indirect dimensions, offer an attractive solution that matches data collection times to signal-to-noise requirements and requires only minimal sets of NMR experiments for resonance assignment20.

Fig. 3 Results of automatic analysis of protein structures from NMR data. Comparison of backbone structures of basic fibroblast growth factor (FGF) determined by a, manual analysis of NMR data (PDB code 1bld), b, automated analysis of the same NMR data using the program AutoStructure (Y.J.H, R. Tejero and G.T.M., unpublished) or c, X-ray crystallography (PDB code 1bas). Only residues 28–152 are shown, as the N-terminal segment is not well-ordered in either the X-ray or solution NMR structure, and a few C-terminal residues are not defined in the X-ray crystal structure. The average root mean square (r.m.s.) deviation of backbone atom positions between the AutoStructure and manually determined NMR structures is 0.6 Å. d, Superposition of 10 NMR structures computed with AutoStructure. The average r.m.s. deviation of core backbone atoms relative to the mean coordinates is 0.3 Å.

Table 1 Web sites related to the use of NMR in structural genomics

Center or Consortium                                                                          URL
BioMagResBank                                                                                 www.bmrb.wisc.edu
Harvard Structural Genomics of Cancer Initiative, USA                                         sbweb.med.harvard.edu/~sgc/
New Jersey Commission on Science and Technology Initiative in Structural Genomics, USA        www-nmr.cabm.rutgers.edu/structuralgenomics
Northeast Structural Genomics Consortium, USA                                                 www.nesg.org
Protein Structure Factory, Germany                                                            userpage.chemie.fu-berlin.de/~psf/
Riken Genome Sciences Center, Tokyo, Japan                                                    www.gsc.riken.go.jp
Toronto Structural Proteomics Project, Canada                                                 nmr.oci.utoronto.ca/arrowsmith/proteomics

Traditional NMR structure determination relies on measurement of nuclear Overhauser effects (NOEs; through-space dipolar interactions between protons) and scalar couplings (through-bond interactions between nuclei mediated by nuclear-electron interactions) for deriving distance and torsion angle constraints, respectively. NOE constraints will continue to be key for high-throughput structure determination, but the arsenal of techniques that have recently been developed to recruit additional experimental parameters for structure refinement will play a valuable role in structural genomics. First, measurement of residual dipolar 1H-15N and 1H-13C couplings in dilute liquid crystalline media (aqueous solutions containing suitable amounts of bicelles21 or filamentous phage22 to help constrain the orientation of the protein under study) offers qualitatively new structural information. Dipolar coupling constraints can establish the spatial relationship of remote segments of a biological macromolecule and can complement sparse NOE networks for obtaining high-quality structures23. A current limitation for use in structural genomics is the need to identify efficiently suitable orienting media in which the protein sample remains soluble. Second, chemical shifts (the NMR resonance frequencies) have long been recognized as a potential source for structural refinement. In particular, 13Cα and 13Cβ shifts offer a robust means to map the secondary structure and to derive backbone dihedral angle constraints at an early stage of the structure determination5,6. They are obtained during the resonance assignment process, and are thus of outstanding value for efficient high-throughput efforts. Third, detection of through-hydrogen bond scalar couplings24 affords valuable unambiguous constraints for characterizing hydrogen-bonded networks, although the small size of these couplings may restrict this to smaller proteins.
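The use of 13Cα and 13Cβ shifts to map secondary structure can be illustrated with a deliberately simplified classifier. It relies on the well-established trend that 13Cα resonances shift downfield and 13Cβ resonances upfield in helices relative to random-coil values, with the opposite pattern in β-strands. This is a minimal sketch and not the algorithm of refs 5 or 6: the function name, the 0.7 p.p.m. threshold, and the alanine-like random-coil values used in the example are assumptions chosen for illustration, and real methods smooth over neighboring residues and use residue-specific reference tables.

```python
from typing import List, Sequence


def classify_secondary_shifts(
    ca_obs: Sequence[float],
    cb_obs: Sequence[float],
    ca_coil: Sequence[float],
    cb_coil: Sequence[float],
    threshold: float = 0.7,  # p.p.m.; illustrative value, not from the source
) -> List[str]:
    """Label each residue 'H' (helix-like), 'E' (strand-like) or 'C' (coil-like).

    Uses the combined secondary shift
        delta = (Ca_obs - Ca_coil) - (Cb_obs - Cb_coil),
    which is positive for helical and negative for extended conformations.
    """
    labels = []
    for ca_o, cb_o, ca_c, cb_c in zip(ca_obs, cb_obs, ca_coil, cb_coil):
        delta = (ca_o - ca_c) - (cb_o - cb_c)
        if delta > threshold:
            labels.append("H")
        elif delta < -threshold:
            labels.append("E")
        else:
            labels.append("C")
    return labels


# Hypothetical shifts (p.p.m.) for three alanine-like residues; the
# random-coil reference values below are approximate and for illustration only.
coil_ca, coil_cb = [52.5] * 3, [19.1] * 3
print(classify_secondary_shifts([55.0, 50.5, 52.6], [18.0, 21.5, 19.0], coil_ca, coil_cb))
```

Combining the two nuclei in a single difference cancels referencing offsets that affect both shifts equally, which is one reason the pair is attractive at an early stage of analysis.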
Automated data analysis

Another important area of development involves automated analysis of NMR data. It has been recognized for some time that many of the interactive tasks carried out by an expert in the process of spectral analysis could, in principle, be carried out more efficiently and rapidly by computational systems. Recent developments provide automated analysis of NMR assignments and three-dimensional structures of proteins ranging from ∼50 to 200 amino acids4,18. When good quality data are available, automated analysis of protein NMR data can be very rapid. Many of the available resonance assignment programs execute in tens of seconds4,18, and automated structure refinements are being carried out in tens of minutes using arrays of processors for coarse-grain parallel calculations (Fig. 3). However, while progress over the last few years is encouraging, more work is required, even for small proteins, before automated structural analysis is routine. In particular, general methods for automated analysis of side chain resonance assignments are not yet well developed, and there are as yet no examples of completely automated protein structure determinations. Moreover, little work has focused on the specific problems associated with nucleic acid structure determinations.

Pilot projects using NMR for structural genomics

In view of these technological advances and the unique opportunities presented by the genomic sequence data, several research groups and consortia have initiated pilot projects using NMR in structural genomics (Table 1). The scales of these efforts range from the effort at Rutgers University in the USA funded by the New Jersey Commission on Science and Technology, which focuses primarily on technology development, to the RIKEN Genome Sciences Center in Japan, which is in the process of installing some twenty high field NMR spectrometers to be used largely for high throughput structural genomics. Also particularly noteworthy is the structural genomics pilot project organized by researchers at University of Toronto in Canada, in which isotope-enriched samples of proteins encoded by the genome of Methanobacterium thermoautotrophicum have been distributed to several NMR groups for parallel data collection and structure analysis, resulting in some dozen three-dimensional structures over the last year.

Conclusions

Protein NMR provides structural and biophysical information that is complementary to X-ray crystallography, and these two methods will play synergistic roles in postgenomic analysis and structural genomics. Indeed, NMR is already playing key roles in several of the established pilot projects. The primary challenges to NMR for high throughput applications are the necessarily long time periods for data collection and the laborious expert reasoning needed for data analysis. Recent advances in probe design, data collection strategies, and software engineering demonstrate the potential for higher throughput data collection and automated structure analysis.

Acknowledgments

We thank S. Anderson for useful discussions. The NMR data for FGF were provided by R. Powers and F. Moy (Wyeth Ayerst Research Laboratories). G.T.M. is supported by grants from the New Jersey Commission on Science and Technology, The National Science Foundation, and the Merck Genome Research Institute. K.C.G. is supported by a Postdoctoral Fellowship Award from the NIH.

Associations with structural genomics

G.T.M. is Director of the New Jersey Commission on Science and Technology Initiative in Structural Genomics and Bioinformatics.

1. Hendrickson, W. Science 254, 51–58 (1995).
2. Billeter, M. Q. Rev. Biophys. 25, 325–377 (1992).
3. Kossiakoff, A.A., Randal, M., Guenot, M. & Eigenbrot, C. Proteins Struct. Funct. Genet. 14, 65–74 (1992).
4. Moseley, H.N.B. & Montelione, G.T. Curr. Opin. Struct. Biol. 9, 635–642 (1999).
5. Wishart, D.S. & Sykes, B.D. J. Biomol. NMR 4, 171–180 (1994).
6. Cornilescu, G., Delaglio, F. & Bax, A. J. Biomol. NMR 13, 289–302 (1999).
7. Fetrow, J.S. & Skolnick, J. J. Mol. Biol. 281, 949–968 (1998).
8. Montelione, G.T. & Anderson, S. Nature Struct. Biol. 11–12 (1999).
9. Shuker, S.B., Hajduk, P.J., Meadows, R.P. & Fesik, S.W. Science 274, 1531–1534 (1996).
10. Palmer, A.G., Williams, J. & McDermott, A. J. Phys. Chem. 100, 13293–13310 (1996).
11. Wüthrich, K. Nature Struct. Biol. 5, 492–495 (1998).
12. Gardner, K.H. & Kay, L.E. Annu. Rev. Biophys. Biomol. Struct. 27, 357–406 (1998).
13. Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. J. Mol. Biol. 247, 536–540 (1995).
14. Holm, L. & Sander, C. Science 273, 595–602 (1996).
15. Orengo, C.A., et al. Structure 5, 1093–1108 (1997).
16. Pervushin, K., Riek, R., Wider, G. & Wüthrich, K. Proc. Natl. Acad. Sci. USA 94, 12366–12371 (1997).
17. Wider, G. & Wüthrich, K. Curr. Opin. Struct. Biol. 9, 594–601 (1999).
18. Montelione, G.T., Rios, C.B., Swapna, G.V.T. & Zimmerman, D.E. In Biological magnetic resonance (eds Krishna, R. & Berliner, L.) 81–130 (Kluwer Academic/Plenum Publishers, New York; 1999).
19. Szyperski, T., Wider, G., Bushweller, J.H. & Wüthrich, K. J. Am. Chem. Soc. 115, 9307–9308 (1993).
20. Szyperski, T., Banecki, B., Braun, D. & Glaser, R.W. J. Biomol. NMR 11, 387–405 (1998).
21. Tjandra, N. & Bax, A. Science 278, 1111–1114 (1997).
22. Hansen, M.R., Mueller, L. & Pardi, A. Nature Struct. Biol. 5, 1065–1074 (1998).
23. Prestegard, J.H. Nature Struct. Biol. 5, 517–522 (1998).
24. Cordier, F. & Grzesiek, S. J. Am. Chem. Soc. 121, 1601–1602 (1999).