Fig. 3. Predicted protein products of genes from H. influenzae (1,709 genes), S. cerevisiae (6,241 genes), C. elegans (18,424 genes), and D. melanogaster (13,601 genes). The dark bar segments depict genes coding for unique proteins; the light bar segments depict genes coding for paralogs. (Adapted with permission from Rubin et al. (2000) Science 287: 2204-2215. Copyright 2000, American Association for the Advancement of Science.)

proteins may exist in the proteome. The genomes of several organisms have been completely sequenced, and these data have allowed analysts to predict the products of all of each organism's genes. Moreover, based on the predicted amino acid sequence of each gene product, these proteins have been classified by the domains and sequence motifs they contain. For example, 119 genes in the Saccharomyces cerevisiae genome encode proteins with eukaryotic protein kinase domains, whereas 47 others encode proteins with C2H2-type zinc-finger domains. Comparisons of domain-sequence characteristics with genomic sequences reveal many other protein types encoded in an organism's genome.

Recent analyses of the S. cerevisiae, Caenorhabditis elegans, and Drosophila genomes have revealed very interesting relationships between the size of the genomes and the predicted content of the proteomes for these organisms. Gerald Rubin and colleagues have classified the predicted protein products of the H. influenzae, S. cerevisiae, C. elegans, and Drosophila genomes based on the presence of specific domains (Fig. 3). Comparison of all the predicted protein products indicated the occurrence of proteins whose sequence differed only slightly from others in the genome. Correction for these redundant protein products, termed "paralogs," allowed the calculation of a "core proteome" for each organism. This core proteome represents the basic collection of distinct protein families for an organism.

A look at the core proteomes for these organisms illustrates two interesting aspects of the proteome. First, the relationship between the complexity of an organism and the number of genes in its genome is not simple. Certainly, the yeast has more genes than the bacterium, yet fewer than the worm and the fly. However, the fly (Drosophila melanogaster) is a much more complicated organism than the worm (C. elegans), yet it has fewer genes (13,601 vs 18,424 in the worm) and a smaller core proteome (8065 distinct proteins vs 9543 in the worm). This suggests that biological complexity does not come simply from greater numbers of genes. Instead, more complex regulation of the genes and the functions of the protein products may account for the greater complexity of the fly.

Second, the number of paralogs increases dramatically in the worm and the fly. This reflects the fact that about half of the genes in the worm and the fly are near-duplicates of other genes. These duplicate-containing gene families often appear as gene clusters on the same chromosome.
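The gene counts and core-proteome sizes quoted above allow a rough check of the "about half" figure. The sketch below is only an illustration: it assumes that every predicted gene beyond one representative per distinct protein family counts as a redundant (paralogous) copy, and the names GENOMES and redundant_fraction are hypothetical.

```python
# Gene counts and core-proteome sizes as quoted in the text (worm and fly only).
GENOMES = {
    "C. elegans": {"genes": 18424, "core_proteome": 9543},
    "D. melanogaster": {"genes": 13601, "core_proteome": 8065},
}


def redundant_fraction(genes, core_proteome):
    """Fraction of predicted genes beyond one representative per protein family."""
    return (genes - core_proteome) / genes


for organism, counts in GENOMES.items():
    frac = redundant_fraction(counts["genes"], counts["core_proteome"])
    print(f"{organism}: {frac:.0%} of genes are redundant copies")
# C. elegans: 48% of genes are redundant copies
# D. melanogaster: 41% of genes are redundant copies
```

Under this simple accounting, roughly 40 to 50% of the genes in each organism are near-duplicates, in line with the statement above.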

The recent completion of the human genome sequence has provided evidence that the human genome encodes between 30,000 and 40,000 genes. In view of the tremendous difference in complexity of the human organism compared to the worm, it is indeed surprising that the human genome encodes only about twice as many genes as that of the worm. Reliable estimates of the numbers of unique genes vs paralogs are not yet available. Nevertheless, it is already becoming axiomatic that the complexity of the human organism lies in the diversity of human proteomes, rather than in the size of the human genome.

2.6. Gene Expression, Codon Bias, and Protein Levels

One of the key issues encountered by investigators who study the proteome is how much of a particular protein is expressed in a cell.

Expression levels of proteins vary tremendously, from a few copies to more than a million. It is important to realize in this context that the level of a protein expressed in a cell has little to do with its significance. Essential enzymes of intermediary metabolism or structural proteins often are present at levels in the thousands of copies per cell or more, whereas certain protein kinases involved in cell-cycle regulation are found at only tens of copies per cell. S. cerevisiae contains approx 6000 genes, of which about 4000 are expressed at any given time, based on measurements of mRNA levels.

The level of any protein in a cell at any given time is controlled by 1) the rate of transcription of the gene, 2) the efficiency of translation of mRNA into protein, and 3) the rate of degradation of the protein in the cell. Gene expression certainly can dictate protein levels to a considerable extent. However, a number of studies indicate that gene expression per se does not really correlate that well with protein levels. This finding certainly reflects the influences of the other two factors mentioned earlier. It also is an important reminder of the limitations of gene-expression analyses (such as microarrays).
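One way to see why mRNA level alone can be a poor predictor is a simple steady-state balance. The sketch below assumes, purely for illustration, that protein is synthesized at a rate proportional to mRNA abundance and removed by first-order degradation, so that the steady-state level equals synthesis divided by the degradation rate constant. The function name and all rate values are hypothetical, not measurements.

```python
def steady_state_protein(mrna_copies, k_translation, k_degradation):
    """Steady-state protein copies per cell for
    dP/dt = k_translation * mRNA - k_degradation * P  (assumed model)."""
    if k_degradation <= 0:
        raise ValueError("degradation rate constant must be positive")
    return mrna_copies * k_translation / k_degradation


# Two hypothetical genes with the SAME mRNA level (10 copies per cell)
# but different translation efficiency and protein stability.
stable_protein = steady_state_protein(10, k_translation=20.0, k_degradation=0.5)
labile_protein = steady_state_protein(10, k_translation=2.0, k_degradation=2.0)
print(stable_protein, labile_protein)  # 400.0 10.0
```

In this toy case, identical mRNA levels give a 40-fold difference in protein level, which is the kind of discrepancy that translation efficiency and protein degradation can introduce.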

Many genes are regulated by inducible transcription factors, which are regulated in turn by a wide variety of environmental influences. However, an intrinsic determinant of the level of expression of many genes is a phenomenon referred to as "codon bias." This term describes the tendency of an organism to prefer certain codons over others that code for the same amino acid in the gene sequence. Thus, genes containing codon variants that are less preferred tend to be expressed at a lower level. Calculated codon bias values for yeast genes range from approx -0.2 to 1.0, where a value of 1.0 favors the highest level of gene expression. Most yeast genes display codon bias values of less than 0.25 and are expected to be expressed at relatively low levels.
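The text does not say how the codon bias values were calculated. As a hedged illustration, one common style of index compares the number of preferred codons in a gene with the number expected if synonymous codons were used at random, scaled so that a gene using only preferred codons scores 1.0. The codon table below covers only three amino acids, and the "preferred" codons are assumptions chosen for the example, not yeast data.

```python
# Synonymous codons for three amino acids only (toy subset of the genetic code).
SYNONYMOUS_CODONS = {
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
}
# Hypothetical preferred codon for each amino acid (assumption for illustration).
PREFERRED = {"K": "AAG", "E": "GAA", "G": "GGT"}
CODON_TO_AA = {c: aa for aa, codons in SYNONYMOUS_CODONS.items() for c in codons}


def codon_bias(dna):
    """(preferred - expected_by_chance) / (total - expected_by_chance)."""
    n_total = 0
    n_preferred = 0
    n_random = 0.0  # preferred codons expected under uniform synonymous usage
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3].upper()
        aa = CODON_TO_AA.get(codon)
        if aa is None:
            continue  # codon not covered by the toy table
        n_total += 1
        n_preferred += codon == PREFERRED[aa]
        n_random += 1.0 / len(SYNONYMOUS_CODONS[aa])
    if n_total == 0:
        raise ValueError("no codons from the toy table were found")
    return (n_preferred - n_random) / (n_total - n_random)


print(codon_bias("AAGGAAGGT"))  # only preferred codons -> 1.0
print(codon_bias("AAAGAGGGC"))  # no preferred codons -> negative value
```

An index of this form can fall below zero when a gene uses preferred codons less often than chance would predict, which is consistent with the negative end of the range quoted above.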

Studies in yeast have compared protein levels, mRNA expression, and codon bias for a number of proteins. While there is some disagreement as to the particulars, the following generalizations can be drawn.

• Genes with low codon bias values tend to be expressed at low levels, whether analyzed on the basis of mRNA expression or protein levels.

• mRNA levels correlate poorly (r < 0.4) with protein levels when genes with codon bias values of 0.25 or less (i.e., most genes) are considered. However, the correlation between mRNA levels and protein levels is much higher (r > 0.85) for the most highly expressed genes (i.e., those with codon bias values above 0.5).

• Longer-lived proteins appear to be present in higher abundance than short-lived proteins (i.e., those proteins that are degraded rapidly).

Thus, although gene-expression measurements may indicate changes in protein levels, it is difficult to infer protein expression from gene expression.

2.7. Conclusion and Significance for Analytical Proteomics

The proteome in essentially any organism is a collection of somewhere between 30 and 80% of the possible gene products. Most of these proteins are expressed at relatively low levels (10^1-10^2 per cell), although some are expressed at much higher levels (10^4-10^6 per cell). Regardless of the absolute level of expression of the polypeptide gene products, most proteins exist in multiple posttranslationally modified forms. This situation poses the greatest challenge for proteomic analysis: we must find ways to detect a large number of distinct molecular species, most of which are present at relatively low levels and many of which exist in multiple modified forms. The next section of the book describes the tools we can bring to bear on this daunting analytical problem.

Suggested Reading

Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Birney, E., et al. (2001) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29, 37-40.

Coghlan, A. and Wolfe, K. H. (2000) Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae. Yeast 16, 1131-1145.

Gygi, S. P., Rochon, Y., Franza, B. R., and Aebersold, R. (1999) Correlation between protein and mRNA abundance in yeast. Mol. Cell Biol. 19, 1720-1730.

Rubin, G. M., Yandell, M. D., Wortman, J. R., Gabor Miklos, G. L., Nelson, C. R., et al. (2000) Comparative genomics of the eukaryotes. Science 287, 2204-2215.

Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., et al. (2001) The sequence of the human genome. Science 291, 1304-1351.

II Tools of Proteomics

3 Overview of Analytical Proteomics

Before we consider the elements of analytical proteomics in detail, let's sketch out the basic approach. Analytical protein identification is built around one essential fact: most peptide sequences of approximately six or more amino acids are largely unique in the proteome of an organism. Put another way, a typical six amino acid peptide maps to a single gene product. Thus, if we can obtain the sequence of the peptide or if we can accurately measure its mass, we can identify the protein it came from simply by finding its match in a database of protein sequences (Fig. 1). Of course, some hexapeptides may map to more than one protein, but multiple "hits" typically come from highly conserved regions of related proteins (such as the paralogs discussed in Chapter 2). If one can obtain sequences of several peptides that map to the same gene product, this strengthens the validity of the match. Accordingly, the essence of analytical proteomics is to convert proteins to peptides, obtain sequences of the peptides, and then identify the corresponding proteins from matching sequences in a database.
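The database-matching idea can be sketched as a simple substring search. The protein names and sequences below are invented for illustration (they are not real database entries), and find_proteins is a hypothetical helper, not one of the search tools discussed later in the book. The first line also shows why short peptides can be so specific: there are 20^6 possible hexapeptides.

```python
# Toy "database": invented names and sequences, for illustration only.
PROTEIN_DB = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQ",
    "protB": "MSDNELGKVLAAWRTPQISFVKG",
    "protC": "MGHHWLLPERTAYIAKDD",
}


def find_proteins(peptide, database):
    """Return the names of all proteins whose sequence contains the peptide."""
    return sorted(name for name, seq in database.items() if peptide in seq)


print(20 ** 6)                              # 64000000 possible hexapeptides
print(find_proteins("KQRQIS", PROTEIN_DB))  # ['protA'] (unique hit)
print(find_proteins("QISFVK", PROTEIN_DB))  # ['protA', 'protB'] (shared region)

# Requiring several peptides to map to the same protein strengthens the match.
hits = [set(find_proteins(p, PROTEIN_DB)) for p in ("QISFVK", "TAYIAK")]
print(sorted(set.intersection(*hits)))      # ['protA']
```

The last step mirrors the point made above: a hexapeptide from a conserved region may hit more than one protein, but the intersection of hits from several peptides narrows the identification.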

Figure 1 depicts the essential elements of the analytical proteomics approach. Most analytical proteomics problems begin with a protein mixture. This mixture contains intact proteins of varying molecular weights, modifications, and solubilities. Before peptide sequences can be obtained, the proteins must be cleaved to peptides. This is because the mass spectrometers used to measure peptide masses or obtain peptide sequences cannot perform these measurements directly on intact proteins. Although modern MS instruments can obtain a tremendous amount of data even from relatively complex peptide mixtures, simplification of the mixtures allows data to be collected on the greatest number of components.

From: Introduction to Proteomics: Tools for the New Biology By: D. C. Liebler © Humana Press, Inc., Totowa, NJ

Fig. 1. General flow scheme for proteomic analysis.

Thus, to analyze protein mixtures by MS, the highly complex mixture of many components must be separated into somewhat less complex mixtures containing fewer components. It is possible to separate the intact proteins first and then cleave them into peptides. However, it is also possible to cleave the proteins into peptides first and then separate the peptides prior to analysis. The resolution of proteins and peptides and the cleavage of proteins to peptides are described in Chapters 4 and 5.

The peptides are then analyzed by either of two types of mass spectrometers. The first type, referred to as matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) instruments, is used primarily to measure the masses of peptides. The second type, referred to as electrospray ionization (ESI)-tandem MS instruments, is used to obtain sequence data for peptides. These instruments are described in Chapter 6.
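Because the first kind of measurement is essentially a peptide mass, it helps to see how a mass follows from a sequence. The sketch below sums standard monoisotopic residue masses and adds one water for the free termini. It ignores modifications and charge state, and the function name peptide_mass is hypothetical.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H2O added for the free N- and C-termini


def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    try:
        return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER
    except KeyError as err:
        raise ValueError(f"unknown residue: {err.args[0]}") from None


# Leucine and isoleucine are isobaric, so these two peptides have the same mass
# and cannot be told apart by mass measurement alone.
print(round(peptide_mass("GLK"), 4), round(peptide_mass("GIK"), 4))
```

The isobaric pair illustrates why a mass measurement identifies a peptide only in combination with a sequence database, and why sequence data from tandem MS can be more discriminating.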

The data from the mass spectrometers is then used, with the aid of specialized software, to identify peptides and peptide sequences from databases that match the data from the analyses. This essentially establishes the identity of the proteins in the original mixture. This type of matching is done without directly interpreting peptide sequences from the MS data. The use of these software tools and protein-identification approaches is described in Chapters 7-9.

That's basically it. Analytical proteomics is essentially one assay, in which protein mixtures are converted to peptide mixtures, peptide MS data are obtained, and the corresponding proteins are identified by software-assisted database searching. What makes proteomics so powerful is that this one assay can be applied to many different protein samples generated from a variety of experimental designs. What makes proteomics so versatile is the great variety of "front-end" experiments that can be done to obtain the samples to be analyzed by this one assay. These front-end experiments and their applications are the subject of the third part of this book.

4 Analytical Protein and Peptide Separations

4.1. Overview

This chapter describes the approaches used to prepare protein samples for MS analysis. At this stage of proteomic analysis, we must do two things (Fig. 1). First, we must convert proteins to peptides. This is generally done with proteolytic enzymes. Second, we must separate very complex mixtures of proteins or peptides into somewhat less complex mixtures. This gives the MS instruments a better opportunity to obtain useful data on the components of the mixture. There is no obligatory order for these two steps. We can first separate proteins, then digest them and analyze the peptides. Alternatively, we can first digest a complex mixture of proteins to peptides, and then resolve the peptides. Each approach has advantages and drawbacks, which will be discussed here.

4.2. Complex Protein and Peptide Mixtures

Before we get into the approaches to separation and digestion, let's consider why the problem of complex protein mixtures is an issue. The MS instruments used to obtain data on peptides are capable of extracting a great deal of information from relatively complex mixtures. However, our chances of identifying many peptides in a mixture are increased when the complexity of the mixture is decreased. The problem of complexity and how to deal with it can be likened to the problem of printing a book. Imagine printing all the words in this book on a single page. It could be done, but the resulting page would be essentially black with ink. By dividing the text onto pages, the complexity is reduced. We can read all the words on one page easily. With protein and peptide separations, we take the same approach. We essentially want to feed the peptide mixture into the MS "a page at a time" to maximize the ability of the instrument to read what is there.

Fig. 1. Protein separation and digestion in proteomics analysis.

Before we describe different types of protein and peptide separations, it is worth considering how many different proteins and peptides we may be dealing with in a typical proteomic analysis. Based on the number of known human genes, a typical human cell may contain about 20,000 different expressed proteins. If we assume that they average about 50 kDa and contain average numbers of lysine and arginine residues, then each would yield about 30 tryptic peptides. Thus, one cell's proteins would yield about 600,000 tryptic peptides (20,000 proteins x 30 peptides each). As we will see below, these numbers pose a formidable challenge to even the most efficient multidimensional protein and peptide-separation strategies.
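The estimate above is simply a product: multiplying the two figures quoted in the text (about 20,000 expressed proteins, about 30 tryptic peptides each) gives 600,000. The per-protein peptide count in turn depends on where trypsin cuts. The sketch below applies the commonly stated trypsin rule (cleave after lysine or arginine, but not when the next residue is proline) to an invented sequence. The sequence and the function name tryptic_peptides are hypothetical, and missed cleavages are ignored.

```python
def tryptic_peptides(sequence):
    """In silico digest: cut after K or R unless the next residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide without K/R
    return peptides


# Back-of-the-envelope estimate using the figures quoted in the text.
expressed_proteins = 20_000
peptides_per_protein = 30
print(expressed_proteins * peptides_per_protein)  # 600000

# Invented sequence: the R followed by P is not cleaved.
print(tryptic_peptides("MAGKLLRPEDKSTR"))  # ['MAGK', 'LLRPEDK', 'STR']
```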

4.3. Extracting Proteins from Biological Samples

In any real study, we start with a biological sample: a piece of tissue, a plate of cultured cells, a flask of bacteria, a leaf, and so on. The sample then is usually pulverized, homogenized, sonicated, or otherwise disrupted to yield a soup that contains cells, subcellular components, and other biological debris in an aqueous buffer or suspension. Proteins are extracted from this soup by a number of techniques. For proteomic analysis, the objective here is to recover as much of the protein as possible with as little contamination by other biomaterials (e.g., lipids, cellulose, nucleic acid, etc.) as possible. This is generally done with the aid of:

• Detergents (e.g., SDS, 3-([3-cholamidopropyl] dimethylammonio)-1-propane sulfonate (CHAPS), cholate, Tween), which help to solubilize membrane proteins and aid their separation from lipids

• Reductants (e.g., dithiothreitol [DTT], mercaptoethanol, thiourea), which reduce disulfide bonds or prevent protein oxidation

• Denaturing agents (e.g., urea and acids), which disrupt protein-protein interactions and secondary and tertiary structures by altering solution ionic strength and pH

• Enzymes (e.g., DNAse, RNAse), which digest contaminating nucleic acids, carbohydrates, and lipids.

Investigators in various fields of biology have developed methods to extract proteins from different sample types (e.g., leaves vs cultured cells), and the agents and tricks previously listed are used in different combinations. In some protocols, inhibitors of proteases are commonly used to prevent proteolytic protein degradation. In short, there are many recipes used to extract proteins from biological samples.

One must be aware that some of these agents may interfere with proteomic analysis. For example, phenylmethylsulfonyl fluoride (PMSF), a serine protease inhibitor, is frequently used to prevent protein degradation during tissue processing. However, residual PMSF in some protein samples may inhibit the tryptic digestion needed for proteomic analysis. Likewise, detergents may interfere both with some analytical protein separations and with proteolytic digestions. Thus, careful attention to the "history" of the sample, particularly how it was harvested and processed, is important to the success of the analytical scheme.

4.4. Protein Separations Before Digestion

In this section, we consider analytical protein separations that are done before the proteins are digested. The three principal separation approaches used with intact proteins are 1D- and 2D-SDS-PAGE and preparative isoelectric focusing (IEF). Although these are most widely used, there are alternatives, particularly HPLC (reverse phase (RP), size exclusion, ion exchange, or affinity chromatography). Regardless of the method used, the idea behind separating intact proteins is to take advantage of their diversity in physical properties, especially isoelectric point and molecular weight. The mixture may be separated into a relatively small number of fractions (as in 1D-SDS-PAGE and preparative IEF) or into many fractions (as in the many spots in 2D-SDS-PAGE). The fractions then are taken for proteolytic digestion followed either by further separation of the peptide fragments or direct MS analysis of the peptides.

4.5. One-Dimensional SDS-PAGE

The single most widely used analytical separation in all of protein chemistry is reasonably useful for proteomic analysis. In 1D-SDS-PAGE, the protein sample is dissolved in a loading buffer that usually contains a thiol reductant (mercaptoethanol or DTT) and SDS (Fig. 2). The separation method is based on the binding of SDS to the protein, which imparts negative charge (from the SDS sulfate group) to the protein in roughly constant proportion to molecular weight. When the gel is subjected to high voltage, the protein-SDS complexes migrate through the cross-linked polyacrylamide gel at rates based on their ability to penetrate the pore matrix of the gel. The proteins thus are resolved into bands in order of molecular weight.

One-dimensional SDS-PAGE is done on gels in which the extent of cross-linking (i.e., polymerization of the acrylamide) varies from 5 to 15%, where lower degrees of cross-linking allow easier passage of larger proteins through the gel. One can choose an extent of cross-linking based on expected characteristics of the proteins in the sample. For example, a sample containing low molecular-weight proteins is better resolved on a more highly cross-linked gel. Alternatively, one may choose a gradient gel, where the extent of cross-linking increases from top to bottom of the gel. Gradient gels can provide better resolution of a broad molecular-weight range of proteins.

The degree of resolution achieved by 1D-SDS-PAGE is rather modest and bands that appear to contain a single protein may actually contain multiple molecular species. For example, a gel slice spanning an approx 5 kDa range from a crude cellular extract may contain from dozens to hundreds of different proteins. Even a "purified protein" may contain diverse molecular forms. This is often clearly evident when one compares 1D- and 2D-SDS-PAGE of protein samples. The 1D-SDS-PAGE analysis will often give a single, clean-looking band, whereas 2D-SDS-PAGE of the same sample will resolve the sample into multiple spots along the same molecular-weight band, but with different isoelectric points. This can reflect multiple posttranslational modifications that do not significantly affect SDS binding or migration through the polyacrylamide gel.

Fig. 2. Schematic representation of 1D-SDS-PAGE.

As the goal of the protein separations is to reduce the complexity of the mixture, it might seem from the aforementioned that 1D-SDS-PAGE is of little utility in proteomic analysis. Actually, the utility of this separation approach depends on the complexity of the sample. Most 1D-SDS-PAGE separations distribute proteins over a lane of between 5 and 15 cm in length, which then permits slicing of the gel into 5-50 bands without difficulty. For a highly complex protein mixture, such as a whole-cell extract, each fraction (gel slice) may still contain many different proteins and the degree of simplification of the sample is only modest. However, many samples for proteomic analysis will not be whole-cell extracts or similarly complex mixtures. For example, samples generated in proteomics studies of protein-protein interactions (to be discussed in subsequent chapters) may contain relatively few proteins. Likewise, many biological fluids (e.g., cerebrospinal fluid [CSF], lung-lining fluid) contain a much more limited number of proteins, and a 1D-SDS-PAGE separation may be quite appropriate for pre-resolving these mixtures.

4.6. Two-Dimensional SDS-PAGE

This separation method has become synonymous with proteomics and remains the single best method for resolving highly complex protein mixtures. Two-dimensional SDS-PAGE is actually a combination of two different types of separations. In the first, the proteins are resolved on the basis of isoelectric point by IEF. In the second, focused proteins then are further resolved by electrophoresis on a polyacrylamide gel (Fig. 3). Thus 2D-SDS-PAGE resolves proteins in the first dimension by isoelectric point and in the second dimension by molecular weight.
