Reconstruction of functions from environmental genomes

An exciting new development in microbial genomics is the exploration of communities by isolating large fragments of DNA directly from the environment and cloning them into vectors such as BACs, followed by probing, sequencing, or screening for functions (Ball and Trevors 2002; Handelsman 2004; Riesenfeld et al. 2004a; Tiedje and Zhou 2004; Allen and Banfield 2005; DeLong 2005; Schleper et al. 2005). This approach, designated community genomics or metagenomics, allows insight into microbial diversity, and it may lead to the discovery of new genes and functions, novel metabolic pathways, and previously unknown properties of microorganisms that cannot be cultured. Most importantly, metagenomic analysis may lead to reconstruction of functions from genome sequences of organisms that have never been cultured. Comparative analysis of different environments has demonstrated that metagenomes contain habitat-specific signatures that can be used for environmental diagnosis (Tringe et al. 2005). Sometimes genes found in metagenomes may be 'brought to life' in the laboratory by expressing the DNA segment in a suitable host.

A characteristic property of the metagenomics approach is that genes and functions are studied without consideration of the species from which the DNA derives. The metagenome of a habitat thus consists of the collective genomes of all organisms together. Of course, such an approach has its limitations, because a specific cell environment and the joint expression of several genes together in a delimited volume are often crucial to the function. In addition, the genomes of different species will differ in dynamics and responses to environmental change, and so the composition of a metagenome could be highly variable in time (DeLong 2001). Nevertheless, several important discoveries have been made by probing communities in this way, as we will see in the examples discussed below.

Two different approaches may be discerned for screening environmental genome libraries: function-driven screening and sequence-driven screening (Schloss and Handelsman 2003). In the first approach, the aim is to identify clones in the library that express a certain function, often one that has potential applications in medicine, agriculture, or ecology. For example, one may be interested in genes from biosynthetic pathways for antibiotics or genes associated with crucial links in biogeochemical cycles. Usually the frequency of active clones is quite low, so one needs a simple assay by which large numbers of clones in a library can be tested quickly. A clever high-throughput method was developed for this purpose by Uchiyama et al. (2005). The authors made use of the fact that many genes are induced by the substrate that they catabolize. A vector containing the gene for green fluorescent protein, suitable for shotgun cloning, was used, with the effect that host cells with an insert carrying the promoter of the target gene expressed green fluorescent protein in the presence of the target substrate; these cells could then be sorted by an automated cell-sorting system.

Figure 4.14 Comparison of nirS (cytochrome cd1-containing nitrite reductase) gene diversity in two samples taken from sediments of the Choptank river, Maryland, USA. A total of 64 nirS probes (each 70 bp) were spotted on a microarray and the similarity between these sequences is shown as a dendrogram. Probes were developed from environmental nirS sequences isolated earlier (indicated by codes) and from pure cultures of bacteria (indicated by species names). Positive hybridizations are shown in white type on a black background. The diversity of nirS sequences is greater in the upstream location (CR1A) than in the downstream location (HP). After Taroncher-Oldenburg et al. (2003) by permission of the American Society for Microbiology.

In the second type of screening, sequence-driven screening, one uses hybridization probes to detect clones containing a desired known sequence. The probe can be a phylogenetic anchor, such as a 16S rRNA sequence, or a specific functional gene. A variant of sequence-driven screening was proposed by Sebat et al. (2003), who screened a metagenomic library with a microarray. In their study, the microarray consisted of probes from a stable reference community, cloned in a cosmid library. Hybridizations with the metagenomic library were evidence of the presence of certain species. A great advantage of microarray-based screening of libraries is that many probes are used in parallel.

Library screening is usually followed by partially or completely sequencing the clones of interest. Alternatively, the library may be sequenced from the start, without prior screening, if one is interested in all sequences. In this case the library is not prepared with large-insert cloning vectors such as BACs, but with small-insert plasmid vectors that are suitable for direct sequencing. By picking out clones at random and following the WGS philosophy (see Section 2.2), assembly of multiple complete genomes is then attempted. We will discuss examples of all these approaches below.

4.4.1 Marine community genomics

One of the most appealing results of community genomics is the discovery by Beja et al. (2000a) of proteorhodopsin in the ocean. As we have seen above, proteorhodopsin is a retinal-dependent light-driven proton pump, which may support a photoheterotrophic lifestyle of marine bacteria. The widespread presence of this pathway in the carbon cycle was an unexpected outcome of genomics exploration. Crucial to the approach applied by Beja et al. (2000a) was the construction of a large-insert BAC library after preparation of DNA using a special type of electrophoresis, pulsed-field gel electrophoresis (Beja et al. 2000b). With this technique, applied to environmental DNA digests, it was possible to isolate high-molecular-mass DNA fragments up to several hundred kilobase pairs. The library was screened with 16S rRNA probes to survey the taxonomic diversity and to find clones with new species of Archaea and Bacteria. Beja et al. (2000a) decided to sequence a 130 kbp genomic fragment from a clone in which the 16S probe had detected an rRNA sequence of an uncultivated member of marine Gammaproteobacteria, the SAR86 group. Sequencing the rest of the clone revealed an ORF for a rhodopsin-like protein called proteorhodopsin, which showed similarity with rhodopsin genes from extremely halophilic Archaea and the fungus Neurospora crassa (Fig. 4.15).

Rhodopsins act as transmembrane channels that can bind the chromophore retinal (a derivative of vitamin A) to become sensitive to light. Absorption of light energy by the protein-retinal complex leads to a series of conformational shifts, promoting the transport of ions across the cell membrane. In the case of proton transporters, the outside surface of the cell membrane will become charged with protons and the resulting electrochemical membrane potential creates a motive force for another membrane-bound molecule, H+-ATPase, to drive ATP synthesis. Three groups of rhodopsins are present in Archaea, one group acting as chloride pumps (halorhodopsins), another as proton pumps (bacteriorhodopsins), and the third as photosensory receptors (sensory rhodopsins). The last group of molecules is related to the opsin proteins found in eyes throughout the animal kingdom. In N. crassa a related opsin protein acts in the maintenance of circadian rhythmicity.

Figure 4.15 Unrooted phylogenetic tree of the proteorhodopsin sequence of an uncultured marine gammaproteobacterium found by Beja et al. (2000a), aligned with rhodopsins in Archaea and the fungus Neurospora crassa. HR, halorhodopsin, light-driven chloride pumps; BR, bacteriorhodopsin, light-driven proton pumps; SR, sensory rhodopsin; Halsod, Halorubrum sodomense; Halhal, Halobacterium salinarum; Halval, Haloarcula vallismortis; Natpha, Natronomonas pharaonis; Halsp, Halobacterium sp. (all Archaea); Neucra, N. crassa (Ascomycota). The scale bar (0.10) indicates the proportion of amino acid difference. Reprinted with permission from Beja et al. (2000a). Copyright 2000 AAAS.

That the sequence found in the marine BAC clone represented a functional protein and not a pseudogene of some sort was proven by recombinant expression. E. coli cells transformed with the rhodopsin sequence expressed the protein, and it was shown that a combination of retinal and yellow light triggered cross-membrane proton transport in E. coli cell suspensions (Fig. 4.16). Subsequent research using membrane preparations collected directly from seawater and exposed to laser-flash photolysis demonstrated that similar photoactive molecules were very common in the environment (Beja et al. 2001).

The widespread occurrence of proteorhodopsins in the sea was confirmed by Venter et al. (2004). As mentioned in Section 4.2, these authors applied the WGS sequencing approach to microbial communities collected from the Sargasso Sea off the coast of Bermuda. The intention was to collect representative sequences from many diverse organisms simultaneously. Whereas the WGS approach is normally used to assemble a genome sequence for an individual species, the community WGS approach aimed to reconstruct as many genomes as possible from the mixture of genomes of varying abundance. In total 1.36 Gbp of microbial DNA sequence was generated, from at least 1800 genomic species, including 148 previously unknown bacterial phylotypes and 1.2 million previously unknown genes. Some distinct groups of sequence scaffolds could be distinguished, one clearly related to a Burkholderia species, others to Shewanella, Prochlorococcus, and a SAR86 gammaproteobacterium. The presence of Burkholderia, a nutritionally versatile genus of Betaproteobacteria, was unexpected, because this genus was considered typical for terrestrial environments. Similarly, Shewanella is an abundant genus in aquatic, nutrient-rich environments. The presence of these organisms in the open ocean shows that they have a wider ecological amplitude than thought previously, or that there are nutrient-rich microhabitats (possibly associated with marine animals or anthropogenic waste) in which they may survive.

Figure 4.16 Diagram showing the pH change in a medium with cell suspensions of E. coli expressing a proteorhodopsin. In the presence of both the protein and retinal, an outward transport of protons occurs when the cells are exposed to yellow light (>485 nm, indicated by On/Off), leading to a decrease in the pH of the medium. Reprinted with permission from Beja et al. (2000a). Copyright 2000 AAAS.

In the metagenome of the Sargasso Sea Venter et al. (2004) found 782 different rhodopsin-like genes, which were classified into 13 distinct subfamilies. Four of these families consisted of the archaeal, fungal, and sensory rhodopsins mentioned above, but nine families were related to sequences from uncultured species, including seven only known from the Sargasso Sea samples. Analysis of scaffolds containing both a taxonomic marker (ct subunit of RNA polymerase) and a rhodopsin gene demonstrated that rhodopsins are not limited to the Gammaproteobacteria in which Beja et al. (2000a) had first discovered them. For example, in one scaffold a rhodopsin was found together with a ct subunit RNA polymerase from the phylum Flavobacteria.

Bioinformatic analysis of the massive sequence data from the Sargasso Sea is by no means exhausted. It is expected that many more functional aspects of marine bacterial communities can be recovered from these data, while still more sequence information is expected to come from the Sorcerer II expedition. Table 4.7 provides an overview of some functional aspects of the Sargasso Sea sequence data. As an example, consider the presence of an ammonia monooxygenase gene sequence associated with an archaeal taxonomic marker, which indicates scope for archaeal nitrification in this environment. Previously, marine biologists had argued that nitrification in the ocean was hardly possible due to the sensitivity of chemoautotrophic bacteria to high levels of UV irradiation. Nitrification by Archaea would not be inhibited by UV light and this activity would be in accordance with the relatively high nitrite concentrations that are seen along with nitrate at certain times of the year in the Sargasso Sea.

In a commentary on the paper by Venter et al. (2004), Falkowski and De Vargas (2004) remarked that the massive sequencing approach is reaching its limits when applied to community genomes. For example, despite the huge sequencing effort, only two nearly complete genomes (those of Burkholderia and Shewanella) could be reconstructed, and this could only be achieved by using already existing databases as a reference to support the assembly. The major part of the community is represented by rare organisms, and to obtain 95% coverage of these, a sequencing depth more than an order of magnitude greater would be needed. Furthermore, if the approach were extended from prokaryote to eukaryote DNA the project would become much more problematic, because some dominant eukaryotes in seawater plankton (dinoflagellates, coccolithophorids) have extremely large genomes. Despite these obvious limitations, further WGS sequencing of marine communities will undoubtedly lead to new surprises and possibly new insights in oceanic functions, even though complete reconstruction of communities may remain beyond reach.

Table 4.7 List of some remarkable functional insights reconstructed from WGS sequencing of Sargasso Sea microbial DNA by Venter et al. (2004)

Genomic property | Functional relevance

782 different rhodopsin genes belonging to 13 protein families | Rhodopsin-mediated phototrophy is very common in oceanic bacterial plankton

Rhodopsin gene in scaffold bearing a taxonomic anchor from the Flavobacteria/Cytophaga group | Rhodopsin-mediated phototrophy distributed well outside the Proteobacteria

Ammonia monooxygenase in archaeal-associated scaffold | Oceanic nitrification not limited to the Bacteria; there are nitrifiers among the Archaea, and ammonia oxidation is not inhibited by UV light

Genes encoding phosphonate and high-affinity phosphate transporters; many genes responsible for utilization of pyrophosphates and polyphosphates | Versatile use of phosphorus compounds in oceanic environment to deal with severe phosphorus limitation

Gene homologous to umuCD DNA damage-induced DNA polymerase of E. coli found on plasmid | Resistance against UV damage by allowing DNA replication even when damaged by UV irradiation

Genes for arsenate, mercury, copper, and cadmium resistance found on plasmids | Possible role of oceanic microorganisms in trace-metal cycling in an oligotrophic environment

At least 50 bacteriophage gene groupings in scaffolds and 150 in singletons | High diversity of phages in oceanic bacterial community; significant fraction of bacteria infected

Another marine metagenomic study focused on viral communities (Breitbart et al. 2002). Viruses represent a very important factor in biogeochemical cycles and microbial biodiversity; by means of transduction they interfere with the genomes of their hosts, which for the marine ecosystem are mostly bacteria and algae. Viral activities are also considered important drivers of microbial community diversity by 'killing the winner' and promoting growth conditions of species with low abundance (Weinbauer and Rassoulzadegan 2004). Obtaining an overview of the biodiversity of viruses is difficult, because these organisms do not possess universal taxonomic anchors like the 16S rRNA gene in prokaryotes. The DNA polymerase gene pol can be used as a taxonomic marker for a subset of viruses (Short and Suttle 2002). New taxonomic systems are being developed that use all the genes in a viral genome to determine distances between species. This is elaborated in the Phage Proteomic Tree, a database and taxonomic algorithm for classifying bacteriophages (Edwards and Rohwer 2005). However, not many viral genomes have actually been sequenced completely (Paul et al. 2002). Another issue is that viruses cannot be cultured outside their hosts, which in the case of marine bacteria are themselves mostly uncultured. So, a metagenomic approach to surveying viral biodiversity seems very appropriate.

In the study of Breitbart et al. (2002) free-living viruses were collected by differential filtration and density-gradient centrifugation from surface sea water at two sites, Scripps Pier and Mission Bay, along the coast of California, USA. Special precautions must be taken when cloning viral DNA in a bacterial host, due to the presence of modified nucleotides and genes that could lyse the host. A total of 1934 sequences was obtained in a WGS approach, of which 70% showed no significant hits on sequences reported previously in GenBank. Among the remaining sequences no more than 34% were annotated in GenBank as viral sequences, and the rest were sequences of Archaea, Bacteria, and Eucarya, as well as mobile elements and repeat sequences. It appears that viral genomes carry a significant amount of DNA that originates from their hosts. About 83% of the viral sequences were related to bacteriophages, and these were classified further over the major groups of phages (Fig. 4.17). The viral community seemed to differ between the two sampling sites. Viral genomes at Scripps Pier were more 'bacterial' in origin, whereas viral genomes of Mission Bay had a more eukaryotic signature. Among the phage types, the Siphoviridae (λ-type phages) were more dominant at Mission Bay than at Scripps Pier.
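The percentages quoted in this paragraph can be checked against the pooled counts given in the caption of Fig. 4.17. The following is only a small arithmetic sketch in Python; the variable names and the helper `percent` are illustrative and do not come from the original study.

```python
# Pooled counts for both sites, as stated in the caption of Fig. 4.17.
total_sequences = 1934   # sequences compared with GenBank
significant_hits = 582   # sequences that produced a significant hit
viral_hits = 200         # hits annotated as viral
phage_hits = 166         # viral hits related to bacteriophages


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


pct_no_hit = percent(total_sequences - significant_hits, total_sequences)
pct_viral = percent(viral_hits, significant_hits)
pct_phage = percent(phage_hits, viral_hits)

print(f"No significant hit in GenBank: {pct_no_hit:.1f}%")
print(f"Viral among significant hits:  {pct_viral:.1f}%")
print(f"Phage-related among viral:     {pct_phage:.1f}%")
```

Rounded to whole numbers, these values agree with the 70%, 34%, and 83% mentioned in the text.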

Why the viral community should differ between two sampling stations and whether there is any ecological relevance in such differences remains uncertain. One possibility is that viral community composition reflects eukaryotic versus prokaryotic dominance of the plankton, for example due to algal blooms of variable composition. Given that the majority of the sequences from the two marine communities remain uncharacterized, any conclusions in this direction stand on shaky ground. As with the Sargasso Sea study discussed above, the community analysis of marine genomes is still in the exploratory stage.

[Figure 4.17: pie charts for the Scripps Pier and Mission Bay samples, with panels (a) hits to GenBank, (b) biological groups, and (c) phage types.]

Figure 4.17 Overview of the content of viral genomes recovered from two marine coastal sampling stations, Scripps Pier and Mission Bay in California, USA. (a) A total of 1934 sequences were BLASTed to GenBank and 582 sequences produced a significant hit. (b) These sequences were classified according to sequence annotation, and 200 sequences were truly viral. (c) Among the 200 viral sequences 166 were from bacteriophages and these were classified according to the main phage families: Sipho, Siphoviridae (λ-like); Podo, Podoviridae (T7-like); Myo, Myoviridae (T4-like); Micro, Microviridae (φX174-like). From Breitbart et al. (2002), by permission of the National Academy of Sciences of the United States of America.

4.4.2 The soil metagenome

Soil organisms have been most valuable sources of all kinds of natural products ever since the Scottish bacteriologist Alexander Fleming discovered in 1928 that the soil fungus Penicillium produced a substance that killed Staphylococcus bacteria. Many other products derived from microbial secondary metabolites have been used to develop antibiotics, anticancer drugs, fungicides, immunosuppressive agents, enzyme inhibitors, antiparasitic agents, herbicides, insecticides, and growth promoters. Over the years, most of the microorganisms that can be cultured in the laboratory have been examined thoroughly for the production of compounds with biological activity, and biotechnological investigators have gained the impression that the limits of what these organisms can yield in terms of valuable products have been reached. However, as we have seen above, any environment, and certainly the soil, holds a great diversity of uncultured microorganisms that remain to be investigated. With the advent of metagenomic recombinant DNA technology (Handelsman et al. 2002) it became technically feasible to screen soil microorganisms for new functionalities without culturing them. This opportunity has raised great expectations and renewed interest in gene mining. The soil has been likened to Lady Bountiful (Rondon et al. 1999) and is considered a rich source for the discovery of novel natural products (Lorenz and Schleper 2002; Cowan et al. 2004; Daniel 2004, 2005).

How high is the probability of finding a novel product by functional screening of a metagenomic library? Gabor et al. (2004) explored this question in a theoretical way by analysing existing genome sequences of microorganisms. Assuming a random approach to expression cloning, it was argued that the probability of isolating an expressed gene in a metagenomic library depends on the mechanism by which that gene is expressed. The minimal requirements for gene expression in a host include the presence of a promoter for transcription and a ribosome-binding site for initiation of translation. Both of these sites must be recognized by the expression machinery of the host. If expression involves trans-acting factors from the host (for example special transcription factors or inducers), or if modifying enzymes are necessary for the gene product to become functional, the situation becomes much more complicated. Calculations were made for three modes of expression to estimate the number of clones that would have to be screened before a target gene was recovered with a probability of greater than 90%. For the simplest case, independent expression, the expected number of clones was found to depend on the size of the insert, decreasing to around 3000 as the insert size increased to 100 kbp. It was also estimated that 40% of the genes can be found in this way. So, if a metagenomic screening effort covers a library of several thousand clones, each with an insert of 100 kbp, there is a fair chance that a designated gene will be found.
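To give a feel for how insert size and the desired probability of recovery interact, the sketch below uses the classical Clarke-Carbon library formula, N = ln(1 - P) / ln(1 - f), where f is the fraction of the target DNA carried by one insert. This is a deliberately simpler calculation than the model of Gabor et al. (2004): it only asks whether a sequence is present in the library, ignores whether it is expressed, and will therefore not reproduce the figure of about 3000 clones quoted above. The function name and the 4 Mbp genome size are assumptions chosen purely for illustration.

```python
import math


def clones_needed(probability: float, insert_bp: int, target_bp: int) -> int:
    """Clarke-Carbon estimate of the number of random clones needed so that
    a given sequence is present in the library with the stated probability.

    This is a presence-only estimate; it says nothing about expression.
    """
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    if not 0 < insert_bp < target_bp:
        raise ValueError("insert must be smaller than the target DNA")
    fraction = insert_bp / target_bp  # share of the target in one clone
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - fraction))


# Hypothetical example: one 4 Mbp genome, 90% probability of recovery.
for insert_kbp in (10, 50, 100):
    n = clones_needed(0.90, insert_kbp * 1_000, 4_000_000)
    print(f"insert {insert_kbp:>3} kbp: {n} clones")
```

Under these assumed values a library of 100 kbp inserts needs about 91 clones to cover a single 4 Mbp genome with 90% probability; the number grows roughly in proportion to the total amount of DNA in the community and falls as the insert size increases, which is the qualitative trend described in the text.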

A pioneering study in soil metagenomics was the work of Rondon et al. (2000). These authors developed BAC libraries with DNA isolated from agricultural soil in Wisconsin, USA. The largest library had 24 576 clones with an average insert size of 44.5 kbp, whereas 10% of the clones had an insert size of between 70 and 80 kbp. It was estimated that this library contained 1000 Mbp of DNA; given an average density in microbial genomes of one gene per 1000 bp, about 1 million genes were expected to be present in the library. Screening of the library for biological activities employed a variety of strategies. For example, to find clones expressing the enzyme cellulase, plates with the host cells were overlaid with agar containing brilliant red hydroxyethyl cellulose, and a yellow halo around the colony was taken as an indicator of cellulase activity. In this way, a great variety of enzymatic activities were screened, including β-lactamase, keratinase, chitinase, and amylase.
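The library-size estimate follows directly from the clone count and the average insert size. The short Python sketch below simply repeats that arithmetic with the numbers given in the text; the variable names are illustrative.

```python
clones = 24_576           # clones in the largest library
mean_insert_bp = 44_500   # average insert size of 44.5 kbp
bp_per_gene = 1_000       # assumed density: one gene per 1000 bp

total_bp = clones * mean_insert_bp
total_mbp = total_bp / 1_000_000
expected_genes = total_bp // bp_per_gene

print(f"Total DNA in library: {total_mbp:.0f} Mbp")
print(f"Expected genes:       {expected_genes:,}")
```

The product comes to roughly 1094 Mbp and 1.09 million genes, which the text rounds to 1000 Mbp and about 1 million genes.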

A remarkable discovery coming from the metagenomic library screening conducted by Rondon et al. (2000) concerned a clone that had antibacterial activity against Bacillus subtilis and Staphylococcus aureus, but not against E. coli. The clone in which this activity was found was sequenced completely and it appeared to contain 29 ORFs, including a cluster of eight genes associated with phosphate transport. This showed that it is possible for BAC clones to contain complete, intact operons. Fourteen of the 29 ORFs could not be assigned a function. Using transposon mutagenesis, the genes were mutated to see which one was responsible for the antibacterial properties of the clone. Finally a single candidate gene was identified, of which the predicted amino acid sequence had several repeat units (Fig. 4.18). The authors also considered the hydrophobicity profile of this molecule. This is a plot of scores along the sequence, in which each amino acid is given a number indicating the degree of hydrophobicity (see Lesk 2002). The profile of the unknown protein showed seven peaks, which is indicative of a membrane pump (amino acids anchored in the membrane have a high score for hydrophobicity if they are to be embedded stably in the lipophilic environment of the membrane). Recombinant expression of the protein confirmed
