
Figure 2.9 Scheme of the hierarchical approach to whole-genome sequencing.

several gaps, which have to be filled in by further sequencing. This is often done by sequencing both ends of a collection of clones (end sequencing) and looking for identity with parts already sequenced. If an end sequence happens to fall into an already sequenced contig that is adjacent to a gap, there is a good chance that the other end of the clone extends into the gap. In this way an attempt is made to fill in all gaps. Gap closure can also be supported by sequence information from other sources, such as existing cDNAs or ESTs. Finally, comparison of the sequence-based physical maps with the corresponding genetic map (if available) can also be of high value. The gene order in the genetic map should correspond to the gene order in the physical map, although the genetic map may locally expand or contract the physical map due to unequal rates of recombination across the chromosome. The final result is a sequence assembly of the entire genome. This usually still requires further editing to remove errors. For example, after publication of the draft sequence of the human genome in early 2001, it took another 3 years before the sequence (and only the euchromatin part of it) was considered to be 99% complete, in October 2004.
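The end-sequencing logic described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the names `Clone`, `is_gap_candidate` and `find_gap_spanning_clones` are hypothetical, the gap is taken to lie beyond the right-hand edge of a single contig, and matching is by exact substring search on one strand only. Real pipelines use alignment tools that tolerate sequencing errors and handle both orientations.

```python
from typing import NamedTuple


class Clone(NamedTuple):
    """A clone represented only by its two end reads (hypothetical structure)."""
    name: str
    left_end: str
    right_end: str


def is_gap_candidate(contig: str, clone: Clone, edge_window: int = 50) -> bool:
    """True if one end read lies near the contig's right edge and the other is absent.

    Such a clone probably extends from the contig into the adjacent gap.
    """
    tail = contig[-edge_window:]  # region of the contig next to the gap
    ends = (clone.left_end, clone.right_end)
    near_edge = [end in tail for end in ends]
    in_contig = [end in contig for end in ends]
    return (near_edge[0] and not in_contig[1]) or (near_edge[1] and not in_contig[0])


def find_gap_spanning_clones(contig: str, clones: list[Clone],
                             edge_window: int = 50) -> list[str]:
    """Names of clones worth sequencing further to close the gap."""
    return [c.name for c in clones if is_gap_candidate(contig, c, edge_window)]


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    contig = "ACGTTGCAAGGCTTAACCGGTTAGC"
    clones = [
        Clone("cloneA", "GGTTAG", "TTTTTT"),  # one end at the edge, other end unknown
        Clone("cloneB", "ACGTTG", "GCAAGG"),  # both ends inside the contig
        Clone("cloneC", "CCCCCC", "AAAAAA"),  # neither end matches
    ]
    print(find_gap_spanning_clones(contig, clones, edge_window=10))
```

Clones selected in this way would then be sequenced further in an attempt to extend the contig into the gap.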

An interesting aspect of the hierarchical approach to genome sequencing is that the work can be distributed among laboratories, each focusing on designated parts of a chromosome or on a certain collection of BACs. Sequencing the yeast genome is the prime example of a project that was mostly completed using the hierarchical approach. In 1989, a group of 35 laboratories embarked on the task of sequencing chromosome III, which was completed in 1992. In the meantime new projects were formulated, which over the years led to collaboration between 92 laboratories, involving 600 committed scientists, until the completion of the sequence was announced in 1996 (Goffeau et al. 1996). Looking back, Dujon (1996) mentioned that two aspects are critical in a genome programme: construction of clone libraries 'upstream' of the sequencing and quality control of the sequence 'downstream' of the sequencing. The average accuracy of the yeast genome at the time was estimated as 99.9%. This seems a high figure, but Dujon (1996) noted that at this accuracy only one-third of all protein-coding genes in the yeast genome can be expected to be completely error-free. With a sequence accuracy of 99.99% the proportion of completely error-free protein-coding genes rises to 85%.
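Why per-base accuracy matters so much can be seen from a minimal sketch, assuming that errors occur independently at each base, so that a gene of n bases is error-free with probability equal to the accuracy raised to the power n. This simple model and the gene lengths used below are illustrative assumptions, not the calculation of Dujon (1996), and the sketch is not expected to reproduce his figures exactly.

```python
def error_free_probability(per_base_accuracy: float, gene_length_bp: int) -> float:
    """Probability that every base of a gene is correct, assuming independent errors."""
    return per_base_accuracy ** gene_length_bp


def error_free_fraction(per_base_accuracy: float, gene_lengths_bp: list[int]) -> float:
    """Expected fraction of error-free genes for a given set of gene lengths."""
    probabilities = [error_free_probability(per_base_accuracy, n)
                     for n in gene_lengths_bp]
    return sum(probabilities) / len(probabilities)


if __name__ == "__main__":
    # Hypothetical gene lengths in base pairs, chosen only for illustration.
    hypothetical_lengths = [500, 1000, 1500, 3000]
    for accuracy in (0.999, 0.9999):
        fraction = error_free_fraction(accuracy, hypothetical_lengths)
        print(f"accuracy {accuracy}: {fraction:.3f} of genes expected error-free")
```

Because the gene length appears in the exponent, long genes are affected far more strongly by a given error rate than short ones.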

2.2.3 Whole-genome shotgun (WGS) sequencing

The principle of WGS sequencing was introduced in 1995, when the genome of the bacterium H. influenzae was published (Fleischmann et al. 1995). The term shotgun evokes the image of a cloud of shot fired at short range to hit the genome more or less at random. The strategy is to skip the ordering of clones and the construction of physical maps and to just sequence clones in random order until it may be assumed that all genomic fragments have been covered at least once (Fig. 2.10). The average number of times that a fragment is sequenced is called the depth of coverage. The aim is to make the likelihood that a segment is not represented at all as small as possible by increasing the mean coverage. It may be assumed that the probability of a base position being sequenced r times, P(r), follows a Poisson distribution, which is given by:

$$P(r) = \frac{\lambda^{r} e^{-\lambda}}{r!}$$

where λ is the mean depth of coverage. When the genome size is G and the sequencing has delivered N bases, λ = N/G. The probability that a base is then still not sequenced is

$$P(0) = e^{-\lambda}$$

With 6-fold coverage, the expected fraction of bases not yet sequenced is 0.00248, or 0.25% of the genome. So, even with a high degree of redundancy, there will always remain gaps in the genome sequence; increasing the sequencing effort helps very little after 5-fold coverage because of the principle of diminishing returns inherent in the exponential function.
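The coverage calculation can be checked numerically with a short sketch, assuming the Poisson model described above. The function names are illustrative, and the genome size and number of bases in the example are hypothetical values chosen to give 6-fold coverage.

```python
import math


def depth_of_coverage(bases_sequenced: float, genome_size: float) -> float:
    """Mean depth of coverage: sequenced bases N divided by genome size G."""
    return bases_sequenced / genome_size


def poisson_coverage(r: int, mean_depth: float) -> float:
    """Probability that a base position is sequenced exactly r times."""
    return mean_depth ** r * math.exp(-mean_depth) / math.factorial(r)


def unsequenced_fraction(mean_depth: float) -> float:
    """Expected fraction of bases never sequenced, i.e. P(0)."""
    return poisson_coverage(0, mean_depth)


if __name__ == "__main__":
    # Hypothetical project: 12 Mb of reads from a 2 Mb genome.
    depth = depth_of_coverage(12_000_000, 2_000_000)
    print(f"mean depth {depth:.1f}, unsequenced fraction {unsequenced_fraction(depth):.5f}")

    # Diminishing returns: each extra fold of coverage removes less in absolute terms.
    for fold in range(1, 11):
        print(f"{fold:2d}-fold: {unsequenced_fraction(fold):.5f}")
```

Under this model, going from 5-fold to 6-fold coverage lowers the expected unsequenced fraction only from about 0.67% to about 0.25% of the genome.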

The theory of WGS sequencing goes back to Lander and Waterman (1988). The principle is that the preparation of genome fragments is essentially random, which is approximated by applying shearing, rather than enzymatic digestion of DNA,

[Figure: chromosomes; assemble to contigs; assemble to scaffolds, map to chromosome]
