Wednesday, September 29, 2010

A Genomics Genealogy

Deep sequencing technologies have radically changed how we study biology. Deciding what technology and software to use can be daunting. Choices become easier when the relationships between different DNA sequencing applications are understood.

A brief history 

DNA sequencing grew from our desire to understand how the instructions for the biochemistry of life are encoded in an organism’s DNA. If we know the precise ordering and organization of an organism’s DNA sequence, we can presumably unlock a code that reveals these instructions. Accomplishing this goal required the creation of a new field, molecular biology, and new technologies to sequence genes.

The first sequencing methods were arduous. They combined nuclease digestion with thin layer chromatography to measure di- and trinucleotides that could be puzzled together. Later, Maxim and Gilbert replaced enzymatic DNA degradation with a chemical fragmentation method that enabled the reading of ordered bases from 32P labeled fragments separated by electrophoresis.

The Sanger method, which used dideoxynucleotide triphosphates to create ensembles of DNA molecules terminated at each base, soon replaced Maxim Gilbert sequencing. The next innovation was to color code DNA with fluorescent dyes so that molecules could be interrogated with a laser and camera coupled to a computer. This innovation automated “high-throughput” DNA sequencing systems, initially with polyacrylamide gels and later with capillary electrophoresis, and made it possible to sequence the human and other genomes. It also created the first transcriptome analysis method, Expressed Tag Sequencing (EST).

Despite 20 years of advances, however, the high-throughput sequencing methods were not high-enough-throughput to realistically interrogate DNA and RNA molecules in creative ways. Big questions (genomes, ESTs, meta-genomes) required large factory-like approaches to automate sample preparation and collect sequences because a fundamental problem had yet to be solved. Specially, each sequence was obtained from an individual purified DNA clone or PCR product.

Real high-throughput is massively parallel throughput 

The next-generation DNA sequencing (NGS) technologies free researchers from the need to clone or purify every molecule. They all share the common innovation that DNA sequencing is performed in a massively parallel format. That is a library, or ensemble of millions of DNA molecules, are simultaneously sequenced. Data collection costs are dramatically decreased through miniaturization and by eliminating the need for warehouses of colony pickers, prep robots, sequencing instruments, and large teams of people.

The new problem is dealing with the data that are produced and increasing computation costs. As NGS opens new possibilities to measure DNA and RNA in novel ways, each application requires a specific laboratory procedure that must be coupled to a specific analysis methodology.

Sequencing genealogy is defined by the questions 

In an evolutionary model, the history of cloning, restriction site mapping, and Sanger sequencing form the trunk of the genomics application tree (top figure) from which branches develop as new applications emerge.

NGS has driven the evolution of three main sequencing branches: De Novo, Functional Genomics, and Variation Assays. The De Novo, or Exploratory, sequencing contains three subbranches that include new genomes (projects that seek to determine a complete genome sequence of an organism), meta-genomes (projects in which DNA fragments are sequenced from environmental samples), or meta-transcriptomes (projects where cDNA fragments are sequenced from environmental samples).

The Functional Genomics branch is growing fast. In these experiments, different collections of RNA or DNA molecules from an organism, tissue, or cells, are isolated and sequenced to measure gene expression and how it is regulated. Three subbranches describe the different kinds of function genomics: Expression, Regulation, and EpiGenomics, and each of these subbranches can be further divided into specific assay groups (DGE, RNA-Seq, small RNA, etc) that can be even further subdivided into specialized procedures (RNA-Seq with strandedness preserved) that are defined by laboratory protocols, kits, and instruments. When the experiments are refined and are made reproducible, they become assays.

Variation Assays form the third main branch of the tree. Genomic sequences are compared within and between populations to link genotype and phenotype. In special cases like cancer and immunology research, variation assays are used to observe changes within an organism’s somatic genomes over time. Today, variation, or resequencing, assays measure nucleotide and small insertions and deletions in whole genomes and exomes. If linked sequence strategies (mate-pairs, paired-ends) are used, larger structural changes including copy number variations can also be measured.

Why is this important?

As a software provider with both deep lab and analysis experience, we [Geospiza] are often asked questions about what instrument platform is the best or how our software stacks up against other available options. The answer, of course, depends on what you want to do. De Novo applications benefit from long reads offered by platforms like 454. Many of the assay-based applications demand ultra-deep sequencing with very high numbers of sequences (reads) as provided by the short-read platforms (Illumina, SOLiD). New single molecule sequencing platforms like PacBio's are targeting a wide rage of applications but have best been demonstrated, thus far, for long-read uses and novel methylation assays.

From an informatics perspective, the exploratory and assay-based branches have distinct software requirements. Exploratory applications require that reads be assembled into contigs that must be further ordered into scaffolds to get to the complete sequence. In meta-genomics or meta-transcriptomics applications, data are assembled to obtain gene sequences. These projects are further complicated by orthologous and paralogous sequences and highly expressed genes that over represent certain sequences. In these situations, specialized hardware or complex data reduction strategies are needed to make assembly practical. Once data are assembled, they are functionally annotated in a second computational phase using tools like BLAST.

Assay-based data analysis also has two distinct phases, but they are significantly different from De Novo sequencing. The first phase involves aligning (or mapping) reads to reference data sources and then reducing the aligned data into quantitative values. At least one reference is required and the better it is annotated the more informative the initial results will be. Alignment differs from assembly in that reads are separately compared to a reference rather than amongst themselves. Alignment processing capacity can be easily scaled with multiple inexpensive computers whereas assembly processing cannot.

The second phase of Assay-based sequencing is to produce a discrete output as defined by a diagnostic application, or compare the quantitative values computed from the alignments from several samples, obtained from different individuals and (or) treatments relative to controls. This phase requires statistical tools to normalize data, filter false positives and negatives, and measure differences. Assay-based applications become more informative when large numbers of samples and replicates are included in a study.

Connecting the dots 

While the sequencing applications can be grouped and summarized in different ways, they are also interrelated. For example, De Novo projects are open-ended and exploratory, but their end product, a well-annotated reference sequence, is the foundation for Functional Genomics and Variation applications. Variation analysis is only useful if we can assign function to specific genotypes. Functional assignments come, in part, from previous experiments and genomic annotations, but are increasingly being produced by sequencing assays, so the new challenge is integrating that data obtained from different assays into coherent datasets that can link many attributes to a set of genotypes.

NGS clearly opens new possibilities for studying and characterizing biological systems. Different applications require different sequencing platforms, laboratory procedures, and software systems that can organize analysis tools and automate data processing. On this last point, as one evaluates their projects and their options for being successful, they need to identify informatics groups that have deep experience, available solutions, and strong capabilities to meet the next challenges. Geospiza is one such group.

Further Reading

DNA Sequencing History

Gilbert W, Maxam A (1973) The nucleotide sequence of the lac operator. Proc Natl Acad Sci U S A 70:3581

Maxam AM, Gilbert W (1977) A new method for sequencing DNA. Proc Natl Acad Sci U S A 74:560

Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74:5463-7

Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SB, Hood LE (1986) Fluorescence detection in automated DNA sequence analysis. Nature 321:674-9

Adams MD, Soares MB, Kerlavage AR, Fields C, Venter JC (1993) Rapid cdna sequencing (expressed sequence tags) from a directionally cloned human infant brain cdna library. Nat Genet 4:373-80

International Human Genome Sequencing Consortium, 2001. “Initial sequencing and analysis of the human genome.” Nature 409, 860-921.
Venter J.C., Adams M.D., Myers E.W., et. al. 2001. “The sequence of the human genome.” Science 291, 1304-1351.


No comments: