Wednesday, September 29, 2010

A Genomics Genealogy

Deep sequencing technologies have radically changed how we study biology. Deciding what technology and software to use can be daunting. Choices become easier when the relationships between different DNA sequencing applications are understood.


A brief history 

DNA sequencing grew from our desire to understand how the instructions for the biochemistry of life are encoded in an organism’s DNA. If we know the precise ordering and organization of an organism’s DNA sequence, we can presumably unlock a code that reveals these instructions. Accomplishing this goal required the creation of a new field, molecular biology, and new technologies to sequence genes.

The first sequencing methods were arduous. They combined nuclease digestion with thin-layer chromatography to measure di- and trinucleotides that could be puzzled together. Later, Maxam and Gilbert replaced enzymatic DNA degradation with a chemical fragmentation method that enabled ordered bases to be read from 32P-labeled fragments separated by electrophoresis.

The Sanger method, which used dideoxynucleotide triphosphates to create ensembles of DNA molecules terminated at each base, soon replaced Maxam-Gilbert sequencing. The next innovation was to color-code DNA with fluorescent dyes so that molecules could be interrogated with a laser and camera coupled to a computer. This innovation automated “high-throughput” DNA sequencing systems, initially with polyacrylamide gels and later with capillary electrophoresis, and made it possible to sequence the human and other genomes. It also created the first transcriptome analysis method, expressed sequence tag (EST) sequencing.

Despite 20 years of advances, however, the high-throughput sequencing methods were still not high-enough-throughput to realistically interrogate DNA and RNA molecules in creative ways. Big questions (genomes, ESTs, meta-genomes) required large, factory-like approaches to automate sample preparation and collect sequences because a fundamental problem had yet to be solved: each sequence was obtained from an individually purified DNA clone or PCR product.

Real high-throughput is massively parallel throughput 

The next-generation DNA sequencing (NGS) technologies free researchers from the need to clone or purify every molecule. They all share the common innovation that DNA sequencing is performed in a massively parallel format. That is, a library, an ensemble of millions of DNA molecules, is sequenced simultaneously. Data collection costs are dramatically decreased through miniaturization and by eliminating the need for warehouses of colony pickers, prep robots, sequencing instruments, and large teams of people.

The new problem is dealing with the data that are produced and the increased computation costs. As NGS opens new possibilities to measure DNA and RNA in novel ways, each application requires a specific laboratory procedure that must be coupled to a specific analysis methodology.

Sequencing genealogy is defined by the questions 

In an evolutionary model, cloning, restriction site mapping, and Sanger sequencing form the trunk of the genomics application tree (top figure), from which branches develop as new applications emerge.

NGS has driven the evolution of three main sequencing branches: De Novo, Functional Genomics, and Variation Assays. The De Novo, or Exploratory, branch contains three subbranches: new genomes (projects that seek to determine the complete genome sequence of an organism), meta-genomes (projects in which DNA fragments are sequenced from environmental samples), and meta-transcriptomes (projects in which cDNA fragments are sequenced from environmental samples).


The Functional Genomics branch is growing fast. In these experiments, different collections of RNA or DNA molecules from an organism, tissue, or cells are isolated and sequenced to measure gene expression and how it is regulated. Three subbranches describe the different kinds of functional genomics: Expression, Regulation, and EpiGenomics. Each of these subbranches can be further divided into specific assay groups (DGE, RNA-Seq, small RNA, etc.) that can be even further subdivided into specialized procedures (RNA-Seq with strandedness preserved, for example) that are defined by laboratory protocols, kits, and instruments. When the experiments are refined and made reproducible, they become assays.

Variation Assays form the third main branch of the tree. Genomic sequences are compared within and between populations to link genotype and phenotype. In special cases, like cancer and immunology research, variation assays are used to observe changes within an organism’s somatic genomes over time. Today, variation, or resequencing, assays measure single nucleotide changes and small insertions and deletions in whole genomes and exomes. If linked sequencing strategies (mate-pairs, paired-ends) are used, larger structural changes, including copy number variations, can also be measured.

Why is this important?

As a software provider with both deep lab and analysis experience, we [Geospiza] are often asked which instrument platform is the best or how our software stacks up against other available options. The answer, of course, depends on what you want to do. De Novo applications benefit from the long reads offered by platforms like 454. Many of the assay-based applications demand ultra-deep sequencing with very high numbers of sequences (reads), as provided by the short-read platforms (Illumina, SOLiD). New single-molecule sequencing platforms like PacBio's are targeting a wide range of applications but have so far been best demonstrated for long-read uses and novel methylation assays.

From an informatics perspective, the exploratory and assay-based branches have distinct software requirements. Exploratory applications require that reads be assembled into contigs, which must be further ordered into scaffolds to get to the complete sequence. In meta-genomics or meta-transcriptomics applications, data are assembled to obtain gene sequences. These projects are further complicated by orthologous and paralogous sequences and by highly expressed genes that overrepresent certain sequences. In these situations, specialized hardware or complex data-reduction strategies are needed to make assembly practical. Once data are assembled, they are functionally annotated in a second computational phase using tools like BLAST.
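To make the assembly idea concrete, here is a toy sketch in Python of the greedy overlap approach: repeatedly merge the pair of reads with the longest exact overlap until only contigs remain. This is for intuition only; real assemblers use far more sophisticated data structures and must cope with sequencing errors, repeats, and millions of reads. The three short reads below are made up for the example.

    def overlap(a, b, min_len=3):
        # Length of the longest suffix of a that matches a prefix of b.
        start = 0
        while True:
            start = a.find(b[:min_len], start)
            if start == -1:
                return 0
            if b.startswith(a[start:]):
                return len(a) - start
            start += 1

    def greedy_assemble(reads):
        # Merge the best-overlapping pair until no overlaps remain.
        reads = list(reads)
        while len(reads) > 1:
            best_len, best_pair = 0, None
            for i, a in enumerate(reads):
                for j, b in enumerate(reads):
                    if i != j:
                        olen = overlap(a, b)
                        if olen > best_len:
                            best_len, best_pair = olen, (i, j)
            if best_pair is None:
                break  # no overlaps left; remaining reads stay separate contigs
            i, j = best_pair
            merged = reads[i] + reads[j][best_len:]
            reads = [r for k, r in enumerate(reads) if k not in (i, j)]
            reads.append(merged)
        return reads

    print(greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG"]))
    # ['ATTAGACCTGCCGGAA']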

Assay-based data analysis also has two distinct phases, but they differ significantly from De Novo sequencing. The first phase involves aligning (or mapping) reads to reference data sources and then reducing the aligned data into quantitative values. At least one reference is required, and the better it is annotated, the more informative the initial results will be. Alignment differs from assembly in that reads are compared separately to a reference rather than amongst themselves. As a result, alignment processing capacity can be easily scaled with multiple inexpensive computers, whereas assembly processing cannot.
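As a sketch of what “reducing the aligned data into quantitative values” means in practice, the Python fragment below tallies mapped reads per reference sequence from a SAM file, the plain-text alignment format produced by short-read aligners. Production pipelines use dedicated tools (samtools, htseq-count, and the like); "example.sam" is a hypothetical input file.

    from collections import Counter

    counts = Counter()
    with open("example.sam") as sam:
        for line in sam:
            if line.startswith("@"):              # skip SAM header lines
                continue
            fields = line.rstrip("\n").split("\t")
            flag, rname = int(fields[1]), fields[2]
            if rname != "*" and not flag & 0x4:   # count mapped reads only
                counts[rname] += 1

    for ref, n in counts.most_common():
        print(ref, n)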

The second phase of assay-based sequencing is either to produce a discrete output, as defined by a diagnostic application, or to compare the quantitative values computed from the alignments across several samples obtained from different individuals and/or treatments relative to controls. This phase requires statistical tools to normalize data, filter false positives and negatives, and measure differences. Assay-based applications become more informative when large numbers of samples and replicates are included in a study.
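A minimal sketch of this comparative phase, with made-up counts: scale each sample to reads per million (RPM) so libraries sequenced to different depths are comparable, then compute a per-gene fold change. A real study would layer replicates, statistical tests, and multiple-testing corrections on top of this.

    control   = {"geneA": 150, "geneB": 900, "geneC": 30}
    treatment = {"geneA": 300, "geneB": 450, "geneC": 33}

    def rpm(counts):
        # Normalize raw read counts to reads per million.
        total = sum(counts.values())
        return {gene: n * 1e6 / total for gene, n in counts.items()}

    ctrl, trt = rpm(control), rpm(treatment)
    for gene in sorted(control):
        fold = trt[gene] / ctrl[gene]
        print(f"{gene}: {ctrl[gene]:.0f} vs {trt[gene]:.0f} RPM, fold change {fold:.2f}")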

Connecting the dots 

While the sequencing applications can be grouped and summarized in different ways, they are also interrelated. For example, De Novo projects are open-ended and exploratory, but their end product, a well-annotated reference sequence, is the foundation for Functional Genomics and Variation applications. Variation analysis is only useful if we can assign function to specific genotypes. Functional assignments come, in part, from previous experiments and genomic annotations, but they are increasingly being produced by sequencing assays, so the new challenge is integrating the data obtained from different assays into coherent datasets that can link many attributes to a set of genotypes.

NGS clearly opens new possibilities for studying and characterizing biological systems. Different applications require different sequencing platforms, laboratory procedures, and software systems that can organize analysis tools and automate data processing. On this last point, as researchers evaluate their projects and their options for success, they should identify informatics groups that have deep experience, available solutions, and strong capabilities to meet the next challenges. Geospiza is one such group.

Further Reading

DNA Sequencing History

Gilbert W, Maxam A (1973) The nucleotide sequence of the lac operator. Proc Natl Acad Sci U S A 70:3581

Maxam AM, Gilbert W (1977) A new method for sequencing DNA. Proc Natl Acad Sci U S A 74:560

Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74:5463-7

Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SB, Hood LE (1986) Fluorescence detection in automated DNA sequence analysis. Nature 321:674-9

Adams MD, Soares MB, Kerlavage AR, Fields C, Venter JC (1993) Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4:373-80

International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860-921

Venter JC, Adams MD, Myers EW, et al. (2001) The sequence of the human genome. Science 291:1304-51

Wednesday, September 22, 2010

Geospiza in the News

Our press release announcing that Eurofins MWG Operon has standardized its worldwide DNA sequencing operations on the GeneSifter LIMS is important for several reasons.

Most significantly, as noted in the press release, we were chosen because of our deep support for both Sanger and Next Generation DNA sequencing (NGS). Our support for Sanger has been developed and improved on since 1997. No other LIMS product provides comparable depth when it comes to understanding how Sanger sequencing (and fragment analysis) services are supported in the laboratory, or how Sanger data should be summarized by group, instrument, or run to make quality control assessments and decisions. Finally, GeneSifter is the only Sanger-based product that allows sequence editing and tracks versions of those edits. This feature is made possible through integration with our popular FinchTV program, which also provides a jumping-off point for analyzing individual sequences with BLAST.

The news also points out that Sanger sequencing continues to play an important role in the DNA sequencing ecosystem. No other sequencing technology is as well suited to running large numbers of small projects. Thus, researcher access to Sanger systems will remain important for confirming clones, sequencing individual PCR products, and validating the results of NGS experiments. As many labs sunset their Sanger equipment in favor of NGS equipment, services such as those provided by Eurofins MWG Operon will continue to grow in importance, and we are thrilled to help.

Of course, the GeneSifter LIMS (GSLE) does much more than Sanger sequencing. Our NGS support is highly regarded, along with our microarray support, as discussed in recent blog posts announcing the current GSLE release and our news about PacBio.

Other product strengths include GSLE's configurability and application programming interfaces (APIs). Supporting worldwide deployment means that non-English speakers within labs need to be able to communicate and use their own words in forms and instructions. Using GSLE's standard web interfaces, groups can internationalize parts of the system with their own language terms and comments to support their work. Finally, using GSLE's APIs, Eurofins MWG Operon has easily incorporated the system into their web site to provide their customers with a smooth experience.

Saturday, September 11, 2010

The Interface Needs an Interface

A recent publication on Galaxy proposes that it is the missing graphical interface for genomics. Let’s find out.

The tag line of Michael Schatz’s article in Genome Biology states, “The Galaxy package empowers regular users to perform rich DNA sequence analysis through a much-needed and user-friendly graphical web interface.” I would take this description and Schatz’s later comment, “the ambitious goal of Galaxy is to empower regular users to carry out their own computational analysis without having to be an expert in computational biology or computer science” to mean that someone, like a biologist, who does not have much bioinformatics or computer experience could use the system to analyze data from a microarray or next gen sequencing experiment.

The Galaxy package is a software framework for running bioinformatics programs and assembling those programs into complex pipelines referred to as workflows. It employs a web interface and ships with a collection of tools, with examples, to get a biologist quickly up and running. Given that Galaxy targets bioinformatics, it is reasonable to assume that its regular users are biologists. So, the appropriate question would be, how much does a biologist have to know about computers to use the system?

To test this question I decided to install the package. I have a Mac, and as a biologist, whether I’m on a Mac or PC, I expect that, if I’m given a download option, the software will be easy to download and can be installed with a double-click installer program. Galaxy does not have this. Instead, it uses a command line tool (oh, I need to use the terminal) that requires Mercurial (hg). Hmm, what’s Mercurial? Mercurial is a version control system that supports distributed development projects. This is not quite what I expected, but I’ll give it a try. I go to the hg site (someone has a chemistry sense of humor) and without too much trouble find a Mac OS X package, which uses a double-click installer program. I’m in luck - of course I’ll ignore the note that I might have to add export LC_ALL=en_US.UTF-8 and export LANG=en_US.UTF-8 to my ~/.profile file - hg installs and works.

Now, back in my terminal, I ignore the python version check and path setup commands and type hg clone http://www.bx.psu.edu/hg/galaxy galaxy_dist; things happen. I follow the rest of the instructions - cd galaxy_dist; sh setup.sh - and finally start Galaxy with the sh run.sh command. I go to my web browser, type http://localhost:8080, and Galaxy is running! Kudos to the Galaxy team for making a typically complicated process relatively simple. I’m also glad that I ran into none of the documented potential problems. However, to get this far, I had to tap into my unix experience.

With Galaxy running, I can now see if Schatz’s claims stand up. What should I do? The left hand menu gives me a huge number of choices. There are 31 categories that organize input/output functions, file manipulation tools, graphing tools, statistical analysis tools, analysis tools, NGS tools, and SNP tools; perhaps 200 choices of things to do. I’ll start with something simple, like displaying the quality values in an Illumina NGS file. To do this, I click on “upload file” under the get data menu. Wow! There are 56 choices of file formats - and 17 have explanations. Fortunately there is an auto-detect option. I leave that option set, go to the choose file button to select an NGS file on my hard drive, and load it in. I’ll ignore the comment that files greater than 2GB should be uploaded by an http/ftp URL, because I don’t know what they are talking about. Instead I’ll make a small test file with a few thousand reads. I’ll also ignore the URL/text box, the choice to convert spaces to tabs, and the genome menu that seems to have hundreds of genomes loaded, as these options have nothing to do with a fastq file. I’ll assume “execute” means “save” and click it.

After clicking execute, some activity appears in the right hand menu indicating that my file is being uploaded. After a few minutes, my NGS file is in the system. To look at quality information, I select the “NGS: QC and manipulation” menu to find a tool. There are 18 options for tools to split files, join files, convert files, and convert quality data in files; this stuff is complicated. Since all I want to do is create some summary statistics, I find and select "FASTQ summary statistics." This opens a page in the main window where I can select the file that I uploaded and click the execute button to generate a big 20-column table that contains one row per base position in the reads. The columns contain information about the frequency of bases and statistical values derived from the quality values in the file. These data are displayed in a text table that is hard to read, so the next step is to view the data graphically in histogram and box plots.
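For the record, the per-base statistics themselves are not deep magic. Here is a rough Python sketch that computes the mean quality at each base position, assuming Sanger-style encoding (Phred score = ASCII value - 33; note that Illumina pipelines of this era actually used an offset of 64) and a hypothetical reads.fastq file:

    sums, counts = [], []
    with open("reads.fastq") as fq:
        for i, line in enumerate(fq):
            if i % 4 == 3:                    # every fourth FASTQ line holds qualities
                for pos, ch in enumerate(line.rstrip("\n")):
                    if pos == len(sums):      # grow lists to the longest read seen
                        sums.append(0)
                        counts.append(0)
                    sums[pos] += ord(ch) - 33
                    counts[pos] += 1

    for pos in range(len(sums)):
        print(f"base {pos + 1}: mean quality {sums[pos] / counts[pos]:.1f}")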

Graphing tools are listed under a different menu, “Graph/Display Data.” I like box plots, so I’ll select that choice. In the main window I select my summary stats file, create a title for the plot, set the plot’s dimensions (in pixels), define x and y axis titles, and select the columns from the big table that contain the appropriate data. I click the execute button to create files containing the graphs. Oops, I get an error message. It says “/bin/sh: gnuplot: command not found.” I have to install gnuplot. To get gnuplot going I have to download the source, compile the package, and install it. To do this I will need developer tools installed, along with gnuplot’s other dependencies for image drawing. This is getting to be more work than I bargained for ...
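For comparison, a per-base quality box plot takes only a few lines of Python with matplotlib - though matplotlib, too, has to be installed first. This sketch reuses the FASTQ parsing idea from above, collecting every score observed at each read position:

    import matplotlib.pyplot as plt

    quals = []                                # quals[pos] = scores seen at that base
    with open("reads.fastq") as fq:
        for i, line in enumerate(fq):
            if i % 4 == 3:
                for pos, ch in enumerate(line.rstrip("\n")):
                    if pos == len(quals):
                        quals.append([])
                    quals[pos].append(ord(ch) - 33)

    plt.boxplot(quals)                        # one box per base position
    plt.xlabel("Base position")
    plt.ylabel("Phred quality score")
    plt.savefig("quality_boxplot.png")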

When Schatz said “regular user” he must have meant a unix-savvy biologist who understands bioinformatics terminology, file formats, and other conventions, and can install software from source code.

Alternatively, I can upload my data into GeneSifter, select the QC analysis pipeline, navigate to the file summary page, and click the view results link. After all, GeneSifter was designed by biologists for biologists.

Thursday, September 2, 2010

Geospiza, SAIC, and PacBio

On Tuesday, August 31, we released news about our collaboration with SAIC, whereby we will enhance GeneSifter's support for the PacBio RS, Pacific Biosciences’ Single Molecule Real Time (SMRT™) sequencing system.

Geospiza is excited to work with the group at SAIC-Frederick. SAIC-Frederick is the operations and technical support contractor for the National Cancer Institute's research, so the effort will ultimately benefit researchers at NCI. Because several of our customers are part of PacBio's early access program, we will be able to get a broad view of the PacBio RS platform's strengths and applications to support our many other customers who are interested in the system.

Most importantly, this news demonstrates our commitment to supporting all of the sequencing platforms. We already have great support for the Illumina GA and HiSeq, SOLiD, and 454. PacBio and Ion Torrent can be supported through our configuration interfaces, and new releases will include specialized features to enhance our customers' experiences working with each instrument's unique qualities.

Look for announcements as new releases of GeneSifter Lab Edition and Analysis Edition roll out over the coming months. In the meantime, check out the full release including a nice quote from Eric Schadt, PacBio's CSO.