Thursday, September 18, 2008

Road Trip: 454 Users Conference

Quiz: What can sequence small genomes in a single run? What can more than double or triple the EST database for any organism?
Answer: The Roche (454) Genome Sequencer FLX™ System.

Last week I had the pleasure of attending the Roche 454 users conference, where the new release (Titanium) of the 454 sequencer was highlighted. This upgrade produces more, longer reads, so that more than 600 million bases can be generated in each run. Compared to previous versions, the FLX Titanium produces about five times more data. The conference was well attended and outstanding, with informative presentations on science, technology, and practical experiences.

In the morning of the first full day, Bill Farmerie, from the University of Florida, presented on how he got into DNA sequencing as a service and how he sees Next Gen sequencing changing the core lab environment. Back in 1998 he set out to establish a genomics service and talked to many groups about what to do. They told him two important things:
  1. "Don't sweat the sequencing part - this is what we are trained for."
  2. "Worry about information management - this we are not trained for."
From there, he discussed how Next Gen got started in his lab, related his experiences over the past three years, and made these points:
  • The first two messages are still true: sequencing gets solved; the problem is informatics.
  • DNA sequencing is expanding, more data are being produced faster at lower costs.
  • This is democratizing genomics - many groups now have access to high throughput technology that provides "genome center" capabilities.
  • The next bioinformatics challenge is enabling the research community, the groups with the sequencing projects, to make use of their data and information. This is not like Sanger; core labs need to deliver results along with the data.
  • The way to approach new problems and increase scale is to relieve bioinformatics staff of the burden of doing routine things so they can focus on developing novel applications.
  • To accomplish the above point, buy what you can and build what you have to.
Other speakers made similar points. The informatics challenge begins in the lab, but quickly becomes a major problem for the end researcher.

Bill has been following his points successfully for many years now. We started working with him on his first genomics service and continue to support his lab with Next Gen. Our relationship with Bill and his group has been a great experience.

Other highlights from the meeting included:

A talk on continuous process improvements in DNA sequencing at the Broad Institute. Danielle Perrin presented work on how the Broad tackles process optimization issues during production to increase throughput, decrease errors, or save costs. From my perspective, this presentation really stressed the importance of coupling laboratory management with data analysis.

Multiple talks on microbial genomics. A strength of the 454 platform is its long reads, making it a platform of choice for sequencing smaller genomes and performing metagenomic surveys. We were also introduced to the RAST (Rapid Annotation using Subsystem Technology) server, an ideal tool for working with your completed genome or metagenome data set.

Many examples of how having millions of reads makes new gene expression and variation analysis discoveries possible when compared to other platforms like microarrays. In these talks, speakers were occasionally asked which is better: long 454 reads or short reads from Illumina or SOLiD? The speakers typically said you need both; they complement each other.

The Woolly Mammoth. Stephan Schuster from Penn State presented his and his colleagues' work on sequencing mammoth DNA and tracing its relatedness over thousands of years. Next Gen is giving us a new "omics": Museomics.

And, of course, our poster demonstrating how FinchLab provides an end-to-end workflow solution for 454 DNA sequencing. In the poster (you have to click the image to get the BIG picture), we highlighted some new features coming out at the end of the month. These include the ability to collect custom data during lab processing, coupling Excel to FinchLab forms, and work on 454 data analysis. Now you will be able to enter the bead counts, agarose images, or whatever else you need to track lab details and make those continuous process improvements. Excel coupling makes data entry through FinchLab forms even easier. The 454 data analysis complements our work with Sanger, SOLiD, and Illumina data to make the FinchLab platform complete for any genomics lab.

Thursday, September 4, 2008

The Ends Justify the DNA

In Next Gen experiments, libraries of DNA fragments are created in different ways, from different samples, and sequenced in a massively parallel format. The preparation of libraries is a key step in these experiments. Understanding and validating the results requires knowing how the libraries were created and where the samples came from.

Background

In the last post, I introduced the concept that nearly all Next Gen sequencing applications are fundamentally quantitative assays that utilize DNA sequences as data points.

In Sanger sequencing, new DNA molecules are synthesized beginning at a single starting point determined by the primer. If the sequencing primer binds to heterogeneous molecules that contain the same binding site, for example, two slightly different viruses in a mixed population, a single Sanger read can represent a mixture of many different molecules in the population, with multiple bases at certain positions. Next Gen sequencing, on the other hand, produces each read from a single individual molecule. This difference between the two methods allows one to simultaneously collect millions of sequence reads from a single sample in a massively parallel format.

An additional benefit of massively parallel sequencing is that it eliminates the need to clone DNA or create numerous PCR products. Although this change reduces the complexity of tracking samples, it increases the need to track experiments in greater detail and to think about how we work with the data, how we analyze the data, and how we validate our observations to generate hypotheses, make discoveries, and identify new kinds of systematic artifacts.

Making Libraries

To better understand the significance of what a Next Gen experiment measures, we need to understand what DNA libraries are and how they are prepared. For this discussion we'll define a DNA library as a random collection of DNA molecules (or fragments) that can be separated and identified.

Before we do any kind of Next Gen experiment, we want to know something about the kinds of results we'd expect to see from our library. To begin, let's consider what we would see from a genomic library consisting of EcoRI restriction fragments. If the digestion is complete, EcoRI will cut DNA between the G and the first A every time it encounters the sequence 5'-GAATTC-3'. Every fragment in this library would have the sequence 5'-AATT-3' at each 5' end. The average length of the fragments will be 4,096 bases (~4 kbp). However, restriction sites are distributed according to Poisson statistics [1], so the actual library will have a few very large fragments (>> 4 kbp) and numerous small fragments.
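To make that expectation concrete, here is a minimal Python sketch, my own illustration rather than anything from FinchLab, that simulates an EcoRI digest of a random sequence. It assumes equal base frequencies, and the sequence length and random seed are arbitrary choices:

```python
import random

# A minimal sketch of why the average EcoRI fragment is ~4 kbp and why the
# length distribution is skewed. Assumes a random sequence with equal base
# frequencies; real genomes deviate from this.

SITE = "GAATTC"
print("Expected average fragment length:", 4 ** len(SITE))   # 4096 bases

random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(1_000_000))

# Record every cut position (EcoRI cuts between the G and the first A).
cuts = []
pos = genome.find(SITE)
while pos != -1:
    cuts.append(pos + 1)            # +1: the cut falls after the G
    pos = genome.find(SITE, pos + 1)

lengths = sorted(b - a for a, b in zip(cuts, cuts[1:]))
print("Fragments:", len(lengths))
print("Median length:", lengths[len(lengths) // 2])   # well below the mean
print("Longest fragment:", lengths[-1])               # far above the mean
```

The median comes out well under 4,096 while the longest fragments run far above it, which is the skew described above.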

You may ask “why is this useful?”

Our EcoRI library example helps us to think about our expectations for Next Gen experimental results. That is, if we collect 10 million reads from a sample, what should we expect to see when we compare our data to reference data? We need to know what kinds of results to expect in order to determine if our data represent discoveries, or artifacts. Artifacts can be introduced during sample preparation, sample tracking, library preparation, or from the data collection instruments. If we can’t distinguish between artifacts and discoveries, the artifacts will slow us down and lead to risky publications.

In the case of our EcoRI digest, we can use our predictions to validate our results. If we collected sequences from the estimated 732,000 fragments (roughly three billion bases of genomic DNA divided by an average fragment length of 4,096) and aligned the resulting data back to a reference genome, we would expect to see blocks of aligned reads at every one of the 732,000 restriction sites. Further, for each site there should be two blocks, one showing matches to the "forward" strand and one showing matches to the "reverse" strand.

We could also validate our data set by identifying the positions of EcoRI restriction sites in our reference data. What we'd likely see is that most things work perfectly. In some cases, however, we would also see alignments, but no evidence of a restriction site. In other cases, we would see a restriction site in the reference genome, but no alignments. These deviations would identify differences between the reference sequence and the sequence of the genome we used for the experiment. Those differences could result either from errors in the reference sequence or from true biological differences. In the latter case, we would examine the bases and confirm the presence of a restriction fragment length polymorphism (RFLP). From this example, we can see how we can define the expected results, and use that prediction to validate our data and determine whether our results correspond to interesting biology or experimental error.
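Here is a small sketch of that cross-check, assuming the alignments have already been reduced to a set of observed block positions; the reference string and the `observed_blocks` set are invented placeholders:

```python
# Compare predicted EcoRI sites in a reference with positions where aligned
# read blocks were actually observed. Sites with no alignments, and alignments
# with no site, are the candidates for reference errors or RFLPs.
# The reference string and observed positions are placeholders.

SITE = "GAATTC"
reference = "TTGAATTCAAACCGGTTGAATTCAAGGAATTCTTACGT"

def find_sites(ref, site=SITE):
    positions, pos = [], ref.find(site)
    while pos != -1:
        positions.append(pos)
        pos = ref.find(site, pos + 1)
    return set(positions)

predicted = find_sites(reference)
observed_blocks = {2, 17, 33}   # placeholder: block positions from alignments

for site in sorted(predicted - observed_blocks):
    print(f"site at {site} but no aligned block: possible RFLP or missing data")
for block in sorted(observed_blocks - predicted):
    print(f"aligned block at {block} but no site in reference: check the reference")
for site in sorted(predicted & observed_blocks):
    print(f"site at {site} confirmed by alignments")
```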

Digital Gene Expression

Of course what we expect to see in the data is a function of the kind of experiment we are trying to do. To illustrate this point I'll compare two different kinds of Next Gen experiments that are both used to measure gene expression: Tag Profiling and RNA-Seq.

In Tag Profiling, mRNA is attached to a bead, converted to cDNA, and digested with restriction enzymes. The single fragments that remain attached to the beads are isolated and ligated to adaptor molecules, each one containing a type II restriction site. The fragments are further digested with the type II restriction enzyme and ligated to a sequencing adaptor to create a library of cDNA ends with 17 unique bases, or tags. Sequencing such a library will, in theory, yield a collection of reads that represents the population of RNA molecules in the starting material. Highly expressed genes will be represented by a larger number of tagged sequences than genes expressed at lower levels.
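As a rough sketch of the counting idea (not the actual library chemistry or the FinchLab pipeline), assume each read is a 17-base tag that can be looked up against the transcript it came from; the transcript sequences and reads below are invented placeholders, and the "tag" here is simply the first 17 bases of each placeholder transcript:

```python
from collections import Counter

# Tag counting sketch: each read contributes one 17-base tag, and the tally
# per transcript approximates relative expression. All sequences are invented.

TAG_LENGTH = 17

transcripts = {
    "geneA": "ATGCGTACGTTAGCATTCCGGAAT",
    "geneB": "TTGACCATGGCATCGATCGTTACG",
}

# Lookup from tag sequence to gene (placeholder: tag = first 17 bases).
tag_to_gene = {seq[:TAG_LENGTH]: gene for gene, seq in transcripts.items()}

reads = [
    "ATGCGTACGTTAGCATT", "ATGCGTACGTTAGCATT", "TTGACCATGGCATCGAT",
    "ATGCGTACGTTAGCATT", "GGGGGGGGGGGGGGGGG",   # last read matches nothing
]

counts = Counter(tag_to_gene.get(read[:TAG_LENGTH], "unmatched") for read in reads)
for gene, n in counts.most_common():
    print(gene, n)   # higher counts suggest higher expression
```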

Both Tag Profiling and RNA-Seq begin with an mRNA purification step, but after that point the procedures differ. Rather than synthesize a single full-length cDNA for every transcript, RNA-Seq uses random six-base primers to initiate cDNA synthesis at many different positions in each RNA molecule. Because these primers represent every combination of six-base sequences, priming with these sequences produces a collection of overlapping cDNA molecules. Starting points for DNA synthesis will be randomly distributed, giving high sequence coverage for each mRNA in the starting material. Like Tag Profiling, genes expressed at high levels will have more sequences present in the data than genes expressed at low levels. Unlike Tag Profiling, any single transcript will produce several cDNAs aligning at different locations.
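A toy comparison of the two start-point distributions makes the difference easy to see; the transcript length, tag site, and read counts below are arbitrary placeholders, and the priming model is deliberately simplified:

```python
import random

# Simplified sketch of the library difference. Tag Profiling yields one
# defined position per transcript; random-hexamer priming (RNA-Seq) starts
# cDNA synthesis at many positions, so reads cover the whole transcript.

random.seed(1)
transcript_length = 2000

# Tag Profiling: every read for this transcript comes from the same tag site.
tag_site = 1850                        # placeholder position near the 3' end
tag_profiling_starts = [tag_site] * 50

# RNA-Seq: cDNA synthesis is primed at random positions along the transcript.
rna_seq_starts = [random.randrange(transcript_length) for _ in range(50)]

print("Distinct start positions (Tag Profiling):", len(set(tag_profiling_starts)))
print("Distinct start positions (RNA-Seq):      ", len(set(rna_seq_starts)))
print("RNA-Seq starts span", min(rna_seq_starts), "to", max(rna_seq_starts))
```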

When the sequence data sets for Tag Profiling and RNA-Seq are compared, we can see how the different methods for preparing the DNA libraries contrast with one another. In this example, Tag Profiling [2] and RNA-Seq [3] data sets were aligned to human mRNA reference sequences (RefSeq, NCBI). The data were processed with Maq [4] and the results displayed in FinchLab. In both cases, relative gene expression can be estimated by the number of sequences that align. If we know the origins of the libraries, the kinds of genes and their expression can give us confidence that the results fit the expression profile we expect. For example, the RNA-Seq data set is from mouse brain, and we see genes at the top of the list that we expect to be expressed in this kind of tissue (last figure below).

The Tag Profiling and RNA-Seq data sets also show striking differences that reflect how the libraries are prepared. In each report, the second column gives information about the distribution of alignments in the reference sequence. For Tag Profiling this is reported as "Tags." The number of Tags corresponds to the number of positions along the reference sequence where the tagged sequences align. In an ideal system, we would expect one tag per molecule of RNA. Next Gen experiments, however, are very sensitive, so we can also see tags from incomplete digests. Additionally, sequencing errors and high mismatch tolerance in the alignments can sometimes place reads incorrectly and give unusually high numbers of tags. When the data are examined more closely, we see that the distribution of alignments follows our expectations: we generally see a high number of reads at one site, with the other tag sites showing a low number of aligned reads.
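Both numbers in these reports can be derived from the same alignment results. Here is a hedged sketch, assuming each alignment has been reduced to a (gene, start position) pair; the records below are invented placeholders, not Maq output:

```python
from collections import defaultdict

# Summarize alignments per gene: total aligned reads approximates relative
# expression, and the number of distinct start positions is the "Tags" count.
# In an ideal complete digest we expect one dominant tag position per gene;
# extra positions point to incomplete digestion or misplaced reads.
# The (gene, start) records are invented placeholders.

alignments = [
    ("geneA", 1850), ("geneA", 1850), ("geneA", 1850), ("geneA", 1850),
    ("geneA", 912),                      # a minor site: incomplete digest?
    ("geneB", 640), ("geneB", 640),
]

reads_per_gene = defaultdict(int)
positions_per_gene = defaultdict(set)
for gene, start in alignments:
    reads_per_gene[gene] += 1
    positions_per_gene[gene].add(start)

for gene in sorted(reads_per_gene):
    print(f"{gene}: {reads_per_gene[gene]} reads, "
          f"{len(positions_per_gene[gene])} tag positions")
```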


For RNA-Seq, on the other hand, we display the second column (Read Map) as an alignment graph. For RNA-Seq data, we expect the number of alignment start points to be very high and randomly distributed throughout the sequence. We can see that this expectation matches our results by examining the thumbnail plots. In the Read Map graphs, the x-axis represents the gene length and the y-axis is the base density. Presently, all graphs have their data plotted on a normalized x-axis, so the length of an mRNA sequence corresponds to the density of data points in the graph: longer genes have points that are closer together. You can also see gaps in the plots; some are internal and many are at the 3' end of the genes. When the alignments are examined more closely, and we incorporate our knowledge of exon structure or polyA addition sites, we can see that many of these gaps point either to potential sites of alternative splicing or to data annotation issues.
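The Read Map idea can be sketched as a histogram of read start positions binned by their relative position along the gene; the gene length and read starts below are invented placeholders, and the plot is reduced to text rather than the graphical thumbnails shown in FinchLab:

```python
# Normalized read-map sketch: bin read start positions by relative position
# along the gene (0.0 = 5' end, 1.0 = 3' end) and draw a crude text histogram.
# Empty bins show up as coverage gaps. All values are invented placeholders.

gene_length = 3000
read_starts = [10, 55, 120, 400, 410, 900, 910, 920, 1500, 1510, 2100, 2110, 2120]
bins = 10

counts = [0] * bins
for start in read_starts:
    counts[min(int(start / gene_length * bins), bins - 1)] += 1

for i, n in enumerate(counts):
    lo, hi = i / bins, (i + 1) / bins
    print(f"{lo:.1f}-{hi:.1f} | {'#' * n}")   # empty rows = coverage gaps
```

In this toy example the bins nearest 1.0 are empty, mimicking the 3'-end gaps described above.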


In summary, Next Gen experiments use DNA sequencing to identify and count molecules from libraries in a massively parallel format. The preparation of the libraries allows us to define expected outcomes for the experiment and choose methods for validating the resulting data. FinchLab makes use of this information to display data in ways that make it easy to quickly observe results from millions of sequence data points. With these high-level views, and links to drill-down reports and external resources, FinchLab provides researchers with the tools needed to determine whether their experiments are on track to create new insights, or whether new approaches are needed to avoid artifacts.

References

[1] The distribution of restriction enzyme sites in Escherichia coli. G A Churchill, D L Daniels, and M S Waterman. Nucleic Acids Res. 1990 February 11; 18(3): 589–597.

[2] Tag Profile dataset was obtained from Illumina.

[3] Mapping and quantifying mammalian transcriptomes by RNA-Seq. A Mortazavi, BA Williams, K McCue, L Schaeffer, B Wold. Nat Methods. 2008 Jul;5(7):621-8. Epub 2008 May 30.
Data available at: http://woldlab.caltech.edu/rnaseq/

[4] Mapping short DNA sequencing reads and calling variants using mapping quality scores. H Li, J Ruan, R Durbin. Genome Res. 2008 Aug 19. [Epub ahead of print]