Monday, March 24, 2008

Color Space, Flow Space, Sequence Space or Outer Space: Part II, Uncertainty in Next Gen Data

Next generation sequencing will transform sequencing assays and experiments. Understanding how the data are generated is important for interpreting results.

In my last post, I discussed how all measurement systems have uncertainty (error) and how error probability is determined in Sanger sequencing. In this post, I discuss the current common next generation (Next Gen) technologies and their measurement uncertainty.

The Levels of Data - To talk about Next Gen data and errors, it is useful to have a framework for describing these data. In 2005, the late Jim Gray and colleagues published a report addressing scientific data management in the coming decade. The report defines data in three levels (Level 0, 1, and 2). Raw data (Level 0), the immediate product of analytical instruments, must be calibrated and transformed into Level 1 data. Level 1 data are then combined with other data to facilitate analysis (Level 2 datasets). Interesting science happens when we make different combinations of Level 2 data and have references to Level 1 data for verification.

How does our data level framework apply to the DNA sequencing world?

In Sanger sequencing, Level 0 data are the raw analog signals that are collected from the laser and converted to digital information. Digital signals, displayed as waveforms, are mobility corrected and basecalled to produce Level 1 data files that contain information about the collection event, the DNA sequence (read), quality values, and averaged intensity signals. Read sequences are then combined with one another, or with reference data, to produce Level 2 data in the form of aligned sequence datasets. These Level 2 data have many context-specific meanings. They could be a list of annotations, a density plot in a genome browser view, an assembly layout, or a panel of discrepancies that uses quality values (from Level 1) to distinguish true genetic variation from errors (uncertainty) in the data.
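
To make that last idea concrete, here is a small Python sketch of how Level 1 quality values might be used to screen a discrepant alignment position. The function name and thresholds are invented for illustration, not taken from any particular pipeline:

    # Illustrative only: flag a discrepancy as a candidate variant when
    # enough reads support it at high quality. Phred-style quality values
    # encode error probability (Q20 corresponds to a 1% chance of error).
    def looks_like_real_variant(base_calls, min_q=20, min_reads=2):
        """base_calls: (base, quality) tuples from reads that disagree
        with the reference at one aligned position."""
        confident = [b for b, q in base_calls if q >= min_q]
        return len(confident) >= min_reads

    # Two reads call a 'T' at Q30 and one at Q8; the Q8 call carries
    # little weight, but two Q30 observations suggest real variation.
    print(looks_like_real_variant([("T", 30), ("T", 30), ("T", 8)]))  # True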

Next Gen data are different.

When the "levels of data" framework is used to explore Next Gen sequencing technology, we see fundamental differences between Level 0 and Level 1 data . In Sanger sequencing, we perform a reaction to create a mixture of fluorescently tagged molecules that differ in length by a single base. The mixture is then resolved by electrophoresis with continual detection of these size separated tagged DNA molecules to ultimately create a DNA sequence. Uncertainties in the basecalls are related to electrophoresis conditions, compressions due to DNA secondary structure, and the fact that some positions have mixed bases because the sequencing reactions contain a collection of molecules.

In Next Gen sequencing, molecules are no longer separated by size. Sequencing is done "in place" on DNA molecules that were amplified from single molecules. These amplified DNA molecules are either anchored to beads (that are later randomly bound to slides or picoliter wells) or anchored to random locations on slides. Next Gen reactions then involve multiple cycles of chemical reaction (flow) followed by detection. The sample preparation methods and reaction/detection cycles are where the current Next Gen technologies differ most, and they give each platform its unique error profile.

Next Gen Level 0 data change from electropherograms to huge collections of images produced through numerous cycles of base (or oligo) addition and imaging. Unlike Sanger sequencing, where both raw and processed data can be stored relatively easily in single files, Next Gen Level 0 data sets can be quite large; multiple terabytes can be created per run. Presently there is active debate on the virtues of storing Level 0 data (the Data Life Cycle). The Level 0 image collections are analyzed to create Level 1 data, a continuum of information that is ultimately used to produce reads and quality values. Next Gen quality values reflect the underlying sequencing chemistry and primary error modes, so the calculations need to be optimized for each platform. For those of you who are familiar with phred, these are analogous to the settings in the phredpar.dat file. Another feature of Level 1 data is that they can be expressed in different ways, or "spaces," depending on the platform.
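
To give a rough sense of what platform-specific optimization means, here is a sketch of the idea behind a phred-style calibration table. The predictor bins and error rates below are invented; real tables are trained against known sequence for each chemistry:

    import math

    def phred_q(p_error):
        """Standard Phred convention: Q = -10 * log10(P_error)."""
        return -10 * math.log10(p_error)

    # Hypothetical calibration: the fraction of calls in each predictor
    # bin that proved wrong when checked against known sequence. Each
    # platform (454, SOLiD, Solexa) would train its own version.
    calibration = {
        "strong_signal": 0.0001,  # ~Q40
        "medium_signal": 0.01,    # ~Q20
        "weak_signal":   0.25,    # ~Q6
    }

    for predictor, p_error in calibration.items():
        print(predictor, round(phred_q(p_error)))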


The Spaces

Flow Space - The Roche 454 technology is based on pyrosequencing [1], and the measured signals are a function of how many bases are incorporated in a base addition cycle. In pyrosequencing, we measure the pyrophosphate (PPi) that is released when a nucleotide is added to a growing DNA chain. If multiple bases are added in a cycle, two or more A's for example, proportionally more PPi is released and the light detected is more intense. As more identical bases are added in a row, the relative increase in light decreases; an 11/10 ratio, for example, is much smaller than 2/1. Consequently, when there are longer runs of the same base (e.g. AAAAA), it becomes harder to count the number of bases accurately, and the error rate increases. Flow space describes a sequence in terms of base incorporations: the data are represented as an ordered series of flowed bases plus the number of bases incorporated at each flow. The 454 software performs alignment calculations in flow space.
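
To illustrate, here is a little Python sketch of flow-space encoding. It assumes 454's cyclic TACG flow order and simply counts incorporations per flow for a known sequence; the real instrument, of course, reports continuous signal intensities that must be rounded to these counts:

    FLOW_ORDER = "TACG"  # assumed cyclic flow order

    def to_flow_space(seq, flows=FLOW_ORDER):
        """Return (flowed_base, n_incorporated) for each flow until the
        sequence (assumed to contain only A, C, G, T) is consumed."""
        flowgram = []
        i = 0
        while i < len(seq):
            for base in flows:
                count = 0
                while i < len(seq) and seq[i] == base:
                    count += 1
                    i += 1
                flowgram.append((base, count))
                if i >= len(seq):
                    break
        return flowgram

    # The AAAAA homopolymer lands in a single flow, where the basecaller
    # must distinguish a signal of 5 from 4 or 6: the hard case above.
    print(to_flow_space("TTAAAAAGC"))
    # [('T', 2), ('A', 5), ('C', 0), ('G', 1), ('T', 0), ('A', 0), ('C', 1)]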

Color Space - The Applied Biosystems SOLiD technology uses DNA ligation [2] with collections of synthetic DNA molecules (oligos) in which two adjacent known bases are paired with a fluorescent dye. In each cycle these two bases are read at intervals of five bases. Getting full sequence coverage (25, 35, or more bases) requires multiple ligation and priming cycles, such that each base is sequenced twice. The SOLiD technology uses this redundancy to decrease the error probability. The Level 1 sequence is reported in "color space," where the numbers 0, 1, 2, and 3 represent the four fluorescent colors. Color space must be decoded into a DNA sequence at the final stage of data processing. As with flow space, alignments are best performed in color space. Since each color represents two bases, decoding requires that the first base be known; this base comes from the adapter.
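
The encoding has a convenient arithmetic form: if the bases are coded as two-bit numbers (A=0, C=1, G=2, T=3), the color between two adjacent bases behaves like an XOR of their codes. Here is a decoding sketch with a made-up read:

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
    BASE = {v: k for k, v in CODE.items()}

    def decode_colors(first_base, colors):
        """Decode a color-space read given the known first (adapter) base."""
        seq = [first_base]
        for c in colors:
            seq.append(BASE[CODE[seq[-1]] ^ int(c)])
        return "".join(seq)

    print(decode_colors("T", "0120"))  # TTGAA

Note that a single miscalled color changes every base downstream of it when a read is decoded this way, which is one more reason to align in color space and defer decoding to the last step.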

Sequence Space - Illumina's Solexa technology uses single-base extension with fluorescently labeled nucleotides that have protected 3'-OH groups. After base addition and detection, the 3'-OH is deprotected and the cycle is repeated. Error probabilities are calculated in stages: first by analyzing the raw intensity files (remember Firecrest and Bustard), and then by alignment with the ELAND program to compute empirical error probabilities. With Solexa data, errors occur more frequently at the ends of reads.
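
For comparison, here is a sketch of the two quality conventions in play. The early Solexa pipeline documents (see the links under Further reading) describe an odds-based score, Q = -10*log10(p/(1-p)), versus phred's Q = -10*log10(p):

    import math

    def phred_q(p):
        return -10 * math.log10(p)

    def solexa_q(p):
        return -10 * math.log10(p / (1 - p))

    # The two scales agree for confident calls and diverge as the error
    # probability grows, e.g. toward the error-prone ends of Solexa reads.
    for p in (0.001, 0.01, 0.2):
        print(f"p={p}: phred={phred_q(p):.1f}  solexa={solexa_q(p):.1f}")
    # p=0.001: 30.0 vs 30.0; p=0.01: 20.0 vs 20.0; p=0.2: 7.0 vs 6.0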

So, what does this all mean?

Next Gen technologies all differ in the way data are produced, they all produce different kinds of Level 1 data, and the Level 1 data are best analyzed within their own unique "space." This has implications for data integration, depending on how you want to combine data sets (Level 1 vs. Level 2). Clearly, it is important to have easy access to quality values and error rates to help troubleshoot issues with runs and samples. The challenge is sifting the important data out of a morass of sequence files, quality files, data encodings, and other vagaries of the instruments and their software. Geospiza is good at this and we can help.

In terms of science, many assays and experiments, like Tag and Count or whole genome assembly, can use redundancy to overcome random data errors. Data redundancy can also be used to develop rules to validate variations in DNA sequences. But this is the tip of the iceberg. With Next Gen's extremely high sampling rates and molecular resolution, we can begin to think of finding very rare differences in complex mixtures of DNA molecules and using sequencing as a quantitative assay. In these cases, understanding how the data are created and their corresponding uncertainties is a necessary step toward making full use of the data. The good news is that there is some good work being done in this area.
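
A back-of-envelope model shows why redundancy helps. If we assume independent per-read errors and a simple majority-vote consensus (a deliberate simplification; real pipelines weight calls by quality), the chance of a wrong consensus falls off quickly with depth:

    from math import comb

    def consensus_error(p, depth):
        """Upper bound on P(wrong consensus): the probability that a
        majority of `depth` independent reads are wrong at a position,
        given per-read error rate p. Illustrative, not platform specs."""
        k_needed = depth // 2 + 1
        return sum(comb(depth, k) * p**k * (1 - p)**(depth - k)
                   for k in range(k_needed, depth + 1))

    for depth in (1, 3, 5, 9):
        print(depth, f"{consensus_error(0.01, depth):.1e}")
    # With p = 1%, the error falls from 1e-2 at 1x coverage to roughly
    # 1e-5 at 5x and 1e-8 at 9x.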

Remember, the differences we measure must be greater than the combined uncertainty of our measurements.
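
To put numbers on that, independent uncertainties combine in quadrature; the values here are invented for illustration:

    import math

    # Combined standard uncertainty of independent measurements:
    # u_c = sqrt(u1**2 + u2**2). With 1.5% and 2% relative uncertainties,
    # differences much below 2.5% cannot be called with confidence.
    u_sample, u_control = 0.015, 0.02
    print(math.sqrt(u_sample**2 + u_control**2))  # 0.025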

Further reading
1. Ronaghi, M., M. Uhlen, and P. Nyren, "A sequencing method based on real-time pyrophosphate." Science, 1998. 281(5375): p. 363, 365.

2. Shendure, J., G.J. Porreca, N.B. Reppas, et al., "Accurate multiplex polony sequencing of an evolved bacterial genome." Science, 2005. 309(5741): p. 1728-32.

3. "Quality scores and SNP detection in sequencing-by-synthesis systems." http://www.genome.org/cgi/content/abstract/gr.070227.107v2

4. "Scientific data management in the coming decade." http://research.microsoft.com/research/pubs/view.aspx?tr_id=860

5. Solexa quality values. http://rcdev.umassmed.edu/pipeline/Alignment%20Scoring%20Guide%20and%20FAQ.html
http://rcdev.umassmed.edu/pipeline/What%20do%20the%20different%20files%20in%20an%20analysis%20directory%20mean.html
