## Monday, March 17, 2008

### Color Space, Flow Space, Sequence Space, or Outer Space: Part I. Uncertainty in DNA Sequencing

Next generation DNA sequencing introduces new concepts like color space, flow space, and sequence space. You might ask, what's a space? How do I deal with these spaces? Why are they important?

In this two part blog, I will first talk about error analysis in DNA sequencing. Next I will talk about how we might think about error analysis in next generation sequencing.

Last week I came across a story about an MIT physics professor, Walter Lewin, who captivates his student audiences with his lectures and creative demonstrations. MIT and iTunes have 100 of his lectures on line. I checked out the first one - your basic first college physics lecture that focuses on measurement and dimensional analysis - and agree, Lewin is captivating. I watched the entire lecture, and it made me think about DNA sequencing.

In the lecture, Lewin, proves "physics works!" and how his grandmother was right when she said that you are inch taller when laying down than when standing up. He used a student subject and measured his length laying down and standing up. Sure enough, the student was an inch longer laying down. But that was not the point. The point was - Lewin proved his grandmother was right because the change in the student's length was greater than the uncertainty of his measuring device (the ruler). Every measurement we make has uncertainty, or error, and for a comparison to be valid the difference in measures have to be greater than their combined uncertainties.

What does this have to do with DNA sequencing?

Each time we collect DNA sequence data we are making many measurements. That is, we are determining the bases of a DNA sample template in an in vitro replication process that allows us to "read" each base of the sequence. The measurements we collect, the string of DNA bases, therefore have uncertainty. We call this uncertainty in base measurement the error probability. In Sanger sequencing, Phil Green and Brent Ewing developed the Phred basecalling algorithm to measure per base error probabilities.

Error probabilities are small numbers (1/100, 1/10,000, 1/1,000,000). Rather than work with small fractions and decimal values with many leading zeros, we express error probabilities as positive whole integers, called quality values (QVs), by applying a transformation:

QV = -l0*log(P), where P is the error probability.

With this transformation our 1/100, 1/10,000, and 1/1,000,000 error probabilities become QVs of 20, 40, and 60, respectively.

The Phred basecalling algorithm has had a significant impact on DNA sequencing because it demonstrated that we could systematically measure the uncertainty of each base determination in a DNA sequence. Over the past 10 years, Phred quality values have been calibrated through many resequencing projects and are thus statistically powerful. An issue with Phred, and any basecaller, however is that it must be calibrated for different electrophoresis instruments (measurement devices) and that is why different errors and error rates can be observed with different combinations of basecallers and instruments.

Sequencing redundancy also reduces error probabilities

The gold standard in DNA sequencing is to sequence both strands of a DNA molecule. This is for good reason. Each stand represents an independent measurement. If our measurements agree, they can be better trusted, and if they disagree one needs to look more closely at the underlying data, or remeasure. This concept was also incorporated into Green's assembly program Phrap (unpublished).

Within the high throughput genomics community it is well understood that increasing the redundancy of data collection reduces error. In theory, one can automate the interpretation of DNA sequencing experiments, or assays, by collecting data at sufficient redundancy. The converse is also true, and I see people work the hardest with manually reviewing data when they do not collect enough. This is most common with variant detection resequencing assays.

Why isn't high redundancy data collection routine?

The challenges with high redundancy data collection in Sanger sequencing involve the high relative costs of collecting data and higher costs of collecting data from single molecules. Next generation (Next Gen) sequencing changes this landscape.

The higher throughput rates and lower costs of Next Gen sequencing hold great promise for revolutionizing genomics research and molecular diagnostics. In a single instrument run, an Expression Sequence Tag (EST) experiment can yield millions of sequences and detect rare transcripts that cannot be found any other way [1-3]. In cancer research, high sampling rates will allow for the detection of rare sequence variants in populations of tumor cells that could be prognostic indicators or provide insights for new therapeutics [1, 4, 5]. In viral assays, it will be possible to determine the sequence of individual viral genomes and detect drug resistant strains as they appear [6, 7]. Next Gen sequencing has considerable appeal because the large numbers of sequences that can be obtained make statistical calculations more valid.

Making statistical calculations valid, however, requires that we understand the inherit uncertainty of our measuring device. In this case, the different Next Gen genetic analyzers. That's where color space, flow space, and other spaces come into play.

1. Meyer, M., U. Stenzel, S. Myles, K. Prufer, and M. Hofreiter, Targeted high-throughput sequencing of tagged nucleic acid samples. Nucleic Acids Res, 2007. 35(15): p. e97.
2. Korbel, J.O., A.E. Urban, J.P. Affourtit, et al., Paired-end mapping reveals extensive structural variation in the human genome. Science, 2007. 318(5849): p. 420-6.
3. Wicker, T., E. Schlagenhauf, A. Graner, T.J. Close, B. Keller, and N. Stein, 454 sequencing put to the test using the complex genome of barley. BMC Genomics, 2006. 7: p. 275.
4. Taylor, K.H., R.S. Kramer, J.W. Davis, J. Guo, D.J. Duff, D. Xu, C.W. Caldwell, and H. Shi, Ultradeep bisulfite sequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing. Cancer Res, 2007. 67(18): p. 8511-8.
5. Highlander, S.K., K.G. Hulten, X. Qin, et al., Subtle genetic changes enhance virulence of methicillin resistant and sensitive Staphylococcus aureus. BMC Microbiol, 2007. 7(1): p. 99.
6. Wang, G.P., A. Ciuffi, J. Leipzig, C.C. Berry, and F.D. Bushman, HIV integration site selection: analysis by massively parallel pyrosequencing reveals association with epigenetic modifications. Genome Res, 2007. 17(8): p. 1186-94.
7. Hoffmann, C., N. Minkah, J. Leipzig, G. Wang, M.Q. Arens, P. Tebas, and F.D. Bushman, DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations. Nucleic Acids Res, 2007. 35(13): p. e91.