
Friday, June 11, 2010

Levels of Quality

Next Generation Sequencing (NGS) data can produce more questions than answers. A recent LinkedIn discussion thread began with a simple question. “I would like to know how to obtain statistical analysis of data in a fastq file? number of High quality reads, "bad" reads....” This simple question opened a conversation about quality values, read mapping, and assembly. Obviously there is more to NGS data quality than simply separating bad reads from good ones.

Different levels of quality
Before we can understand data quality we need to understand what sequencing experiments measure and how the data are collected. In addition to sequencing genomes, many NGS experiments focus on measuring gene function and regulation by sequencing the fragments of DNA and RNA isolated and prepared in different ways. In these assays, complex laboratory procedures are followed to create specialized DNA libraries that are then sequenced in a massively parallel format.

Once the data are collected, they need to be analyzed in both common and specialized ways as defined by the particular application. The first step (primary analysis) converts the image data produced by the different platforms into sequence data (reads). This step, specific to each sequencing platform, also produces a series of quality values (QVs), one value per base in a read, that define the probability that each base is correct. Next (secondary analysis), the reads and bases are aligned to known reference sequences, or, in the case of de novo sequencing, the data are assembled into contiguous units from which a consensus sequence is determined. The final steps (tertiary analysis) involve comparing alignments between samples or searching databases to get scientific meaning from the data.

In each step of the analysis process, the data and information produced can be further analyzed to get multiple levels of quality information that reflect how well instruments performed, if processes worked, or whether samples were pure. We can group quality analyses into three general levels: QV analysis, sequence characteristics, and alignment information.

Quality Value Analysis
Many of the data quality control (QC) methods are derived from Sanger sequencing, where QVs could be used to identify low quality regions that might indicate mixtures of molecules or define areas that should be removed before analysis. QV correlations with base composition could also be used to sort out systematic biases, like high GC content, that affect data quality. Unlike Sanger sequencing, where data in a trace represent an average of signals produced by an ensemble of molecules, NGS provides single data points collected from individual molecules arrayed on a surface. NGS QV analysis uses counting statistics to summarize the individual values collected from the several million reads produced by each experiment.

Examples of useful counting statistics include average QVs by base position, box and whisker (BW) plots, histogram plots of QV thresholds, and overall QV distributions. Average QVs by base, BW plots, and QV thresholds are used to see how QVs trend across the reads. In most cases, these plots show the general trend that data quality decreases toward the 3’ ends of reads. Average QVs by base show the mean QV at each base position with error bars indicating the values within one standard deviation of the mean. BW plots provide additional detail, showing the minimum and maximum QV, the median QV, and the lower and upper quartile QVs for each base. Histogram plots of QV thresholds count the number of bases below threshold QVs (10, 20, 30). This method provides information about potential errors in the data and the data's utility in applications like RNA-seq or genotyping. Finally, distributions of all QVs, or of the average QV per read, can give additional indications of dataset quality.
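To make the counting concrete, here is a minimal sketch, in plain Python with no external libraries, of how per-base average QVs and the fraction of bases below the 10, 20, and 30 thresholds could be tallied. It assumes simple four-line FASTQ records with Phred+33 encoded quality strings, and the input file name "reads.fastq" is hypothetical.

```python
from collections import defaultdict

def fastq_quals(path):
    """Yield one list of Phred quality values per FASTQ record (four-line records assumed)."""
    with open(path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            fh.readline()                                # sequence line
            fh.readline()                                # '+' separator line
            qual_line = fh.readline().rstrip("\n")
            yield [ord(c) - 33 for c in qual_line]       # assumes Phred+33 encoding

def per_base_stats(path, thresholds=(10, 20, 30)):
    """Print mean QV and the fraction of bases below each threshold, per base position."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    below = {t: defaultdict(int) for t in thresholds}
    for quals in fastq_quals(path):
        for pos, q in enumerate(quals):
            sums[pos] += q
            counts[pos] += 1
            for t in thresholds:
                if q < t:
                    below[t][pos] += 1
    for pos in sorted(counts):
        mean_qv = sums[pos] / counts[pos]
        fractions = {t: round(below[t][pos] / counts[pos], 3) for t in thresholds}
        print(pos + 1, round(mean_qv, 1), fractions)

per_base_stats("reads.fastq")                            # hypothetical file name
```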

QV analysis primarily measures sequencing and instrument run quality. For example, sharp drop offs in average QVs can identify systematic issues related to the sequencing reactions. Comparing data between lanes or chambers within a flowcell can flag problems with reagent flow or imaging issues within the instrument. In more detailed analyses, the coordinate positions for each read can be used to reconstruct quality patterns for very small regions (tiles) within a slide to reveal subtle features about the instrument run.

Sequence Characteristics
In addition to QV analysis, we can look at the sequences of the reads to get additional information. If we simply count the numbers of A’s, C’s, G’s, or T’s (or color values) at each base position, we can observe sequence biases in our dataset. Adaptor sequences, for example, will show sharp spikes in the data, whereas random sequences will give us an even distribution, or a bias that reflects the GC content of the organism being analyzed. Single stranded data will often show a separation of the individual base lines; double stranded coverage should have equal numbers of A and T bases and of G and C bases.
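As a rough illustration of this kind of check, the sketch below tallies the base composition at each read position from a FASTQ file. The file name is hypothetical, and the code assumes base-space reads rather than color values.

```python
from collections import Counter, defaultdict

def base_composition(path):
    """Print the fraction of A, C, G, and T at each read position."""
    counts = defaultdict(Counter)            # position -> Counter of observed bases
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                   # FASTQ sequence lines
                for pos, base in enumerate(line.strip().upper()):
                    counts[pos][base] += 1
    for pos in sorted(counts):
        total = sum(counts[pos].values())
        fractions = {b: round(counts[pos][b] / total, 3) for b in "ACGT"}
        print(pos + 1, fractions)

base_composition("reads.fastq")              # hypothetical file name
```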

We can also compare each read to the other reads in the dataset to estimate the overall randomness, or complexity, of our data. Depending on the application, a low complexity dataset, one with a high number of exactly matching reads, can indicate PCR biases or a large number of repeats in the case of de novo sequencing. In other cases, like tag profiling assays, which measure gene expression by sequencing a small fragment from each gene, low complexity data are normal because highly expressed genes will contribute a large number of identical sequences.
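A simple, if crude, way to estimate this is to count exact duplicate reads, as sketched below in plain Python. The file name is hypothetical, and a real pipeline would also consider near-duplicates and platform-specific artifacts.

```python
from collections import Counter

def duplicate_fraction(path):
    """Return the fraction of reads that are exact copies of an earlier read."""
    seqs = Counter()
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                       # FASTQ sequence lines
                seqs[line.strip()] += 1
    total = sum(seqs.values())
    duplicates = total - len(seqs)               # reads beyond the first copy of each sequence
    return duplicates / total if total else 0.0

print("duplicate read fraction:", duplicate_fraction("reads.fastq"))   # hypothetical file name
```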

Alignment Information
Additional sample and data quality can be measured after secondary analysis. Once the reads are aligned (mapped) to reference data sources we can ask questions that reflect both run and sample quality. The overall number of reads that can be aligned to all sources can be used to estimate parameters related to library preparation and the deposition of molecules on the beads or slides used for sequencing. Current NGS processes are based on probabilistic methods for separating DNA molecules. Illumina, SOLiD, and 454 all differ with respect to their separation methods, but share the common feature that the highest data yield occurs when the concentration of DNA is just right. The number of mappable reads can measure this property.

DNA concentration measures one aspect of sample quality. Examining which reference sources reads align to gives further information. For example, the goal of transcriptome analysis is to sequence non-ribosomal RNA. Unfortunately, ribosomal RNA (rRNA) is the most abundant RNA in a cell, so transcriptome assays involve steps to remove rRNA, and a large number of rRNA reads in the data indicates problems with the preparation. In exome sequencing, or other methods where certain DNA fragments are enriched, the ratio of exon (enriched) to non-exon (non-enriched) alignments can reveal how well the purification worked.
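One way such bookkeeping might look in practice is sketched below: a tally of reads per reference category parsed from a SAM-format alignment file. The file name and the category prefixes used to recognize rRNA and adaptor references are assumptions for illustration, not part of any particular pipeline.

```python
from collections import Counter

def alignment_summary(sam_path):
    """Tally reads by the reference source they align to, plus unmapped reads."""
    categories = Counter()
    total = 0
    with open(sam_path) as fh:
        for line in fh:
            if line.startswith("@"):                 # skip SAM header lines
                continue
            total += 1
            fields = line.split("\t")
            flag, rname = int(fields[1]), fields[2]  # FLAG and RNAME columns
            if flag & 0x4 or rname == "*":           # unmapped read
                categories["unmapped"] += 1
            elif rname.startswith("rRNA"):           # assumed naming convention
                categories["rRNA"] += 1
            elif rname.startswith("adaptor"):        # assumed naming convention
                categories["adaptor"] += 1
            else:
                categories["other reference"] += 1
    for name, n in categories.most_common():
        print(name, n, round(n / total, 3))

alignment_summary("aligned_reads.sam")               # hypothetical file name
```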

Read mapping, however, is not a complete way to measure data quality. High quality reads that do not match any reference data in the analysis pipeline could come from unknown laboratory contaminants, from sequences like novel viruses or phage, or from incomplete reference data. Unfortunately, laboratory contamination is the most common of these cases, so it is a good idea to include reference data for all ongoing projects in the analysis pipeline. Alignments to adaptor sequences can reveal issues related to preparation processes and PCR, and the positions of alignments can be used to measure DNA or RNA fragment lengths.

So Many Questions
The above examples provide a short tour of how NGS data can be analyzed to measure the quality of samples, experiments, protocols, and instrument performance. NGS assays are complex and involve multistep lab procedures and data analysis pipelines that are specific to different kinds of applications. Sequence bases and their quality values provide information about instrument runs and some insights into sample and preparation quality. Additional information is obtained after the data are aligned to multiple reference data sources. Data quality analysis is most useful when values are computed shortly after data are collected, and systems, like GeneSifter Lab and Analysis Editions, that automate these analyses are important investments if labs plan to be successful with their NGS experiments.

Sunday, December 6, 2009

Expeditiously Exponential: Genome Standards in a New Era

One of the hot topics of 2009 has been the exponential growth in genomics and other data, and how this growth will impact data use and sharing. The journal Science explored these issues in its policy forum in October. In early November, I discussed the first article, which was devoted to sharing data and data standards. The second article, listed under the category “Genomics,” focuses on how genomic standards need to evolve with new sequencing technologies.

Drafting By

The premise of the article “Genome Project Standards in a New Era of Sequencing” was to begin a conversation about how to define standards for sequence data quality in this new era of ultra-high throughput DNA sequencing. One of the “easy” things to do with Next Generation Sequencing (NGS) technologies is create draft genome sequences. A draft genomic sequence is defined as a collection of contig sequences that result from one, or a few, assemblies of large numbers of smaller DNA sequences called reads. In traditional Sanger sequencing a read was between 400 and 800 bases in length and came from a single clone, or sub-clone, of a large DNA fragment. NGS reads come from individual molecules in a DNA library and vary between 36 and 800 bases in length depending on the sequencing platform being used (454, Illumina, SOLiD, or Helicos).

A single NGS run can now produce enough data to create a draft assembly for many kinds of organisms with smaller genomes such as viruses, bacteria, and fungi. This makes it possible to create many draft genomes quickly and inexpensively. Indeed the article was accompanied by a figure showing that the current growth of draft sequences exceeds the growth of finished sequences by a significant amount. If this trend continues, the ratio of draft to finished sequences will grow exponentially into the foreseeable future.

Drafty Standards

The primary purpose for a complete genome sequence is to serve as a reference for other kinds of experiments. A well annotated reference is accompanied by a catalog of genes and their functions, as well as an ordering of the genes, regulatory regions, and the sequences needed for evolutionary comparisons that further elucidate genomic structure and function. A problem with draft sequences is that they can contain a large number of errors, ranging from incorrect base calls to more problematic mis-assemblies that place bases or groups of bases in the wrong order. Because these holes leave some sequences draftier than others, draft sequences are less useful in fulfilling their purpose as reference data.

If we can describe the draftiness of a genome sequence we may be able to weigh its fitness for a specific purpose. The article went on to tackle this problem by recommending a series of qualitative descriptions that define levels of draft sequences. Beginning with the Standard Draft, an assembly of contigs of unfiltered data from one or more sequencing platforms, the terms move through High-Quality Draft, Improved High-Quality Draft, Annotation-Directed Improvement, and Noncontiguous Finished, to Finished. Finished sequence is defined as having less than 1 error per 100,000 bases, with each genomic unit (chromosomes or plasmids that are capable of replication) assembled into a single contig with a minimal number of exceptions. The individuals proposing these standards are a well respected group in the genome community and are working with the database and sequence ontology groups to incorporate these new descriptions into data submissions and annotations for data that may be used by others.

Given the high cost and lengthy time required to finish genomic sequences, finishing every genome to a high standard is impractical. If we are going to work with genomes that are finished to varying degrees, systematic ways to describe the quality of the data are needed. These policy recommendations are a good start, but more needs to be done to make the proposed standards useful.

First, standards need to be quantitative. Qualitative descriptions are less useful because they create downstream challenges when reference data are used in automated data processing and interpretation pipelines. As the numbers of available genomes grow into the thousands and tens of thousands, subjective standards make the data more and more cumbersome and difficult to review. Moreover, without quantitative assessment, how will one know when they have reached an average error rate of 1 in 100,000 bases? The authors intentionally avoided recommending numeric thresholds in the proposed standards because the instrumentation and sequencing methodologies are changing rapidly. This may be true, but future discussions should nevertheless focus on quantitative descriptions for that very reason. It is because data collection methods and instrumentation are changing rapidly that we need measures we can compare. This is the new world.

Second, the article fails to address how the different standards might be applied in a practical sense. For example, what can I expect to do with a finished genome that I cannot do with a nearly finished genome? What is a standard draft useful for? How much should I trust my results, and what might I need to do to verify a finding? While the article does a good job describing the quality attributes of the data that genome centers might produce, the proposed standards would have broader impact if they more specifically set expectations of what could be done with the data.

Without this understanding, we still won't know when our data are good enough.

Friday, October 23, 2009

Yardsticks and Sequencers

A recent question to the ABRF discussion forum, about quality values and Helicos data, led to an interesting conversation about having yardsticks to compare between Next Generation Sequencing (NGS) platforms and the common assays that are run on those platforms.

It also got me thinking, just how well can you measure things with those free wooden yardsticks you get at hardware stores and home shows?

Background

The conversation started with a question asking about what kind of quality scoring system could be applied to Helicos data. Could something similar to Phred and AB files be used?

A couple of answers were provided. One referred to the recent Helicos article in Nature Biotechnology and pointed out that Helicos has such a method. This answer also addressed the issue that quality values (QVs) need to be tuned for each kind of instrument.

Another answer, from a core lab director with a Helicos instrument, pointed out many more challenges that exist with comparing data from different applications and how software in this area is lacking. He used the metaphor of the yardstick to make the point that researchers need systematic tools and methods to compare data and platforms.

What's in a Yardstick?

I replied to the thread, noting that we've been working with data from 454, Illumina GA, SOLiD, and Helicos, and that there are multiple issues that need to be addressed in developing yardsticks to compare data from different instruments for different experiments (or applications).

At one level, there is the instrument and the data that are produced, and the question is whether we can have a standard quality measure. With Phred, we need to recall that each instrument needed to be calibrated so that quality values would be useful and equivalent across chemistries and platforms (primers, terminators, bigdye, gel, cap, AB models, MegaBACE ...). Remember phredpar.dat? Because the data were of a common type - an electropherogram - we could more or less use a single tool and define a standard. Even then, other tools (LifeTrace, KB basecaller, and LongTrace) emerged and computed standardized quality values differently. So, I would argue that we think we have a measure, but it is not the standard we think it is.

By analogy, each NGS instrument uses a very different method to generate sequences, so each platform will have a unique error profile. The good news is that quality values, as transformed error probabilities, make it possible to compare output from different instruments in terms of confidence. The bad news is that if you do not know how the error probability is computed, or you do not have enough data (control, test) to calibrate the system, error probabilities are not useful. Add to that the fact that the platforms are undergoing rapid change as they improve chemistry and change hardware and software to increase throughput and accuracy. So, for the time being we might have yardsticks, but they have variable lengths.

The next levels deal with experiments. As noted, ChIP-Seq, RNA-Seq, Me-Seq, Re-Seq, and your favorite-Seq all measure different things, and we are just learning how errors and other artifacts interfere with how well the data produced actually measure what the experiment intended to measure. Experiment-level methods need to be developed so that ChIP-Seq from one platform can be compared to ChIP-Seq from another platform, and so on. However, the situation is not dire because, in the end, DNA sequences are the final output, and for many purposes the data produced are much better now than they have been in the past. As we push sensitivity, the issues already discussed become very relevant.

As a last point, the goal many researchers will have is to layer data from one experiment on another, to correlate ChIP-Seq with RNA-Seq for example, and to do that you not only need quality measures for data, sample, and experiment, you also need ways to integrate all of this experimental information with already published data. There is a significant software challenge ahead and, as pointed out, cobbling solutions together is not a feasible long-term answer. The datasets are getting too big and complex, and at the same time the archives are bursting with data generated by others.

So what does this have to do with yardsticks?

Back to yardsticks. Those cheap wooden yardsticks expand and contract with temperature and humidity, so at different times a yardstick's measurements will change. This change is the uncertainty of the measurement (see additional reading below), which defines the precision of our measuring device. If I want a quick estimate of how tall my dog stands, I would happily use the wooden yardstick. However, if I want to measure something to within a 32nd of an inch or a millimeter, I would use a different tool. The same rules apply to DNA sequencing: for many purposes the reads are good enough and data redundancy overcomes errors, but as we push sensitivity and want to measure changes in fewer molecules, discussions about how to compute QVs and annotate data, so that we know which measuring device was used, become very important.

Finally, I often see statements in the literature and company brochures, and hear comments in conversation, that refer to QVs as Phred scores. Remember: only Phred makes Phred QVs - everything else is Phred-like, and only if it is a -10log(P) transformation of an error probability.

Additional Reading:

Color Space, Flow Space, Sequence Space, or Outer Space: Part I. Uncertainty in DNA Sequencing

Color Space, Flow Space, Sequence Space or Outer Space: Part II, Uncertainty in Next Gen Data

Sunday, November 9, 2008

Next Gen-Omics

Advances in Next Gen technologies have led to a number of significant papers in recent months, highlighting their potential to advance our understanding of cancer and human genetics (1-3). These and hundreds of other papers demonstrate the value of Next Gen sequencing. The work completed thus far has been significant, but much more needs to be done to make these new technologies useful for a broad range of applications. Experiments will get harder.

While much of the discussion in the press focuses on rapidly sequencing human genomes for low cost as part of the grail of personalized genomics (4), a vast amount of research must be performed at the systems level to fully understand the relationship between biochemical processes in a cell and how the instructions for the processes are encoded in the genome. Systems biology and a plethora of "omics" have emerged to measure multiple aspects of cell biology as DNA is transcribed into RNA and RNA translated into protein and proteins interact with molecules to carry out biochemistry.

As noted in the last post, we are developing proposals to further advance the state of the art in working with Next Gen data sets. In one of those proposals, Geospiza will develop novel approaches to work with data from applications of Next Gen sequencing technologies that are being developed to study the omics of DNA transcription and gene expression.

Toward furthering our understanding of gene expression, Next Gen DNA sequencing is being used to perform quantitative assays where DNA sequences are used as highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data to obtain quantitative information by tabulating the frequency, positional information, and variation from the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or in populations, are compared in different ways to make discoveries and draw experimental conclusions. Recall the three phases of data analysis.

However, to be useful these data sets need to come from experiments that measure what we think they should measure. The data must be high quality and free of artifacts. In order to compare quantitative information between samples, the data sets must be refined and normalized so that biases introduced through sample processing are accounted for. Thus, a fundamental challenge to performing these kinds of experiments is working with the data sets that are produced. In this regard numerous challenges exist.

The obvious ones relating to data storage and bioinformatics are being identified in both the press and scientific literature (5,6). Other, less published, issues include a lack of:
  • standard methods and controls to verify datasets in the context of their experiments,
  • standardized ways to describe experimental information and
  • standardized quality metrics to compare measurements between experiments.
Moreover, data visualization tools and other user interfaces, if available, are primitive and significantly slow the pace at which a researcher can work with the data. Finally, information technology (IT) infrastructures that can integrate the system parts dealing with sample tracking, experimental data entry, data management, data processing, and result presentation are incomplete.

We will tackle the above challenges by working with the community to develop new data analysis methods that can run independently and within Geospiza's FinchLab. FinchLab handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible order interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. Geospiza's hosted (Software as a Service [SaaS]) delivery models remove additional IT barriers.

FinchLab's data management and analysis server make the system scalable through a distributed architecture. The current implementation of the analysis server creates a complete platform to rapidly prototype new data analysis workflows and will allow us to quickly devise and execute feasibility tests, experiment with new data representations, and iteratively develop the needed data models to integrate results with experimental details.

References

1. Ley, T. J., Mardis, E. R., Ding, L., Fulton, B., et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66-72 (2008).

2. Wang, J., Wang, W., Li, R., Li, Y., et al. The diploid genome sequence of an Asian individual. Nature 456, 60-65 (2008).

3. Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53-59 (2008).

4. My genome. So what? Nature 456, 1 (2008).

5. Prepare for the deluge. Nature Biotechnology 26, 1099 (2008).

6. Byte-ing off more than you can chew. Nature Methods 5, 577 (2008).

Wednesday, April 16, 2008

Expectations Set the Rules

Genetic analysis workflows are complex. Biology is non-deterministic, so we continually experience new problems. Lab processes and our data have natural uncertainty. These factors conspire against us to make our world rich in variability and processes less than perfect.

That keeps things interesting.

In a previous post, I was able to show how sequence quality values could be used to summarize the data for a large resequencing assay. Presenting "per read" quality values in a grid format allowed us to visualize samples that had failed as well as observe that some amplicons contained repeats that led to sequencing artifacts. We also were able to identify potential sample tracking issues and left off with an assignment to think about how we might further test sample tracking in the assay.

When an assay is developed there are often certain results that can be expected. Some results are defined explicitly with positive and negative controls. We can also use assay results to test that the assay is producing the right kinds of information. Do the data make sense? Expectations can be derived from the literature, an understanding of statistical outcomes, or internal measures.

Genetic assays have common parts

A typical genetic resequencing assay is developed from known information. The goal is to collect sequences from a defined region of DNA for a population of individuals (samples) and use the resulting data to observe the frequency of known differences (variants) and identify new patterns of variation. Each assay has three common parts:

Gene Model - Resequencing and genotyping projects involve comparative analysis of new data (sequences, genotypes) to reference data. The Gene Model can be a chromosomal region or specific gene. A well-developed model will include all known genotypes, protein variations, and phenotypes. The Gene Model represents both community (global) and laboratory (local) knowledge.

Assay Design - The Assay Design defines the regions in the Gene Model that will be analyzed. These regions, typically prepared by PCR, are bounded by unique DNA primer sequences. The PCR primers have two parts: one part is complementary to the reference sequence (black in the figure), the other part is "universal" and is complementary to a sequencing primer (red in the figure). The study includes information about patient samples such as their ethnicity, collection origin, and phenotypes associated with the gene(s) under study.

Experiments / Data Collection / Analysis - Once the study is designed and materials arrive, samples are prepared for analysis. PCR is used to amplify specific regions for sequencing or genotyping. After a scientist is confident that materials will yield results, data collection begins. Data can be collected in the lab or the lab can outsource their sequencing to core labs or service companies. When data are obtained, they are processed, validated, and compared to reference data.

Setting expectations

A major challenge for scientists doing resequencing and genotyping projects arises when trying to evaluate data quality and determine the “next steps.” Rarely does everything work. We've already talked about read quality, but there are also the questions of whether the data are mapping to their expected locations, and whether the frequencies of observed variation are expected. The Assay Design can be used to verify experimental data.

The Assay Design tells us where the data should align and how much variation can be expected. For example, if the average SNP frequency is 1/1300 bases, and an average amplicon length is 580 bases, we should expect to observe one SNP for every two amplicons. Furthermore, in reads where a SNP may be observed, we will see the difference in a subset of the data because some, or most, of the reads will have the same allele as the reference sequence.
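The expectation above is just arithmetic; a quick check, using the figures from this post, looks like this:

```python
# Back-of-the-envelope check of the expected SNP rate per amplicon,
# using the numbers given in the post.
snp_per_base = 1 / 1300.0            # average SNP frequency: 1 per 1,300 bases
amplicon_length = 580                # average amplicon length in bases
expected_snps_per_amplicon = snp_per_base * amplicon_length
print(round(expected_snps_per_amplicon, 2))   # ~0.45, i.e. about one SNP per two amplicons
```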

To test our expectations for the assay, the 7488 read data set is summarized in a way that counts the frequency of disagreements between read data and their reference sequence. The graph below shows a composite of read discrepancies (blue bar graph) and average Q20/rL, Q30/rL, Q40/rL values (colored line graphs). Reads are grouped according to the number of discrepancies observed (x-axis). For each group, the count of reads (bar height) and average Q20/rL (green triangles), Q30/rL (yellow squares), and Q40/rL (purple circles) are displayed.
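For readers who want to reproduce this kind of summary, here is a minimal sketch of the tabulation, assuming each aligned read has already been reduced to a (discrepancy count, Q20/rL) pair; the sample records are hypothetical.

```python
from collections import defaultdict

def discrepancy_summary(read_records):
    """Group reads by discrepancy count; report group size and mean Q20/rL."""
    groups = defaultdict(list)
    for discrepancies, q20_over_rl in read_records:
        groups[discrepancies].append(q20_over_rl)
    for d in sorted(groups):
        values = groups[d]
        print(d, len(values), round(sum(values) / len(values), 2))

# Hypothetical (discrepancy count, Q20/rL) pairs for illustration.
discrepancy_summary([(0, 0.92), (0, 0.88), (1, 0.81), (3, 0.55), (7, 0.30)])
```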


In the 7488 read data set, 95% (6914) of the reads gave alignments. Of the aligned data, 82% of the reads had between 0 and 4 discrepancies. If we were to pick which traces to review and which samples to redo, we would likely focus our review on the data in this group and queue the rest (18%) for redos to see if we could improve the data quality.

Per our previous prediction, most of the data (5692 reads) do not have any discrepancies. We also observe that the number of discrepancies increases as the overall data quality decreases. This is expected because the quality values are reflecting the uncertainty (error) in the data.

Spotting tracking issues

We can also use our expectations to identify sample tracking issues. Once an assay is defined, the positions of all of the PCR primers are known, hence we should expect that our sequence data will align to the reference sequence in known positions. In our data set, this is mostly true. Similar to the previous quality plots of samples and amplicons, an alignment "quality" can be defined and displayed in a table where the rows are samples and columns are amplicons. Each sample has two rows (one forward and one reverse sequence). If the cells are colored according to alignment start positions (green for expected, red for unexpected, white for no alignment) we can easily spot which reads have an "unexpected" alignment. The question then becomes, where/when did the mixup occur?

From these kinds of analyses we can get a feel for whether a project is on track and whether there are major issues that will make our lives harder. In future posts I will comment on other kinds of measures that can be made and show you how this work can be automated in FinchLab.

Tuesday, April 8, 2008

Exceptions are the Rule

Genetic analysis workflows are complex. You can expect that things will go wrong in the laboratory. Biology also manages to interfere and make things harder than you think they should be. Your workflow management system needs to show the relevant data, allow you to observe trends, and have flexible points where procedures can be repeated.

In the last few posts, I introduced genetic analysis workflows, concepts about lab and data workflows, and discussed why it is important to link the lab and data workflows. In this post I expand on the theme and show how a workflow system like the Geospiza FinchLab can be used to troubleshoot laboratory processes.

First, I'll review our figure from last week. Recall that it summarized 4608 paired forward / reverse sequence reads. Samples are represented by rows, and amplicons by columns, so that each cell represents a single read from a sample and one of its amplicons. Color is used to indicate quality, with different colors showing the number of Q20 bases divided by the read length (Q20/rL). Green is used for values between 0.60 and 1.00, blue for values between 0.30 and 0.59, and red for values less than 0.29. The summary showed patterns that indicated lab failures and biological issues. You were asked to figure them out. Eric from seqanswers (a cool site for Next Gen info) took a stab at this, and got part of the puzzle solved.

Sample issues

Rows 1,2 and 7,8 show failed samples. We can spot this because of the red color across all the amplicons. Either the DNA preps failed to produce DNA, or something interfered with the PCR. Of course there are those pesky working reactions for both forward and reverse sequence in sample 1 column 8. My first impression is that there is a tracking issue. The sixth column also has one reaction that worked. Could this result indicate a more serious problem in sample tracking?


Amplicon issues

In addition to the red rows, some columns show lots of blue spots; these columns correspond to amplicons 7, 24 and 27. Remember that blue is an intermediate quality. An intermediate quality could be obtained if part of the sequence is good and part of the sequence is bad. Because the columns represent amplicons, when we see a pattern in a column it likely indicates a systematic issue for that amplicon. For example, in column 7, all of the data are intermediate quality. Columns 24 and 27 are more interesting because the striping pattern indicates that one sequencing reaction results in data with intermediate quality while the other looks good. Wouldn't it be great if we could drill down from this pattern and see a table of quality plots and also get to the sequence traces?


Getting to the bottom

In FinchLab we can drill down and view the underlying data. The figure below summarizes the data for amplicon 24. The panel on the left is the expanded heat map for the data set. The panel on the right is a folder report summarizing the data from 192 reads for amplicon 24. It contains three parts: an information table that provides an overview of the details for the reads, a histogram plot that counts how many reads fall within a certain range of Q20 values, and a data table that summarizes each read in a row containing its name, the number of edit revisions, its Q20 and Q20/rLen values, and a thumbnail quality plot showing the quality values for each base in the read. In the histogram, you can see that two distinct peaks are observed. About half the data have low Q20 values and half have high Q20 values, producing the striping pattern in the heat map. The data table shows two reads; one is the forward sequence and the other is its "reverse" pair. These data were brought together using the table's search function, in the "finder" bar. Note how the reads could fit together if one picture was reversed.

Could something in the sequence be interfering with the sequencing reaction?

To explore the data further, we need to look at the sequences themselves. We can do this by clicking the name and viewing the trace data online in our web browser, or we could click the FinchTV icon and view the sequence in FinchTV (bottom panel of the figure above). When we do this for the top read (left most trace) we see that, sure enough, there is a poly-T tract that we are not getting through. During PCR such regions can cause "drop outs" and result in mixtures of molecules that differ in size by one or two bases. A hallmark of such a problem is a sudden drop in data quality at the end of the polynucleotide tract because the mixture of molecules creates a mess of mixed bases. This explanation is confirmed by the other read. When we view it in FinchTV (right most trace) we see poor data at the end of the read. Remember these data are reversed relative to the first read, so when we reverse complement the trace (middle trace), we see that it "fits" together with the first read. A problem for such amplicons is that we now have only single stranded coverage. Since this problem occurred at the end of the read, half of the data are good and the other half are poor quality. If the problem had occurred in the middle of the read, all of the data would show an intermediate quality like amplicon 7.

In genetic analysis data quality values are an important tool for assessing many lab and sample parameters. In this example, we were able to see systematic sample failures and sequence characteristics that can lead to intermediate quality data. We can use this information to learn about biological issues that interfere with analysis. But what about our potential tracking issue?

How might we determine if our samples are being properly tracked?

Friday, April 4, 2008

Lab work without data analysis and management is doo doo

As we begin to contemplate next generation sequence data management, we can use Sanger sequencing to teach us important lessons. One of these is the value of linking laboratory and data workflows so that we can view information in the context of our assays and experiments.

I have been fortunate to hear J. Michael Bishop speak on a couple of occasions. He ended these talks by quoting one of his biochemistry mentors, "genetics without biochemistry is doo doo." In a similar vein, lab work without data analysis and management is doo doo. That is, when you separate the lab from the data analysis, you have to work through a lot of doo to figure things out. Without a systematic way to view summaries of large data sets, the doo is overwhelming.

To illustrate, I am going to share some details about a resequencing project we collaborated on. We came to this project late, so much of the data had been collected, and there were problems, lots of doo. Using Finch however, we could quickly organize and analyze the data, and present information in summaries with drill downs to the details to help troubleshoot and explain observations that were seen in the lab.

10,686 sequence reads: forward / reverse sequences from 39 amplicons from 137 individuals

The question being asked in this project was: are there new variants in a gene that are related to phenotypes observed in a specialized population? This is the kind of question medical researchers ask frequently. Typically they have a unique collection of samples that come from a well understood population of individuals. Resequencing is used to interrogate the samples for rare variants, or genotypes.

In this process, we purify DNA from sample material (blood), and use PCR with exon specific probes to amplify small regions of DNA within the gene. The PCR primers have regions called universal adaptors. Our sequencing primers will bind to those regions. Each PCR product, called an amplicon, is sequenced twice, once from each strand to give double coverage of the bases.

When we do the math, we will have to track the DNA for 137 samples and 5343 amplicons. Each amplicon is sequenced at least twice, to give us 10,686 reads. From a physical materials point of view, that means 137 tubes with sample, 56 96-well plates for PCR, and 112 96-well plates for sequencing. In a 384-well format we could have used 14 plates for PCR and 28 plates for sequencing. For a genome center, this level of work is trivial, but for a small lab this is significant work and things can happen. Indeed, as not all the work is done in a single lab, the process can be more complex. And you need to think about how you would lay this out - 96 does not divide by 39 very well.

From a data perspective, we can use sequence quality values to identify potential laboratory and biological issues. The figure below summarizes 4608 reads. Each pair of rows is one sample (forward / reverse sequence pairs, alternating gray and white - 48 total). Each column is an amplicon. Each cell in the table represents a single read from an amplicon and sample. Color is used to indicate quality. In this analysis, quality is defined as the ratio of Q20 to read length (Q20/rL), which works very well for PCR amplicons. The better the data, the closer this ratio is to one. In the table below, green indicates Q20/rL values between 0.60 and 1.00, blue indicates values between 0.30 and 0.59, and red indicates Q20/rL values less than 0.29. The summary shows patterns that, as we will learn next week, show lab failures and biological issues. See if you can figure them out.
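As a small illustration, the sketch below computes the Q20/rL ratio for a read and assigns the heat-map color using the bin edges given above. The example quality values are made up; real values would come from the Phred quality files for each read.

```python
def q20_over_read_length(quals):
    """Fraction of bases in a read with a quality value at or above 20 (the Q20 count / read length)."""
    if not quals:
        return 0.0
    return sum(1 for q in quals if q >= 20) / len(quals)

def heat_map_color(ratio):
    """Map a Q20/rL ratio to the colors used in the summary table."""
    if ratio >= 0.60:
        return "green"
    elif ratio >= 0.30:
        return "blue"
    else:
        return "red"

example_read = [35, 32, 30, 28, 12, 9, 40, 38, 22, 15]   # made-up quality values
ratio = q20_over_read_length(example_read)
print(round(ratio, 2), heat_map_color(ratio))             # 0.7 green
```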

Monday, March 31, 2008

Next Gen, Next Step

Congratulations! You just got approval to purchase your next generation sequencer! What are you going to do next?

Today, there is a lot being written about the data deluge accompanying Next Gen sequencers. It's true, they produce a lot of data. But even more important are the questions about how you plan to set up the lab and data workflows to turn those precious samples into meaningful information. The IT problems, while significant, are only the tip of the iceberg. If you operate a single lab, you will need to think about your experiments, how to track your samples, how to prepare DNA for analysis, how to move the data around for analysis, and how to do your analyses to get meaningful information out of the data. If you operate a core lab, you have all the same problems, but you're providing that service for a whole community of scientists. You'll need to keep their samples and data separated and secure. You also have to figure out how to get the data to your customers and how you might help them with their analyses.

Never mind that you need multi terabytes of storage and a computer cluster. Without a plan and strategy for running your lab, organizing the data, running multistep analysis procedures, and sifting through 100's of thousands of alignments, you'll just end up with a piece of lab art: a Next Gen sequencer, a big storage system and a computer cluster. (By the way, have you found a place for this yet?) It may look nice, but that's probably not what you had in mind.

To get the most of out of your investment, you'll need to think about workflows, and how to manage those workflows.


The cool thing about Next Gen technology is the kinds of questions that can be asked with the data. This requires both novel ways to work with DNA and RNA and novel ways to work with the data. We call those procedures "workflows." Simply put, a workflow describes a multistep procedure and its decision points. In each step, we work with materials, and the materials may be "transformed" in the step. You can also describe a workflow as a series of steps that have inputs and outputs. Workflows are run both in the lab and on the computer.

In a protocol for isolating DNA, we can take tissue (the input), lyse the cells with detergent, bind the DNA to a resin, wash away junk, and elute purified DNA (the output). The purified DNA may then become an input to a next step, like PCR, to create an output, like a collection of amplicons. Similar processes can be used with RNA. In a Next Gen lab workflow, you fragment the DNA, ligate adaptors, and use the adaptors to attach DNA to beads or regions of a slide. From a few basic lab workflows, we can prepare genetic material for whole genome analysis, expression analysis, variation analysis, gene regulation, and other experiments in both discovery and diagnostic assays.

In a software workflow, data are the material. Input data, typically packaged in files, are processed by programs to create output data. These data or information can also be packaged in files or even stored in databases. Software programs execute the steps and scripts often automate series of steps. Digital photography, multimedia production, and business processes all have workflows. So does bioinformatics. The difference is that bioinformatics workflows lack standards so many people work harder than needed and spend a lot of time debugging things.

As the scale increases, the lab and analysis workflows must be managed together.


A common laboratory practice has been to collect the data and then analyze the data in separate, independent steps. Lab work is often tracked on paper, in Excel spreadsheets, or in a LIMS (Laboratory Information Management System). The linkage between lab processes, raw data, and final results is typically poor. In small projects, this is manageable. File naming conventions can track details and computer directories (folders) can be used to organize data files. But as the scale grows, the file names get longer and longer, people spend considerable time moving and renaming data, the data start to get mixed up, become harder to find, and for some reason files start to replicate themselves. Now, the lab investigates tracking problems and lost data, instead of doing experiments.

Why? Because the lab and data analysis systems are disconnected.

The good news is that Geospiza Finch products can link your lab procedures and the data handling procedures to create complete workflows for genetic analysis.

Monday, March 24, 2008

Color Space, Flow Space, Sequence Space or Outer Space: Part II, Uncertainty in Next Gen Data

Next generation sequencing will transform sequencing assays and experiments. Understanding how the data are generated is important for interpreting results.

In my last post, I discussed how all measurement systems have uncertainty (error) and how error probability is determined in Sanger sequencing. In this post, I discuss the current common next generation (Next Gen) technologies and their measurement uncertainty.

The Levels of Data - To talk about Next Gen data and errors, it is useful to have a framework for describing these data. In 2005, the late Jim Gray and colleagues published a report addressing scientific data management in the coming decade. The report defines data in three levels (Level 0, 1, and 2). Raw data (Level 0), the immediate product of analytical instruments, must be calibrated and transformed into Level 1 data. Level 1 data are then combined with other data to facilitate analysis (Level 2 datasets). Interesting science happens when we make different combinations of Level 2 data and have references to Level 1 data for verification.

How does our data level framework apply to the DNA sequencing world?

In Sanger sequencing, Level 0 data are the raw analog signal data that are collected from the laser and converted to digital information. Digital signals, displayed as waveforms, are mobility corrected and basecalled to produce Level 1 data files that contain information about the collection event, the DNA sequence (read), quality values, and averaged intensity signals. Read sequences are then combined together, or with reference data, to produce Level 2 data in the form of aligned sequence datasets. These Level 2 data have many context specific meanings. They could be a list of annotations, a density plot in a genome browser view, an assembly layout, or a panel of discrepancies that use quality values (from Level 1) to distinguish true genetic variation from errors (uncertainty) in the data.

Next Gen data are different.

When the "levels of data" framework is used to explore Next Gen sequencing technology, we see fundamental differences between Level 0 and Level 1 data . In Sanger sequencing, we perform a reaction to create a mixture of fluorescently tagged molecules that differ in length by a single base. The mixture is then resolved by electrophoresis with continual detection of these size separated tagged DNA molecules to ultimately create a DNA sequence. Uncertainties in the basecalls are related to electrophoresis conditions, compressions due to DNA secondary structure, and the fact that some positions have mixed bases because the sequencing reactions contain a collection of molecules.

In Next Gen sequencing, molecules are no longer separated by size. Sequencing is done "in place" on DNA molecules that were amplified from single molecules. These amplified DNA molecules are either anchored to beads (that are later randomly bound to slides or picoliter wells) or anchored to random locations on slides. Next Gen reactions then involve multiple cycles of chemical reaction (flow) followed by detection. The sample preparation methods and reaction/detection cycles are where the current Next Gen technologies greatly differ and have their unique error profiles.

Next Gen Level 0 data change from electropherograms to huge collections of images produced through numerous cycles of base (or DNA) extensions and image analysis. Unlike Sanger sequencing, where both raw and processed data can be stored relatively easily in single files, Next Gen Level 0 data sets can be quite large and multiple terabytes can be created per run. Presently there is active debate on the virtues of storing Level 0 data (the Data Life Cycle). The Level 0 image collections are used to create Level 1 data. Level 1 data are a continuum of information that are ultimately used to produce reads and quality values. Next Gen quality values reflect the basic underlying sequencing chemistry and primary error modes, and thus calculations need to be optimized for each platform. For those of you who are familiar with phred, these are analogous to the settings in the phredpar.dat file. The other feature of Level 1 data is that they can be expressed differently.


The Spaces

Flow Space - The Roche 454 technology is based on pyrosequencing [1], and the measured signals are a function of how many bases are incorporated in a base addition cycle. In pyrosequencing, we measure the pyrophosphate (PPi) that is released when a nucleotide is added to a growing DNA chain. If multiple bases are added in a cycle, two or more A's for example, proportionally more PPi is released and the light detected is more intense. As more bases are added in a row, the relative increase in light decreases; an 11/10 change ratio, for example, is much lower than 2/1. Consequently, when there are longer runs of the same base (e.g. AAAAA), it becomes harder to count the number of bases accurately, and the error rate increases. Flow space describes a sequence in terms of base incorporations. That is, the data are represented as an ordered string of bases plus the number of bases at each base site. The 454 software performs alignment calculations in flow space.
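To make the idea concrete, here is a toy sketch of representing a read in a flow-space-like form as ordered (base, run length) pairs. This is purely illustrative and is not the 454 software's actual internal encoding.

```python
from itertools import groupby

def to_flow_space(seq):
    """Represent a sequence as (base, homopolymer run length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(seq)]

print(to_flow_space("GGAAAACTT"))   # [('G', 2), ('A', 4), ('C', 1), ('T', 2)]
```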

Color Space - The Applied Biosystems SOLiD technology uses DNA ligation [2] with collections of synthetic DNA molecules (oligos) that contain two nested known bases and a fluorescent dye. In each cycle these two bases are read at intervals of five bases. Getting full sequence coverage (25, 35, or more bases) requires multiple ligation and priming cycles such that each base is sequenced twice. The SOLiD technology uses this redundancy to decrease the error probability. The Level 1 sequence is reported in "color space," where the numbers 0, 1, 2, and 3 represent the fluorescent colors. Color space must be decoded into a DNA sequence at the final stage of data processing. Like flow space, it is best to perform alignments using data in color space. Since each color represents two bases, decoding color space requires that the first base be known. This base comes from the adapter.
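A toy decoding sketch is shown below. The dibase-to-color table follows the commonly published SOLiD two-base encoding, but treat it as illustrative rather than vendor documentation; the important point is that each base depends on the previous one, so the known adapter base anchors the whole decode.

```python
# Dibase -> color assignments as commonly published for SOLiD (illustrative).
COLOR_TABLE = {
    ("A", "A"): 0, ("C", "C"): 0, ("G", "G"): 0, ("T", "T"): 0,
    ("A", "C"): 1, ("C", "A"): 1, ("G", "T"): 1, ("T", "G"): 1,
    ("A", "G"): 2, ("G", "A"): 2, ("C", "T"): 2, ("T", "C"): 2,
    ("A", "T"): 3, ("T", "A"): 3, ("C", "G"): 3, ("G", "C"): 3,
}
# Invert the table: (previous base, color) -> next base.
DECODE = {(prev, color): nxt for (prev, nxt), color in COLOR_TABLE.items()}

def decode_color_space(first_base, colors):
    """Decode a list of color calls (0-3) given the known first (adapter) base."""
    bases = [first_base]
    for color in colors:
        bases.append(DECODE[(bases[-1], color)])
    return "".join(bases)

print(decode_color_space("T", [3, 0, 1, 2]))   # -> "TAACT"
```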

Sequence Space - Illumina’s Solexa technology uses single base extensions of fluorescently labeled nucleotides with protected 3'-OH groups. After base addition and detection, the 3'-OH is deprotected and the cycle repeated. The error probabilities are calculated in different ways, by first analyzing the raw intensity files (remember firecrest and bustard) and then by alignment with the ELAND program to compute an empirical error probability. With Solexa data, errors occur more frequently at the ends of reads.

So, what does this all mean?

Next Gen technologies are all different in the way data are produced, they all produce different kinds of Level 1 data, and the Level 1 data are best analyzed within their own unique "space." This can have data integration implications depending on how you might want to combine data sets (Level 1 vs Level 2). Clearly, it is important to have easy access to quality values and error rates to help troubleshoot issues with runs and samples. The challenge is sifting out the important data from a morass of sequence files, quality files, data encodings, and other vagaries of the instruments and their software. Geospiza is good at this and we can help.

In terms of science, many assays and experiments, like Tag and Count or whole genome assembly, can use redundancy to overcome random data errors. Data redundancy can also be used to develop rules to validate variations in DNA sequences. But this is the tip of the iceberg. With Next Gen's extremely high sampling rates and molecular resolution, we can begin to think of finding very rare differences in complex mixtures of DNA molecules and using sequencing as a quantitative assay. In these cases, understanding how the data are created and their corresponding uncertainties is a necessary step toward making full use of the data. The good news is that there is some good work being done in this area.

Remember, the differences we measure must be greater than the combined uncertainty of our measurements.

Further reading
1. Ronaghi, M., M. Uhlen, and P. Nyren, "A sequencing method based on real-time pyrophosphate." Science, 1998. 281(5375): p. 363, 365.

2. Shendure, J., G.J. Porreca, N.B. Reppas, et al., "Accurate multiplex polony sequencing of an evolved bacterial genome." Science, 2005. 309(5741): p. 1728-32.

3. "Quality scores and SNP detection in sequencing-by-synthesis systems." http://www.genome.org/cgi/content/abstract/gr.070227.107v2

4. "Scientific data management in the coming decade." http://research.microsoft.com/research/pubs/view.aspx?tr_id=860

5. Solexa quality values. http://rcdev.umassmed.edu/pipeline/Alignment%20Scoring%20Guide%20and%20FAQ.html
http://rcdev.umassmed.edu/pipeline/What%20do%20the%20different%20files%20in%20an%20analysis%20directory%20mean.html

Monday, March 17, 2008

Color Space, Flow Space, Sequence Space, or Outer Space: Part I. Uncertainty in DNA Sequencing

Next generation DNA sequencing introduces new concepts like color space, flow space, and sequence space. You might ask, what's a space? How do I deal with these spaces? Why are they important?

In this two part blog, I will first talk about error analysis in DNA sequencing. Next I will talk about how we might think about error analysis in next generation sequencing.

Last week I came across a story about an MIT physics professor, Walter Lewin, who captivates his student audiences with his lectures and creative demonstrations. MIT and iTunes have 100 of his lectures online. I checked out the first one - your basic first college physics lecture that focuses on measurement and dimensional analysis - and agree, Lewin is captivating. I watched the entire lecture, and it made me think about DNA sequencing.

In the lecture, Lewin proves "physics works!" and shows how his grandmother was right when she said that you are an inch taller when lying down than when standing up. He used a student subject and measured his length lying down and standing up. Sure enough, the student was an inch longer lying down. But that was not the point. The point was that Lewin proved his grandmother was right because the change in the student's length was greater than the uncertainty of his measuring device (the ruler). Every measurement we make has uncertainty, or error, and for a comparison to be valid the difference in measures has to be greater than their combined uncertainties.

What does this have to do with DNA sequencing?

Each time we collect DNA sequence data we are making many measurements. That is, we are determining the bases of a DNA sample template in an in vitro replication process that allows us to "read" each base of the sequence. The measurements we collect, the string of DNA bases, therefore have uncertainty. We call this uncertainty in base measurement the error probability. In Sanger sequencing, Phil Green and Brent Ewing developed the Phred basecalling algorithm to measure per base error probabilities.

Error probabilities are small numbers (1/100, 1/10,000, 1/1,000,000). Rather than work with small fractions and decimal values with many leading zeros, we express error probabilities as positive whole integers, called quality values (QVs), by applying a transformation:

QV = -10*log10(P), where P is the error probability.

With this transformation our 1/100, 1/10,000, and 1/1,000,000 error probabilities become QVs of 20, 40, and 60, respectively.
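For readers who like to see it as code, the transformation and its inverse are a one-liner each; the probabilities below are the ones used in this post.

```python
import math

def qv(error_probability):
    """Convert an error probability to a Phred-style quality value."""
    return -10 * math.log10(error_probability)

def error_probability(qv_value):
    """Convert a quality value back to an error probability."""
    return 10 ** (-qv_value / 10.0)

for p in (1/100, 1/10000, 1/1000000):
    print(p, "->", round(qv(p)))        # 20, 40, 60
```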

The Phred basecalling algorithm has had a significant impact on DNA sequencing because it demonstrated that we could systematically measure the uncertainty of each base determination in a DNA sequence. Over the past 10 years, Phred quality values have been calibrated through many resequencing projects and are thus statistically powerful. An issue with Phred, and any basecaller, however, is that it must be calibrated for different electrophoresis instruments (measurement devices), and that is why different errors and error rates can be observed with different combinations of basecallers and instruments.

Sequencing redundancy also reduces error probabilities

The gold standard in DNA sequencing is to sequence both strands of a DNA molecule. This is for good reason. Each strand represents an independent measurement. If our measurements agree, they can be better trusted, and if they disagree one needs to look more closely at the underlying data, or remeasure. This concept was also incorporated into Green's assembly program Phrap (unpublished).

Within the high throughput genomics community it is well understood that increasing the redundancy of data collection reduces error. In theory, one can automate the interpretation of DNA sequencing experiments, or assays, by collecting data at sufficient redundancy. The converse is also true, and I see people work the hardest at manually reviewing data when they do not collect enough. This is most common with variant detection resequencing assays.

Why isn't high redundancy data collection routine?

The challenges with high redundancy data collection in Sanger sequencing involve the high relative costs of collecting data and higher costs of collecting data from single molecules. Next generation (Next Gen) sequencing changes this landscape.

The higher throughput rates and lower costs of Next Gen sequencing hold great promise for revolutionizing genomics research and molecular diagnostics. In a single instrument run, an Expressed Sequence Tag (EST) experiment can yield millions of sequences and detect rare transcripts that cannot be found any other way [1-3]. In cancer research, high sampling rates will allow for the detection of rare sequence variants in populations of tumor cells that could be prognostic indicators or provide insights for new therapeutics [1, 4, 5]. In viral assays, it will be possible to determine the sequence of individual viral genomes and detect drug resistant strains as they appear [6, 7]. Next Gen sequencing has considerable appeal because the large numbers of sequences that can be obtained make statistical calculations more valid.

Making statistical calculations valid, however, requires that we understand the inherent uncertainty of our measuring device, in this case the different Next Gen genetic analyzers. That's where color space, flow space, and other spaces come into play.

Further Reading
1. Meyer, M., U. Stenzel, S. Myles, K. Prufer, and M. Hofreiter, Targeted high-throughput sequencing of tagged nucleic acid samples. Nucleic Acids Res, 2007. 35(15): p. e97.
2. Korbel, J.O., A.E. Urban, J.P. Affourtit, et al., Paired-end mapping reveals extensive structural variation in the human genome. Science, 2007. 318(5849): p. 420-6.
3. Wicker, T., E. Schlagenhauf, A. Graner, T.J. Close, B. Keller, and N. Stein, 454 sequencing put to the test using the complex genome of barley. BMC Genomics, 2006. 7: p. 275.
4. Taylor, K.H., R.S. Kramer, J.W. Davis, J. Guo, D.J. Duff, D. Xu, C.W. Caldwell, and H. Shi, Ultradeep bisulfite sequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing. Cancer Res, 2007. 67(18): p. 8511-8.
5. Highlander, S.K., K.G. Hulten, X. Qin, et al., Subtle genetic changes enhance virulence of methicillin resistant and sensitive Staphylococcus aureus. BMC Microbiol, 2007. 7(1): p. 99.
6. Wang, G.P., A. Ciuffi, J. Leipzig, C.C. Berry, and F.D. Bushman, HIV integration site selection: analysis by massively parallel pyrosequencing reveals association with epigenetic modifications. Genome Res, 2007. 17(8): p. 1186-94.
7. Hoffmann, C., N. Minkah, J. Leipzig, G. Wang, M.Q. Arens, P. Tebas, and F.D. Bushman, DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations. Nucleic Acids Res, 2007. 35(13): p. e91.

Monday, February 11, 2008

Using the Finch Q >20 plots to evaluate your data


All of the Finch systems (Solutions Finch, FinchLab, and iFinch) have a folder report with visual snapshots that summarize the quality of data in that folder. The Q20 histogram plot is one of those tools, and in the next two posts I'll describe what we can learn from these plots.


First, we'll talk about the values on the x axis. When we use the term "Q>20 bases," we're referring to the number of bases in a read that have a quality value greater than 20. If a base has a quality value of 20, there is a 1 in 100 chance that the base has been misidentified. We use the Q20 value to mark a threshold point where a base has an acceptable quality value.

Histogram plots work by consolidating data that fit into a certain range. In the graph above, you can see that on the x axis, we show groups of reads. The first group contains reads that have less than 50 good (Q > 20) bases. The next group contains reads that have between 50 and 99 good bases, next 100 to 149, and so on.

On the y axis, we show the number of reads that fall into each group. In this graph, we have almost 30 reads that have over 950 good quality bases.
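A minimal sketch of how such a histogram could be built is shown below. The per-read quality lists are made-up stand-ins for data parsed from a Finch folder or a FASTQ file; the 50-base bin width matches the grouping described above.

```python
from collections import Counter

def q20_histogram(reads_quals, bin_size=50):
    """Count Q>20 bases per read, then bin the reads in 50-base groups."""
    bins = Counter()
    for quals in reads_quals:
        good = sum(1 for q in quals if q > 20)      # "good" bases: quality value greater than 20
        bins[(good // bin_size) * bin_size] += 1
    for start in sorted(bins):
        print(f"{start}-{start + bin_size - 1} good bases: {bins[start]} reads")

# Made-up reads with uniform quality values, for illustration only.
example_reads = [[25] * 120, [30] * 480, [15] * 700, [35] * 960]
q20_histogram(example_reads)
```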

Uhmm, uhmm, uhhmmm, good sequence data, just the stuff I like to see.