Different levels of quality
Before we can understand data quality we need to understand what sequencing experiments measure and how the data are collected. In addition to sequencing genomes, many NGS experiments focus on measuring gene function and regulation by sequencing the fragments of DNA and RNA isolated and prepared in different ways. In these assays, complex laboratory procedures are followed to create specialized DNA libraries that are then sequenced in a massively parallel format.
In each step of the analysis process, the data and information produced can be further analyzed to get multiple levels of quality information that reflect how well instruments performed, whether processes worked, and whether samples were pure. We can group quality analyses into three general levels: quality value (QV) analysis, sequence characteristics, and alignment information.
Quality Value Analysis
Many data quality control (QC) methods are derived from Sanger sequencing, where quality values (QVs) could be used to identify low-quality regions that could indicate mixtures of molecules or define areas that should be removed before analysis. QV correlations with base composition could also be used to sort out systematic biases, like high GC content, that affect data quality. Unlike Sanger sequencing, where data in a trace represent an average of signals produced by an ensemble of molecules, NGS provides single data points collected from individual molecules arrayed on a surface. NGS QV analysis uses counting statistics to summarize the individual values collected from the several million reads produced by each experiment.
QV analysis primarily measures sequencing and instrument run quality. For example, sharp drop offs in average QVs can identify systematic issues related to the sequencing reactions. Comparing data between lanes or chambers within a flowcell can flag problems with reagent flow or imaging issues within the instrument. In more detailed analyses, the coordinate positions for each read can be used to reconstruct quality patterns for very small regions (tiles) within a slide to reveal subtle features about the instrument run.
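As a minimal sketch of the counting approach described above, the following Python averages QVs at each read position and reports the first sharp drop-off. It assumes quality strings encoded as Phred+33 ASCII (as in many FASTQ files); the function names and the 10-point drop threshold are illustrative choices, not part of any particular pipeline.

```python
from typing import List, Optional


def mean_qv_per_position(quality_strings: List[str], offset: int = 33) -> List[float]:
    """Average quality value at each read position (sequencing cycle).

    Assumes ASCII-encoded qualities with the given offset (Phred+33 by default).
    Reads may have different lengths; each position is averaged over the
    reads that actually reach it.
    """
    totals: List[int] = []
    counts: List[int] = []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read long enough to reach position i
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def first_drop_off(means: List[float], max_drop: float = 10.0) -> Optional[int]:
    """Index of the first position whose mean QV falls by more than
    max_drop relative to the previous position, or None if there is none.
    The threshold is an arbitrary illustrative value."""
    for i in range(1, len(means)):
        if means[i - 1] - means[i] > max_drop:
            return i
    return None


if __name__ == "__main__":
    quals = ["IIII5", "IIII5"]  # 'I' encodes QV 40, '5' encodes QV 20
    means = mean_qv_per_position(quals)
    print(means)                  # [40.0, 40.0, 40.0, 40.0, 20.0]
    print(first_drop_off(means))  # 4
```

The same per-position summary could be computed separately for each lane or tile to compare regions of a run, provided the read identifiers carry that coordinate information.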
In addition to QV analysis, we can look at the sequences of the reads to get additional information. If we simply count the numbers of A’s, C’s, G’s, and T’s (or color values) at each base position, we can observe sequence biases in our dataset. Adaptor sequences, for example, will show sharp spikes in the data, whereas random sequences will give us an even distribution, or a bias that reflects the GC content of the organism being analyzed. Single-stranded data will often show a separation of the individual base lines; double-stranded coverage should have equal numbers of A and T bases, and equal numbers of G and C bases.
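The per-position base count can be sketched in a few lines of Python. This is an illustrative example that assumes reads are plain base-space strings (not color space); the function name `base_composition` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def base_composition(reads: List[str]) -> List[Dict[str, float]]:
    """Fraction of A, C, G, T, and N at each read position.

    Positions dominated by one base (a sharp spike) can point to adaptor
    or other non-random sequence; roughly flat lines suggest random input.
    """
    counts: List[Counter] = []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counts):  # first read long enough to reach position i
                counts.append(Counter())
            counts[i][base] += 1

    composition = []
    for c in counts:
        total = sum(c.values())
        composition.append({b: c[b] / total for b in "ACGTN"})
    return composition


if __name__ == "__main__":
    reads = ["ACGT", "ACGA", "TCGA", "ACGT"]
    for pos, fractions in enumerate(base_composition(reads)):
        print(pos, fractions)
```

In this toy input every read has C at position 1 and G at position 2, which is the kind of spike a shared adaptor sequence would produce.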
Additional sample and data quality can be measured after secondary analysis. Once the reads are aligned (mapped) to reference data sources, we can ask questions that reflect both run and sample quality. The overall number of reads that can be aligned to all sources can also be used to estimate parameters related to library preparation and deposition of the molecules on the beads or slides used for sequencing. Current NGS processes are based on probabilistic methods for separating DNA molecules. Illumina, SOLiD, and 454 all differ with respect to their separation methods, but share the common feature that the highest data yield occurs when the concentration of DNA is just right. The number of mappable reads can measure this property.
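One way to compute the fraction of mappable reads is sketched below. It assumes the aligner writes SAM-format records, which the text does not specify; in SAM, bit 0x4 of the FLAG column marks an unmapped read, and bits 0x100 and 0x800 mark secondary and supplementary records, which are skipped here so each read is counted once. The function name is hypothetical.

```python
from typing import Iterable


def mapped_fraction(sam_lines: Iterable[str]) -> float:
    """Fraction of primary SAM records that are mapped.

    Header lines (starting with '@') and blank lines are ignored.
    Returns 0.0 when there are no records.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & (0x100 | 0x800):       # skip secondary/supplementary records
            continue
        total += 1
        if not flag & 0x4:               # 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0


if __name__ == "__main__":
    records = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r3\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r3\t256\tchr2\t300\t0\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    print(round(mapped_fraction(records), 3))  # 0.667
```

Tracking this fraction from run to run gives a simple number to compare against the loading concentration used for each library.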
Read mapping, however, is not a complete way to measure data quality. High-quality reads that do not match any reference data in the analysis pipeline could be from unknown laboratory contaminants, sequences like novel viruses or phage, or incomplete reference data. Unfortunately, the first case (laboratory contamination) is the most common, so it is a good idea to include reference data for all ongoing projects in the analysis pipeline. Alignments to adaptor sequences can reveal issues related to preparation processes and PCR, and the positions of alignments can be used to measure DNA or RNA fragment lengths.
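For paired-end data, fragment lengths can be read from alignment positions. The sketch below again assumes SAM-format output, where the ninth column (TLEN) holds the observed template length, positive for the leftmost read of a pair and negative for the rightmost. Only properly paired primary records (FLAG bit 0x2) with a positive TLEN are kept, so each pair is counted once. The function name and example records are made up for illustration.

```python
import statistics
from typing import Iterable, List


def fragment_lengths(sam_lines: Iterable[str]) -> List[int]:
    """Template lengths from properly paired, primary SAM records."""
    lengths: List[int] = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # FLAG column
        tlen = int(fields[8])   # TLEN column
        is_proper_pair = bool(flag & 0x2)
        is_primary = not flag & (0x100 | 0x800)
        if is_proper_pair and is_primary and tlen > 0:
            lengths.append(tlen)
    return lengths


if __name__ == "__main__":
    records = [
        "p1\t99\tchr1\t100\t60\t4M\t=\t346\t250\tACGT\tIIII",
        "p1\t147\tchr1\t346\t60\t4M\t=\t100\t-250\tACGT\tIIII",
        "p2\t99\tchr1\t500\t60\t4M\t=\t796\t300\tACGT\tIIII",
        "p2\t147\tchr1\t796\t60\t4M\t=\t500\t-300\tACGT\tIIII",
    ]
    lengths = fragment_lengths(records)
    print(lengths)                     # [250, 300]
    print(statistics.median(lengths))  # 275.0
```

A median far from the size selected during library preparation would be a reason to look more closely at the fragmentation or size-selection steps.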
So Many Questions
The above examples provide a short tour of how NGS data can be analyzed to measure the quality of samples, experiments, protocols, and instrument performance. NGS assays are complex and involve multistep lab procedures and data analysis pipelines that are specific to different kinds of applications. Sequence bases and their quality values provide information about instrument runs and some insights into sample and preparation quality. Additional information is obtained after the data are aligned to multiple reference data sources. Data quality analysis is most useful when values are computed shortly after data are collected, and systems that automate these analyses, like GeneSifter Lab and Analysis Editions, are important investments if labs plan to be successful with their NGS experiments.