Tuesday, June 29, 2010

GeneSifter Lab Edition 3.15

Last week we released GeneSifter Laboratory Edition (GSLE) 3.15. From NGS quality control, to improved microarray support, to Sanger sequencing support, to core lab branding, this release delivers a host of features and improvements that continue to make GSLE the leading LIMS for genetic analysis.

The three big features are QC analysis of Next Generation Sequencing (NGS) data, QC analysis of microarrays, and core lab branding support.
  • To better troubleshoot runs and view data quality for individual samples in a multiplex, the data within fastq, fasta, or csfasta (and quality) files are used to generate quality report graphics (figure below). These include overall base (color) composition; average per-base quality values (QVs); box and whisker plots showing the median, lower and upper quartiles, and minimum and maximum QVs at each base position; and an error analysis indicating the number of QVs below 10, 20, and 30. A link is also provided to conveniently view the sequence data in pages, so that GBs of data do not stream into your browser.
  • For microarray labs, quality information from CHP and CEL files, as well as probe intensity data from CEL files, is displayed. Please contact support@geospiza.com to activate the Affymetrix Settings and configure the CDF file path and power tools.
  • For labs that use their own ordering systems, a GSLE data view page has been created that can be embedded in a core lab website. To support end user access, a new user role, Data Viewer, has been created that limits access to viewing folders and data within the user's lab group. Please contact support@geospiza.com to activate the feature.
  • The ability to create HTML tables in the Welcome Message for the Home page has been restored, providing additional message formatting capabilities.

Laboratory Operations
Several features and improvements introduced in 3.15 help prioritize steps, update items, improve ease of use, and enhance data handling.

Instrument Runs
  • A time/date stamp has been added to the Instrument Run Details page to make it easy to see when runs were completed.
  • Partial Sanger (CE) runs can be manually completed (like NGS runs), instead of requiring that all reactions be complete or that the remaining reactions be failed.
  • The NGS directory view of result files now provides deletion actions (by privileged users) so labs can more easily manage disk usage. 
Sample Handling
  • Barcodes can be recycled or reused for archived templates, to better support labs using multiple lots of 2D barcode tubes. However, the template barcode field remains unique for active templates.
  • Run Ready Order Forms allow a default Plate Label tag to populate the auto-generated Instrument Run Name, making Sanger run setup quicker.
  • The Upload Location Map action has been moved to the side menu bar under Lab Setup to ease navigation. 
  • The Template Workflows “Transition to the Next Workflow” action is now in English: “Enter Next Workflow.”
  • All Sanger chromatogram download options are easier to see and now include the option to download .phd formatted files. 
  • The DNA template location field can be used to search for a reaction in a plate when creating a reaction plate.
  • To redo a Sanger reaction with a different chemistry, the chemistry can now be changed when either Requeuing for Reacting is chosen, or Edit Reactions from within a Reaction Set is selected.
Orders and Invoices 
More efficient views and navigation have been implemented for Orders and Invoices.
  • When Orders are completed, the total number of samples and the number of results can be compared on the Update Order Status page to help identify repeated reactions. 
  • A left-hand navigation link has been added for core lab customers to review both Submitted and Complete Invoices. The link is only active when invoicing is turned on in the settings. 
System Management 
Several new system settings now enable GSLE to be more adaptable at customer sites.
  • The top header bar time zone display can be disabled or configured for a unique time zone to support labs with customers in different time zones. 
  • The User account profile can be configured to require certain fields. In addition, if Lab Group is not required, then Lab Groups are created automatically. 
  • Projects within GSLE can be inactivated by all user roles to hide data not being used. 
Application Programming Interface
Several additions to the self-documenting Application Programming Interface (API) have been made.
  • An upload option for Charge Codes within the Invoice feature was added.
  • Form API response objects are now more consistent.
  • API keys for user accounts can be generated in bulk.
  • Primers can be identified by either label or ID.
  • Events have been added. Events provide a mechanism to call scripts or send emails (beyond the current defaults) when system objects undergo workflow changes.  
Presently, APIs can only be activated on local, on-site installations.

Friday, June 11, 2010

Levels of Quality

Next Generation Sequencing (NGS) data can produce more questions than answers. A recent LinkedIn discussion thread began with a simple question. “I would like to know how to obtain statistical analysis of data in a fastq file? number of High quality reads, "bad" reads....” This simple question opened a conversation about quality values, read mapping, and assembly. Obviously there is more to NGS data quality than simply separating bad reads from good ones.

Different levels of quality
Before we can understand data quality we need to understand what sequencing experiments measure and how the data are collected. In addition to sequencing genomes, many NGS experiments focus on measuring gene function and regulation by sequencing the fragments of DNA and RNA isolated and prepared in different ways. In these assays, complex laboratory procedures are followed to create specialized DNA libraries that are then sequenced in a massively parallel format.

Once the data are collected, they need to be analyzed in both common and specialized ways, as defined by the particular application. The first step (primary analysis) converts the image data produced by the different platforms into sequence data (reads). This step, specific to each sequencing platform, also produces a series of quality values (QVs), one value per base in a read, that encode the probability that the base call is incorrect. Next (secondary analysis), the reads and bases are aligned to known reference sequences, or, in the case of de novo sequencing, the data are assembled into contiguous units from which a consensus sequence is determined. The final steps (tertiary analysis) involve comparing alignments between samples or searching databases to get scientific meaning from the data.
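
For reference, the QVs produced in primary analysis are typically reported on the Phred scale, where a quality value maps to an error probability as QV = -10 log10(P error); a QV of 20 therefore corresponds to a 1-in-100 chance that the base call is wrong. A minimal sketch of the conversion, in Python for illustration:

    # Convert Phred-scaled quality values to base-calling error probabilities.
    def qv_to_error_probability(qv):
        """QV = -10 * log10(P_error), so P_error = 10 ** (-QV / 10)."""
        return 10 ** (-qv / 10.0)

    for qv in (10, 20, 30):
        print(f"QV {qv}: error probability {qv_to_error_probability(qv):.3f}")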

In each step of the analysis process, the data and information produced can be further analyzed to get multiple levels of quality information that reflect how well instruments performed, if processes worked, or whether samples were pure. We can group quality analyses into three general levels: QV analysis, sequence characteristics, and alignment information.

Quality Value Analysis
Many data quality control (QC) methods are derived from Sanger sequencing, where QVs could be used to identify low quality regions that might indicate mixtures of molecules, or to define areas that should be removed before analysis. QV correlations with base composition could also be used to sort out systematic biases, like high GC content, that affect data quality. Unlike Sanger sequencing, where the data in a trace represent an average of signals produced by an ensemble of molecules, NGS provides single data points collected from individual molecules arrayed on a surface. NGS QV analysis therefore uses counting statistics to summarize the individual values collected from the several million reads produced by each experiment.

Examples of useful counting statistics include average QVs by base position, box and whisker (BW) plots, histogram plots of QV thresholds, and overall QV distributions. Average QVs by base, BW plots, and QV thresholds are used to see how QVs trend across the reads. In most cases, these plots show the general trend that data quality decreases toward the 3’ ends of reads. Average QVs by base show each base’s QV with error bars indicating the values that are within one standard deviation of the mean. BW plots provide additional detail, showing the minimum and maximum QVs, the median QV, and the lower and upper quartile QVs for each base. Histogram plots of QV thresholds count the number of bases below threshold QVs (10, 20, 30). This method provides information about potential errors in the data and its utility in applications like RNA-Seq or genotyping. Finally, distributions of all QVs, or of the average QV per read, can give additional indications of dataset quality.
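
To make these statistics concrete, here is a minimal sketch of how they can be computed, assuming a standard four-line-record FASTQ file (the file name reads.fastq is hypothetical) with Phred+33 encoded quality strings; some older Illumina pipelines used a +64 offset instead:

    from statistics import mean, quantiles

    def read_qvs(path, offset=33):
        """Yield one list of per-base quality values per read."""
        with open(path) as fh:
            for i, line in enumerate(fh):
                if i % 4 == 3:  # the quality line of each four-line record
                    yield [ord(c) - offset for c in line.rstrip()]

    by_position = {}               # base position -> QVs observed there
    below = {10: 0, 20: 0, 30: 0}  # bases under each threshold QV
    for qvs in read_qvs("reads.fastq"):
        for pos, qv in enumerate(qvs):
            by_position.setdefault(pos, []).append(qv)
            for t in below:
                if qv < t:
                    below[t] += 1

    # Per-position summaries: min, lower quartile, median, upper quartile,
    # max, and mean. (A production version would stream summaries rather
    # than hold every QV in memory.)
    for pos in sorted(by_position):
        vals = by_position[pos]
        q1, q2, q3 = quantiles(vals, n=4)
        print(pos + 1, min(vals), q1, q2, q3, max(vals), round(mean(vals), 1))
    print("bases below QV 10/20/30:", below)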

QV analysis primarily measures sequencing and instrument run quality. For example, sharp drop-offs in average QVs can identify systematic issues related to the sequencing reactions. Comparing data between lanes or chambers within a flowcell can flag problems with reagent flow or imaging issues within the instrument. In more detailed analyses, the coordinate positions of each read can be used to reconstruct quality patterns for very small regions (tiles) within a slide, revealing subtle features of the instrument run.

Sequence Characteristics
In addition to QV analysis, we can look at the sequences of the reads to get additional information. If we simply count the numbers of A’s, C’s, G’s, and T’s (or color values) at each base position, we can observe sequence biases in our dataset. Adaptor sequences, for example, will show sharp spikes in the data, whereas random sequences will give an even distribution, or a bias that reflects the GC content of the organism being analyzed. Single-stranded data will often show a separation of the individual base lines; double-stranded coverage should have equal numbers of AT and GC bases.
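
A sketch of this kind of composition count, using the same four-line FASTQ assumption (reads.fastq is again hypothetical); an adaptor spike would appear as one base’s fraction approaching 1.0 at a given position:

    from collections import Counter

    def read_seqs(path):
        """Yield the sequence line of each four-line FASTQ record."""
        with open(path) as fh:
            for i, line in enumerate(fh):
                if i % 4 == 1:
                    yield line.rstrip().upper()

    composition = {}  # base position -> Counter of A/C/G/T/N
    for seq in read_seqs("reads.fastq"):
        for pos, base in enumerate(seq):
            composition.setdefault(pos, Counter())[base] += 1

    for pos in sorted(composition):
        counts = composition[pos]
        total = sum(counts.values())
        print(pos + 1, {b: round(counts[b] / total, 3) for b in "ACGTN"})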

We can also compare each read to the other reads in the dataset to estimate the overall randomness, or complexity, of our data. Depending on the application, a low complexity dataset, one with a high number of exactly matching reads, can indicate PCR biases or a large number of repeats in the case of de novo sequencing. In other cases, like tag profiling assays, which measure gene expression by sequencing a small fragment from each gene, low complexity data are normal because highly expressed genes will contribute a large number of identical sequences.
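
As a rough illustration, counting exact duplicates gives a simple complexity estimate (again assuming the hypothetical reads.fastq):

    from collections import Counter

    def read_seqs(path):  # same helper as in the previous sketch
        with open(path) as fh:
            for i, line in enumerate(fh):
                if i % 4 == 1:
                    yield line.rstrip().upper()

    counts = Counter(read_seqs("reads.fastq"))
    total = sum(counts.values())
    duplicated = sum(n for n in counts.values() if n > 1)
    print(f"{duplicated / total:.1%} of reads are exact duplicates "
          f"({len(counts)} distinct sequences in {total} reads)")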

Alignment Information
Additional sample and data quality information can be measured after secondary analysis. Once the reads are aligned (mapped) to reference data sources, we can ask questions that reflect both run and sample quality. The overall number of reads that can be aligned to all sources can be used to estimate parameters related to library preparation and the deposition of molecules on the beads or slides used for sequencing. Current NGS processes are based on probabilistic methods for separating DNA molecules. Illumina, SOLiD, and 454 all differ with respect to their separation methods, but share the common feature that the highest data yield occurs when the concentration of DNA is just right. The number of mappable reads can measure this property.
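
As an illustration, the mappable fraction is straightforward to compute from standard alignment output; the sketch below assumes a BAM file (a hypothetical sample.bam) and the pysam library:

    import pysam

    mapped = unmapped = 0
    with pysam.AlignmentFile("sample.bam", "rb") as bam:
        for read in bam.fetch(until_eof=True):  # stream all reads, mapped or not
            if read.is_secondary or read.is_supplementary:
                continue  # count each read once
            if read.is_unmapped:
                unmapped += 1
            else:
                mapped += 1

    total = mapped + unmapped
    print(f"{mapped / total:.1%} of {total} reads mapped")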

DNA concentration measures one aspect of sample quality. Examining which reference sources the reads align to gives further information. For example, the goal of transcriptome analysis is to sequence non-ribosomal RNA. Unfortunately, ribosomal RNA (rRNA) is the most abundant RNA in a cell, so transcriptome assays involve steps to remove it, and a large number of rRNA reads in the data indicates problems with the preparation. In exome sequencing, or other methods where certain DNA fragments are enriched, the ratio of exon (enriched) to non-exon (non-enriched) alignments can reveal how well the purification worked.
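
A sketch of this kind of source breakdown, assuming the mapping reference bundles rRNA contigs under a hypothetical "rRNA_" name prefix (and the same hypothetical sample.bam):

    import pysam
    from collections import Counter

    sources = Counter()
    with pysam.AlignmentFile("sample.bam", "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            ref = bam.get_reference_name(read.reference_id)
            sources["rRNA" if ref.startswith("rRNA_") else "other"] += 1

    total = sum(sources.values())
    print(f"rRNA fraction: {sources['rRNA'] / total:.1%}")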

Read mapping, however, is not a complete way to measure data quality. High quality reads that do not match any reference data in the analysis pipeline could come from unknown laboratory contaminants, from sequences such as novel viruses or phage, or from incomplete reference data. Unfortunately, the first case is the most common, so it is a good idea to include reference data for all ongoing projects in the analysis pipeline. Alignments to adaptor sequences can reveal issues related to preparation processes and PCR, and the positions of paired alignments can be used to measure DNA or RNA fragment lengths.
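
For the fragment-length point, paired-end alignments carry the information directly; a sketch assuming the same hypothetical sample.bam:

    import pysam
    from statistics import mean

    lengths = []
    with pysam.AlignmentFile("sample.bam", "rb") as bam:
        for read in bam.fetch(until_eof=True):
            # template_length is signed; keep one positive value per pair
            if read.is_proper_pair and read.template_length > 0:
                lengths.append(read.template_length)

    if lengths:
        print(f"{len(lengths)} pairs, mean fragment length {mean(lengths):.0f} bp")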

So Many Questions
The above examples provide a short tour of how NGS data can be analyzed to measure the quality of samples, experiments, protocols, and instrument performance. NGS assays are complex, involving multistep lab procedures and data analysis pipelines that are specific to different kinds of applications. Sequenced bases and their quality values provide information about instrument runs and some insight into sample and preparation quality. Additional information is obtained after the data are aligned to multiple reference data sources. Data quality analysis is most useful when values are computed shortly after data are collected, and systems like GeneSifter Lab and Analysis Editions, which automate these analyses, are important investments for labs that plan to be successful with their NGS experiments.