Friday, October 23, 2009

Yardsticks and Sequencers

A recent question to the ABRF discussion forum, about quality values and Helicos data, led to an interesting conversation about having yardsticks to compare between Next Generation Sequencing (NGS) platforms and the common assays that are run on those platforms.

It also got me thinking, just how well can you measure things with those free wooden yardsticks you get at hardware stores and home shows?


The conversation started with a question asking about what kind of quality scoring system could be applied to Helicos data. Could something similar to Phred and AB files be used?

A couple of answers were provided. One referred to the recent Helicos article in Nature Biotechnology and pointed out that Helicos has such a method. This answer also addressed the issue that quality values (QVs) need to be tuned for each kind of instrument.

Another answer, from a core lab director with a Helicos instrument, pointed out many more challenges that exist with comparing data from different applications and how software in this area is lacking. He used the metaphor of the yardstick to make the point that researchers need systematic tools and methods to compare data and platforms.

What's in a Yardstick?

I replied to the thread noting that we've been working with data from 454, Illumina GA, SOLiD and Helicos and there are multiple issues that need to be addressed in developing yardsticks to compare data from different instruments for different experiments (or applications).

At one level, there is the instrument and the data that are produced, and the question is: can we have a standard quality measure? With Phred, we need to recall that each instrument needed to be calibrated so that quality values would be useful and equivalent across chemistries and platforms (primers, terminators, BigDye, gel, cap, AB models, MegaBACE ...). Remember phredpar.dat? Because the data were of a common type - an electropherogram - we could more or less use a single tool and define a standard. Even then, other tools (LifeTrace, KB basecaller, and LongTrace) emerged and computed standardized quality values differently. So, I would argue that we think we have a measure, but it is not the standard we think it is.

By analogy, each NGS instrument uses a very different method to generate sequences, so each platform will have a unique error profile. The good news is that quality values, as transformed error probabilities, make it possible to compare output from different instruments in terms of confidence. The bad news is that if you do not know how the error probability is computed, or you do not have enough data (control, test) to calibrate the system, error probabilities are not useful. Add to that the fact that the platforms are undergoing rapid change as vendors improve chemistry and change hardware and software to increase throughput and accuracy. So, for the time being we might have yardsticks, but they have variable lengths.
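To make the calibration point concrete, here is a minimal Python sketch of how one might check reported QVs against control data. It assumes you already have, for every base call in a control run, the QV the platform reported and whether that call disagreed with the known reference. The function name `empirical_qv_by_bin` and the input layout are hypothetical, chosen for illustration; they are not part of Phred or any vendor's software.

```python
import math
from collections import defaultdict


def empirical_qv_by_bin(calls):
    """Compare reported QVs with observed error rates.

    `calls` is an iterable of (reported_qv, is_error) pairs taken from
    control data (an assumed input format). Returns a dict mapping each
    reported QV to (number of calls, number of errors, observed QV).
    The observed QV is None when a bin has no errors, because the error
    rate cannot be estimated from that bin.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_qv, is_error in calls:
        totals[reported_qv] += 1
        if is_error:
            errors[reported_qv] += 1

    table = {}
    for qv in sorted(totals):
        n = totals[qv]
        e = errors[qv]
        # Same -10*log10 transformation, applied to the observed error rate.
        observed = -10.0 * math.log10(e / n) if e else None
        table[qv] = (n, e, observed)
    return table
```

If the observed QV in a bin sits well below the reported QV, the yardstick is shorter than advertised for that platform and chemistry. Bins with few calls give noisy estimates, which is the "not enough data" problem in practice.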

The next levels deal with experiments. As noted, ChIP-Seq, RNA-Seq, Me-Seq, Re-Seq, and your favorite-Seq all measure different things, and we are just learning how errors and other artifacts interfere with how well the data produced actually measure what the experiment intended to measure. Experiment-level methods need to be developed so that ChIP-Seq from one platform can be compared to ChIP-Seq from another platform, and so on. However, the situation is not dire because, in the end, DNA sequences are the final output, and for many purposes the data produced are much better now than they have been in the past. As we push sensitivity, the issues already discussed become very relevant.

As a last point, the goal many researchers will have is to layer data from one experiment on another experiment, correlating ChIP-Seq with RNA-Seq for example. To do that, you not only need quality measures for data, sample, and experiment, you also need ways to integrate all of this experimental information with already published data. There is a significant software challenge ahead and, as pointed out, cobbling solutions together is not a feasible long-term answer. The datasets are getting too big and complex, and at the same time the archives are bursting with data generated by others.

So what does this have to do with yardsticks?

Back to yardsticks. Those cheap wooden yardsticks expand and contract with temperature and humidity, so at different times a yardstick's measurements will change. This change is the uncertainty of the measurement (see additional reading below), which defines the precision of our measuring device. If I want a quick estimate of how tall my dog stands, I would happily use the wooden yardstick. However, if I want to measure something to within a 32nd of an inch or a millimeter, I would use a different tool. The same rules apply to DNA sequencing. For many purposes the reads are good enough and data redundancy overcomes errors, but as we push sensitivity and want to measure changes in fewer molecules, discussions about how to compute QVs and annotate data, so that we know which measuring device was used, become very important.

Finally, I often see QVs referred to as Phred scores in the literature and company brochures, and hear the same in conversation. Remember: Only Phred makes Phred QVs - everything else is Phred-like, but only if it is a -10*log10(P) transformation of an error probability.
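For readers who want the transformation spelled out, here is a small Python sketch of the Phred-like conversion in both directions. The function names are illustrative only. Note that the code says nothing about where the error probability comes from; estimating it is the platform-specific part that makes a QV trustworthy or not.

```python
import math


def error_prob_to_qv(p_error):
    """Phred-like quality value: QV = -10 * log10(P), for 0 < P <= 1."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def qv_to_error_prob(qv):
    """Inverse transformation: P = 10 ** (-QV / 10)."""
    return 10.0 ** (-qv / 10.0)
```

Under this scale an error probability of 1 in 1,000 corresponds to a QV of 30, and a QV of 20 corresponds to an error probability of 1 in 100.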

Additional Reading:

Color Space, Flow Space, Sequence Space, or Outer Space: Part I. Uncertainty in DNA Sequencing

Color Space, Flow Space, Sequence Space or Outer Space: Part II, Uncertainty in Next Gen Data

Monday, October 19, 2009

Sneak Peek: Gene Expression Profiling in Cancer Research: Microarrays, Tag Profiling and Whole Transcriptome Analysis

Join us this Wednesday (October 21, 2009 10:00 am Pacific Daylight Time) to learn about how GeneSifter is used to measure transcript abundance as well as discover novel transcripts and isoforms of expressed genes in cancer.


Current gene expression technologies such as Microarrays and Next Generation Sequencing applications allow biomedical researchers to examine the expression of tens of thousands of genes at once, giving researchers the opportunity to examine expression for an entire genome, where previously they could only look at a handful of genes at one time.

In addition, NGS applications such as Tag Profiling and Whole Transcriptome Analysis can identify novel transcripts and characterize both known and novel splice junctions. These applications allow characterization of the cancer transcriptome at an unprecedented level.

This presentation will provide an overview of the gene expression data analysis process for these applications, with an emphasis on identification of differentially expressed genes, identification of novel transcripts, and characterization of alternative splicing. Using data drawn from the GEO data repository and the Short Read Archive, gene expression in melanoma will be examined using Microarrays, NGS Tag Profiling and NGS Whole Transcriptome Analysis data.

Tuesday, October 13, 2009

Super Computing 09 and BioHDF

Next month, Nov 16-20, we will be in Portland for Super Computing 09 - SC09. Join us at a Birds of a Feather (BoF) session to learn about developing bioinformatics applications with BioHDF. The session will be Wed. Nov 18 at 12:15 pm in room D139-140.

Developing Bioinformatics Applications with BioHDF

In this session we will present how HDF5 can be used to work with large volumes of DNA sequence data. We will cover the current state of bioinformatics tools that utilize HDF5 and proposed extensions to the HDF5 library to create BioHDF. The session will include a discussion of requirements that are being considered to develop data models for working with DNA sequence alignments to measure variation within sets of DNA sequences.

HDF5 is an open-source technology suite for managing diverse, complex, high-volume data in heterogeneous computing and storage environments. The BioHDF project is investigating the use of HDF5 for working with very large scientific datasets. HDF5 provides a hierarchical data model, binary file format, and collection of APIs supporting data access. BioHDF will extend HDF5 to support DNA sequencing requirements.

Initial prototyping of BioHDF has demonstrated clear benefits. Data can be compressed and indexed in BioHDF to reduce storage needs and enable very rapid (typically, a few milliseconds) random access into these sequence and alignment datasets, essentially independent of the overall HDF5 file size. Through additional prototyping activities we have identified key architectural elements and tools that will form BioHDF.

The BoF session will include a presentation of the current state of BioHDF and proposed implementations to encourage discussion of future directions.

Thursday, October 8, 2009

Resequencing and Cancer

Yesterday we released news about new funding from NIH for a project to work on ways to improve how variations between DNA sequences are detected using Next Generation Sequencing (NGS) technology. The project emphasizes detecting rare variation events to improve cancer diagnostics, but the work will support a diverse range of resequencing applications.

Why is this important?

In October 2008, the U.S. News and World Report published an article by Bernadine Healy, former head of NIH. The tag line “understanding the genetic underpinnings of cancer is a giant step toward personalized medicine,” (1) underscores how the popular press views the promise of recent advances in genomics technology in general, and the progress toward understanding the molecular basis of cancer in particular. In the article, Healy presents a scenario where, in 2040, a 45-year-old woman, who has never smoked, develops lung cancer. She undergoes outpatient surgery, and her doctors quickly scrutinize the tumor’s genes and use a desktop computer to analyze the tumor genomes and medical records to create a treatment plan. She is treated, the cancer recedes, and subsequent checkups are conducted to monitor tumor recurrence. Should a tumor be detected, her doctor would quickly analyze the DNA of a few of the shed tumor cells and prescribe a suitable next round of therapy. The patient lives a long happy life, and keeps her hair.

This vision of successive treatments based on genomic information is not unrealistic, claims Healy, because we have learned that while many cancers can look homogeneous in terms of morphology and malignancy they are indeed highly complex and varied when examined at the genetic level. The disease of cancer is in reality a collection of heterogeneous diseases that, even for common tissues like the prostate, can vary significantly in terms of onset and severity. Thus, it is often the case that cancer treatments, based on tissue type, fail, leaving patients to undergo a long painful process of trial and error therapies with multiple severely toxic compounds.

Because cancer is a disease of genomic alterations, understanding the sources, causes, and kinds of mutations, their connection to specific types of cancer, and how they may predict tumor growth is worthwhile. The human cancer genome project (2) and initiatives like the international cancer genome consortium (3) have demonstrated this concept. The kinds of mutations found in tumor populations thus far by NGS include single nucleotide polymorphisms (SNPs), insertions and deletions, and small structural copy number variations (CNVs) (4, 5). From early studies it is clear that a greater amount of genomic information will be needed to make Healy's scenario a reality. Next generation sequencing (NGS) technologies will drive this next phase of research and enable our deeper understanding.

Project Synopsis

The great potential for the clinical applications of new DNA sequencing technologies comes from their highly sensitive ability to assess genetic variation. However, to make these technologies clinically feasible, we must assay patient samples at far higher rates than can be done with current NGS procedures. Today, the experiments applying NGS in cancer research have investigated small numbers of samples in great detail, in some cases comparing entire genomes from tumor and normal cells from a single patient (6-8). These experiments show that when a region is sequenced with sufficient coverage, numerous mutations can be identified.

To move NGS technologies into clinical use, many costs must decrease. Two ways costs can be lowered are to increase sample density and to reduce the number of reads needed per sample. Because cost is a function of turnaround time and read coverage, and read coverage is a function of the signal-to-noise ratio, assays with higher background noise, due to errors in the data, will require higher sampling rates to detect true variation and will be more expensive. To put this in context, future cancer diagnostic assays will likely need to look at over 4000 exons per test. In cases like bladder cancer, or cancers where stool or blood are sampled, non-invasive tests will need to detect variations in one out of 1000 cells. Thus it is extremely important that we understand signal-to-noise ratios and are able to calculate read depth in a reliable fashion.
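The relationship between background noise and required sampling can be illustrated with a simple binomial sketch in Python. This is a toy model under stated assumptions, not the project's method: reads at a position are treated as independent, `error_rate` is a hypothetical per-read probability that an error produces the variant base, and a variant is "called" when at least `k` reads support it, where `k` is the smallest count that noise alone reaches with probability at most `alpha`. All function names are made up for illustration.

```python
import math


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), 0 < p < 1, summed in log space."""
    below = 0.0
    for i in range(k):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        below += math.exp(log_pmf)
    return max(0.0, 1.0 - below)


def min_variant_reads(depth, error_rate, alpha):
    """Smallest k such that errors alone give >= k variant reads
    with probability at most alpha (assumed noise model)."""
    k = 1
    while binom_tail(depth, k, error_rate) > alpha:
        k += 1
    return k


def detection_power(depth, variant_fraction, error_rate, alpha=0.01):
    """Return (k, power): the read-count threshold and the probability
    of reaching it when the variant is truly present at variant_fraction."""
    k = min_variant_reads(depth, error_rate, alpha)
    # A read shows the variant if it carries it, or if a reference
    # read is miscalled as the variant.
    p_observed = variant_fraction + (1.0 - variant_fraction) * error_rate
    return k, binom_tail(depth, k, p_observed)
```

Calling `detection_power(depth, 0.001, error_rate)` over a range of depths and error rates shows the trade-off described above: as the error rate approaches the variant fraction, the threshold `k` rises and far more reads are needed to reach the same power.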

Currently we have a limited understanding of how many reads are needed to detect a given rare mutation. Detecting mutations depends on a combination of sequencing accuracy and depth of coverage. The ratio of signal (true mutations) to noise (false mutations, hidden mutations) depends on how many times we see a correct result. Sequencing accuracy is affected by multiple factors that include sample preparation, sequence context, sequencing chemistry, instrument accuracy, and basecalling software. The current depth-of-coverage calculations are based on an assumption that sampling is random, which is not valid in the real world. Corrections will have to be applied to adjust for real-world sampling biases that affect read recovery and sequencing error rates (9-11).
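For reference, the random-sampling assumption is usually expressed as a Poisson model of per-base coverage. The Python sketch below shows that idealized calculation only; the function names are illustrative, and real data will deviate from it for exactly the reasons given above (biases in read recovery and in error rates).

```python
import math


def mean_coverage(num_reads, read_length, target_size):
    """Average depth if reads land uniformly at random on the target."""
    return num_reads * read_length / target_size


def fraction_covered_at_least(mean_cov, min_depth):
    """Expected fraction of bases with depth >= min_depth, assuming
    per-base depth follows a Poisson distribution with mean mean_cov > 0."""
    below = 0.0
    for i in range(min_depth):
        # Poisson pmf computed in log space to avoid overflow.
        log_pmf = -mean_cov + i * math.log(mean_cov) - math.lgamma(i + 1)
        below += math.exp(log_pmf)
    return max(0.0, 1.0 - below)
```

Comparing this prediction with the depth distribution actually observed in an experiment is one simple way to see how far a data set departs from random sampling.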

Developing clinical software systems that can work with NGS technologies, to quickly and accurately detect rare mutations, requires a deep understanding of the factors that affect NGS data collection and interpretation. This information needs to be integrated into decision control systems that can, through a combination of computation and graphical displays, automate and aid a clinician’s ability to verify and validate results. Developing such systems is a major undertaking involving a combination of research and development in the areas of laboratory experimentation, computational biology, and software development.

Positioned for Success

Detecting small genetic changes in clinical samples is ambitious. Fortunately, Geospiza has the right products to deliver on the goals of the research. GeneSifter Lab Edition handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. The data management and analysis server make the system scalable through a distributed architecture. Combined with GeneSifter Analysis Edition, a complete platform is created to rapidly prototype new data analysis workflows needed to test new analysis methods, experiment with new data representations, and iteratively develop data models to integrate results with experimental details.


Press Release: Geospiza Awarded SBIR Grant for Software Systems for Detecting Rare Mutations

1. Healy, 2008. "Breaking Cancer's Gene Code - US News and World Report"

2. Working Group, 2005. "Recommendation for a Human Cancer Genome Project"

3. ICGC, 2008. "International Cancer Genome Consortium - Goals, Structure, Policies & Guidelines - April 2008"

4. Jones S., et al., 2008. "Core Signaling Pathways in Human Pancreatic Cancers Revealed by Global Genomic Analyses." Science 321, 1801.

5. Parsons D.W., et al., 2008. "An Integrated Genomic Analysis of Human Glioblastoma Multiforme." Science 321, 1807.

6. Campbell P.J., et al., 2008. "Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing." Proc Natl Acad Sci U S A 105, 13081-13086.

7. Greenman C., et al., 2007. "Patterns of somatic mutation in human cancer genomes." Nature 446, 153-158.

8. Ley T.J., et al., 2008. "DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome." Nature 456, 66-72.

9. Craig D.W., et al., 2008. "Identification of genetic variants using bar-coded multiplexed sequencing." Nat Methods 5, 887-893.

10. Ennis P.D., et al., 1990. "Rapid cloning of HLA-A,B cDNA by using the polymerase chain reaction: frequency and nature of errors produced in amplification." Proc Natl Acad Sci U S A 87, 2833-2837.

11. Reiss J., et al., 1990. "The effect of replication errors on the mismatch analysis of PCR-amplified DNA." Nucleic Acids Res 18, 973-978.

Tuesday, October 6, 2009

From Blue Gene to Blue Genome? Big Blue Jumps In with DNA Transistors

Today, IBM announced that it is getting into the DNA sequencing business and the race for the $1,000 genome by winning a research grant to explore new sequencing technology based on nanopore devices they call DNA transistors.

IBM news travels fast. Genome Web and The Daily Scan covered the high points and Genetic Future presented a skeptical analysis of the news. You can read the original news at the IBM site, and enjoy a couple of videos.

A NY Times article listed a couple of facts that I thought were interesting: First, IBM becomes the 17th company to pursue the next next-gen (or third-generation) technology. Second, according to George Church, in the past five years the cost of collecting DNA sequence data has decreased by 10-fold annually and is expected to continue decreasing at a similar pace for the next few years.

But what does this all mean?

It is clear from this and other news that DNA sequencing is fast becoming a standard way to study genomes, measure gene expression, and measure genetic variation. It is also clear that while the cost of DNA sequencing is decreasing at a fast rate, the amount of data being produced is increasing at a similarly fast rate.

While some of the articles above discussed the technical hurdles nanopore sequencing must overcome, none discussed the real challenges researchers face today with using the data. The fact is, for most groups, the current next-gen sequencers are underutilized because the volumes of data combined with the complexity of data analysis have created a significant bioinformatics bottleneck.

Fortunately, Geospiza is clearing data analysis barriers by delivering access to systems that provide standard ways of working with the data and visualizing results. For many NGS applications, groups can upload their data to our servers, align reads to reference data sources, and compare the resulting output across multiple samples in efficient and cost effective processes.

And, because we are meeting the data analysis challenges for all of the current NGS platforms, we'll be ready for whatever comes next.