Yesterday we announced new funding from NIH for a project to improve how variations between DNA sequences are detected using Next Generation Sequencing (NGS) technology. The project emphasizes detecting rare variation events to improve cancer diagnostics, but the work will support a diverse range of resequencing applications.
Why is this important?
In October 2008, U.S. News and World Report published an article by Bernadine Healy, former head of NIH. The tag line, “understanding the genetic underpinnings of cancer is a giant step toward personalized medicine” (1), underscores how the popular press views the promise of recent advances in genomics technology in general, and of progress toward understanding the molecular basis of cancer in particular. In the article, Healy presents a scenario in which, in 2040, a 45-year-old woman who has never smoked develops lung cancer. She undergoes outpatient surgery, and her doctors quickly scrutinize the tumor’s genes, using a desktop computer to analyze the tumor genomes and her medical records to create a treatment plan. She is treated, the cancer recedes, and subsequent checkups monitor for tumor recurrence. Should a tumor be detected, her doctor would quickly analyze the DNA of a few of the shed tumor cells and prescribe a suitable next round of therapy. The patient lives a long, happy life, and keeps her hair.
This vision of successive treatments based on genomic information is not unrealistic, claims Healy, because we have learned that while many cancers can look homogeneous in terms of morphology and malignancy, they are in fact highly complex and varied when examined at the genetic level. Cancer is in reality a collection of heterogeneous diseases that, even for common tissues like the prostate, can vary significantly in onset and severity. Thus, cancer treatments based on tissue type often fail, leaving patients to undergo a long, painful process of trial-and-error therapies with multiple severely toxic compounds.
Because cancer is a disease of genomic alterations, it is worthwhile to understand the sources, causes, and kinds of mutations, their connection to specific types of cancer, and how they may predict tumor growth. The human cancer genome project (2) and initiatives like the international cancer genome consortium (3) have demonstrated this concept. The kinds of mutations found so far in tumor populations by NGS include single nucleotide polymorphisms (SNPs), insertions and deletions, and small structural copy number variations (CNVs) (4, 5). From early studies it is clear that far more genomic information will be needed to make Healy's scenario a reality. Next generation sequencing (NGS) technologies will drive this next phase of research and enable a deeper understanding.
The great potential for clinical applications of new DNA sequencing technologies comes from their highly sensitive ability to assess genetic variation. However, to make these technologies clinically feasible, we must assay patient samples at far higher rates than current NGS procedures allow. To date, experiments applying NGS in cancer research have investigated small numbers of samples in great detail, in some cases comparing entire genomes from tumor and normal cells from a single patient (6-8). These experiments show that when a region is sequenced with sufficient coverage, numerous mutations can be identified.
To move NGS technologies into clinical use, many costs must decrease. Two ways to lower costs are to increase sample density and to reduce the number of reads needed per sample. Because cost is a function of turnaround time and read coverage, and read coverage is a function of the signal-to-noise ratio, assays with higher background noise, due to errors in the data, will require higher sampling rates to detect true variation and will be more expensive. To put this in context, future cancer diagnostic assays will likely need to look at over 4000 exons per test. In cases like bladder cancer, or cancers where stool or blood are sampled, non-invasive tests will need to detect variations in one out of 1000 cells. Thus it is extremely important that we understand signal-to-noise ratios and are able to calculate read depth reliably.
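To get a feel for the read depths this implies, here is a minimal sketch in Python. It assumes reads are drawn independently and without sequencing error, and that a mutation present in one out of 1000 cells appears in about one out of 1000 reads at that position (a heterozygous mutation would roughly halve that fraction). The function names and the 95% confidence target are illustrative choices, not part of the project.

```python
import math


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 minus the lower tail."""
    if k > n:
        return 0.0
    lower = 0.0
    for i in range(k):
        # Work in log space so large depths do not overflow.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        lower += math.exp(log_pmf)
    return max(0.0, 1.0 - lower)


def min_depth(variant_fraction, min_variant_reads, confidence):
    """Smallest depth at which at least `min_variant_reads` variant reads
    are seen with probability >= `confidence`.

    Assumes 0 < variant_fraction < 1, min_variant_reads >= 1, confidence < 1,
    and idealized error-free random sampling.
    """
    lo, hi = min_variant_reads - 1, min_variant_reads  # lo can never succeed
    # Double the depth until the target is met, then bisect; this works
    # because the probability never decreases as depth grows.
    while prob_at_least(min_variant_reads, hi, variant_fraction) < confidence:
        lo, hi = hi, hi * 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prob_at_least(min_variant_reads, mid, variant_fraction) >= confidence:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    for k in (1, 3, 5):
        print(k, "variant read(s):", min_depth(0.001, k, 0.95), "reads")
```

Under these idealized assumptions, seeing the variant even once with 95% confidence takes roughly 3,000 reads at that position, and requiring several supporting reads raises the figure further. Real assays will need more, because sequencing errors and sampling biases are ignored here.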
Currently we have a limited understanding of how many reads are needed to detect a given rare mutation. Detecting mutations depends on a combination of sequencing accuracy and depth of coverage. The ratio of signal (true mutations) to noise (false mutations, hidden mutations) depends on how many times we see a correct result. Sequencing accuracy is affected by multiple factors that include sample preparation, sequence context, sequencing chemistry, instrument accuracy, and basecalling software. Current depth-of-coverage calculations assume that sampling is random, an assumption that does not hold in the real world. Corrections will have to be applied to adjust for real-world sampling biases that affect read recovery and sequencing error rates (9-11).
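The interplay between error rate and depth can be illustrated with a simple binomial model. The sketch below is written in Python and rests on loudly stated assumptions: reads are sampled at random (exactly the assumption questioned above), `error_rate` is the chance that a reference base is misread as the specific variant base, and the function names and example numbers are hypothetical. It picks the smallest variant-read count that errors alone rarely reach, then asks how often a true variant clears that bar.

```python
import math


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    lower = 0.0
    for i in range(k):
        # Log-space binomial term, safe for large depths.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        lower += math.exp(log_pmf)
    return max(0.0, 1.0 - lower)


def call_threshold(depth, error_rate, max_false_positive):
    """Smallest variant-read count that sequencing errors alone reach
    with probability <= max_false_positive."""
    t = 1
    while upper_tail(t, depth, error_rate) > max_false_positive:
        t += 1
    return t


def detection_power(depth, variant_fraction, error_rate, threshold):
    """Probability that a true variant produces at least `threshold` variant reads.

    Simplification (an assumption): a read shows the variant base if it comes
    from a variant molecule, or from a reference molecule misread as the variant.
    """
    p_observed = variant_fraction + (1.0 - variant_fraction) * error_rate
    return upper_tail(threshold, depth, p_observed)


if __name__ == "__main__":
    # Hypothetical numbers: variant in 1 of 1000 reads, error rate of the same size.
    for depth in (10_000, 100_000):
        t = call_threshold(depth, 0.001, 0.001)
        print(depth, "reads: threshold", t,
              "power", round(detection_power(depth, 0.001, 0.001, t), 3))
```

When the error rate is comparable to the variant fraction, the calling threshold climbs with depth, so considerably more coverage is needed before a true variant is detected reliably. Lowering the error rate reduces the threshold and the depth required, which is why noise translates directly into cost.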
Developing clinical software systems that work with NGS technologies to quickly and accurately detect rare mutations requires a deep understanding of the factors that affect NGS data collection and interpretation. This information needs to be integrated into decision control systems that can, through a combination of computation and graphical displays, automate analysis and help clinicians verify and validate results. Developing such systems is a major undertaking involving a combination of research and development in the areas of laboratory experimentation, computational biology, and software development.
Positioned for Success
Detecting small genetic changes in clinical samples is ambitious. Fortunately, Geospiza has the right products to deliver on the goals of the research. GeneSifter Lab Edition handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. The data management and analysis server makes the system scalable through a distributed architecture. Combined with GeneSifter Analysis Edition, it forms a complete platform for rapidly prototyping the data analysis workflows needed to test new analysis methods, experiment with new data representations, and iteratively develop data models that integrate results with experimental details.
1. Healy, 2008. "Breaking Cancer's Gene Code - US News and World Report" http://health.usnews.com/articles/health/cancer/2008/10/23/breaking-cancers-gene-code_print.htm
2. Working Group, 2005. "Recommendation for a Human Cancer Genome Project" http://www.genome.gov/Pages/About/NACHGR/May2005NACHGRAgenda/ReportoftheWorkingGrouponBiomedicalTechnology.pdf
3. ICGC, 2008. "International Cancer Genome Consortium - Goals, Structure, Policies & Guidelines - April 2008" http://www.icgc.org/icgc_document/
4. Jones S., et al., 2008. "Core Signaling Pathways in Human Pancreatic Cancers Revealed by Global Genomic Analyses." Science 321, 1801.
5. Parsons D.W., et al., 2008. "An Integrated Genomic Analysis of Human Glioblastoma Multiforme." Science 321, 1807.
6. Campbell P.J., et al., 2008. "Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing." Proc Natl Acad Sci U S A 105, 13081-13086.
7. Greenman C., et al., 2007. "Patterns of somatic mutation in human cancer genomes." Nature 446, 153-158.
8. Ley T.J., et al., 2008. "DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome." Nature 456, 66-72.
9. Craig D.W., et al., 2008. "Identification of genetic variants using bar-coded multiplexed sequencing." Nat Methods 5, 887-893.
10. Ennis P.D., et al., 1990. "Rapid cloning of HLA-A,B cDNA by using the polymerase chain reaction: frequency and nature of errors produced in amplification." Proc Natl Acad Sci U S A 87, 2833-2837.
11. Reiss J., et al., 1990. "The effect of replication errors on the mismatch analysis of PCR-amplified DNA." Nucleic Acids Res 18, 973-978.