FinchTalk: Resequencing


Thursday, October 13, 2011

Personalities of Personal Genomes

"People say they want their genetic information, but they don’t." "The speaker's views of data return are frankly repugnant." These were some of the [paraphrased] comments and tweets expressed during Cold Spring Harbor's fourth annual conference, entitled "Personal Genomes," held Sep 30 - Oct 2, 2011. The conference explored the latest technologies and approaches for sequencing genomes, exomes, and transcriptomes in the context of how genome science is, and will be, impacting clinical care.

The future may be closer than we think

In previous years, the concept of personal genome sequencing as a way to influence medical treatment was a vision. Last year, the reality of the vision was evident through a limited number of examples. This year, several new examples were presented along with the establishment of institutional programs for genomic-based medicine. The driver is the continuing decrease in data collection costs, combined with corresponding access to increasing amounts of data. According to Richard Gibbs (Baylor College of Medicine), we will have close to 5000 genomes completely sequenced by the end of this year, and 30,000 complete genome sequences are expected by the end of 2012.

The growth of genome sequencing is now significant enough that leading institutions are also beginning to establish guidelines for genomics-based medicine. Hence, an ethics panel discussion was held during the conference. The conversation about how DNA sequence data may be used has been an integral discussion since the beginning of the Genome Project. Indeed, James Watson shared his lament for having to fund ethics research and directly asked the panel if they have done any good. There was a general consensus, from the panel and from audience members who have had their genomes sequenced, that ethics funding has helped by establishing genetic counseling and education practices.

However, as pointed out by some audience members, this ethics panel, like many others, focused too heavily on the risks to individuals and society of people having their genomic data. In my view, the discussion would have been more interesting and balanced if the panel had included individuals who are working outside of institutions with new approaches for understanding health. Organizations like 23andMe, PatientsLikeMe, or the Genetic Alliance bring a very different and valuable perspective to the conversation.

Ethics was a fraction of the conference. The remaining talks were organized into six sessions that covered personal cancer genomics, medically actionable genomics, personal genomes, rare diseases, and clinical implementations of personal genomics. The key message from these presentations and posters was that, while genomics-based medical approaches have demonstrated success, much more research needs to be done before such approaches are mainstream.

For example, in the case of cancer genomics, whole genome sequences from tumor and normal cells can give a picture of point mutations and structural rearrangements, but these data need to be accompanied by exome sequences to get the high read depth needed to accurately detect the low levels of rare mutations that may be dysregulating cell growth or conferring resistance to treatment. Yet, the resulting profiles of variants are still inadequate to fully understand the functional consequences of the mutations. For this, transcriptome profiling is needed, and that is just the start.
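
To make the read depth point concrete, here is a minimal sketch of how the chance of detecting a rare mutation changes with depth. It assumes a simple binomial model in which each read independently carries the variant with probability equal to the variant's allele fraction, and it ignores sequencing error. The function name, the 5% allele fraction, and the five-read threshold are illustrative assumptions, not values from the conference talks.

```python
from math import comb


def detection_probability(depth: int, allele_fraction: float, min_reads: int) -> float:
    """Probability of seeing at least `min_reads` variant-supporting reads.

    Assumption: each of `depth` reads independently carries the variant
    with probability `allele_fraction` (binomial model, no sequencing error).
    """
    # Sum the probabilities of observing fewer than `min_reads` variant reads.
    p_fewer = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


# Hypothetical comparison: genome-like depth versus deeper, exome-like depths.
for depth in (30, 100, 300):
    p = detection_probability(depth, allele_fraction=0.05, min_reads=5)
    print(f"depth={depth:4d}  P(at least 5 variant reads)={p:.3f}")
```

Under these assumptions, a variant present in 5% of reads is rarely supported by five or more reads at 30x, but almost always is at 300x, which is the intuition behind pairing whole genome data with deeper targeted sequencing.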

Once the data are collected, they need to be processed in different ways, filtered, and compared within and between samples. Information from many specialized databases will be used in conjunction with statistical analyses to develop insights that can be validated through additional assays and measurements. Finally, a lab seeking to do this work, and return results to patients, will also need to be certified, minimally by CLIA standards. For many groups this is a significant undertaking, and good partners with experience and strong capabilities, like PerkinElmer, will be needed.

Further Reading

Nature Coverage, Oct 6 issue:

Secrets of the human genome disclosed

Nature readers flirt with genomics

Genomes on prescription

Other news and information:

At CSHL conference, researchers highlight importance of RNA-seq data to guide cancer treatment

Personal Genomes 2011 Meeting Site

Conference Tweets

Thursday, October 28, 2010

Bloginar: Making cancer transcriptome sequencing assays practical for the research and clinical scientist

A few weeks back we (Geospiza and Mayo Clinic) presented a research poster at BioMed Central’s Beyond the Genome conference. The objective was to present GeneSifter’s analysis capabilities and discuss the practical issues scientists face when using Next Generation DNA Sequencing (NGS) technologies to conduct clinically oriented research related to human health and disease.

Abstract
NGS technologies are increasing in their appeal for studying cancer. Fully characterizing the more than 10,000 types and subtypes of cancer to develop biomarkers that can be used to clinically define tumors and target specific treatments requires large studies that examine specific tumors in 1000s of patients. This goal will fail without significantly reducing both data production and analysis costs so that the vast majority of cancer biologists and clinicians can conduct NGS assays and analyze their data in routine ways.

While sequencing is now inexpensive enough for small groups and individuals, beyond genome centers, to conduct the needed studies, current data analysis methods need to move from large bioinformatics team approaches to automated methods. Such methods employ established tools in scalable and adaptable systems to provide standard reports and make results available for interactive exploration by biologists and clinicians. Mature software systems and cloud computing strategies can achieve this goal.

Poster Layout

Excluding the title, the poster has five major sections. The first section includes the abstract (above) and study parameters. In the work, we examined RNA from 24 head and neck cancer biopsies: tumor and normal cells from 12 individuals.

The remaining sections (2-5) provide a background of NGS challenges, applications, high-level data analysis workflows, the analysis pipeline used in the work, the comparative analyses that need to be conducted, and practical considerations for groups seeking to do similar work. Much of section 2 has been covered in previous blogs and research papers.

Section 3: Secondary Analysis Explores Single Samples
NGS challenges are best known for the amount of data produced by the instruments. While this challenge should not be undervalued, it is overdiscussed. A far greater challenge lies in the complexity of data analysis. Once the first step (primary analysis, or basecalling) is complete, the resulting millions of reads must be aligned to several collections of reference sequences. For human RNA samples, these include the human genome, splice junction databases, and others to measure biological processes and filter out reads arising from artifacts related to sample preparation. Aligned data are further processed to create tables that annotate individual reads and compute quantitative values related to how the sample’s reads align to (or cover) regions of the genome or span exon boundaries. If the assay measures sequence variation, alignments must be further processed to create variant tables.
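
As a rough illustration of the "quantitative values" step, the sketch below counts how many aligned reads overlap each gene region and computes a mean depth over that region. It is a toy example under stated assumptions, not GeneSifter's pipeline: the `Interval` type, the function names, the coordinates, and the gene names are all hypothetical, and a real pipeline would read alignments from files and use indexed lookups rather than comparing every read to every gene.

```python
from typing import Dict, List, NamedTuple, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def overlap(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals (0 if none or different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def summarize_coverage(
    reads: List[Interval], genes: Dict[str, Interval]
) -> Dict[str, Tuple[int, float]]:
    """Return {gene_name: (overlapping_read_count, mean_depth_over_gene)}."""
    summary = {}
    for name, region in genes.items():
        read_count = 0
        covered_bases = 0
        for read in reads:
            shared = overlap(read, region)
            if shared > 0:
                read_count += 1
                covered_bases += shared
        gene_length = region.end - region.start
        summary[name] = (read_count, covered_bases / gene_length)
    return summary


# Hypothetical aligned reads and gene boundaries.
reads = [
    Interval("chr1", 100, 150),
    Interval("chr1", 120, 170),
    Interval("chr2", 10, 60),
]
genes = {
    "GENE_A": Interval("chr1", 100, 200),
    "GENE_B": Interval("chr2", 0, 50),
}
result = summarize_coverage(reads, genes)
for gene, (count, depth) in result.items():
    print(gene, count, depth)
```

The nested loop keeps the idea visible, but it scales with reads times genes. With millions of reads, sorted or indexed alignments are needed, which is part of why this step is usually automated.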

Secondary analysis produces a collection of data in forms that can be immediately examined to understand overall sample quality and characteristics. High-level summaries indicate how many reads align to things we are interested in and not interested in. In GeneSifter, these summaries are linked to reports that show additional detail. Gene List reports, for example, show how the sample reads align within a gene’s boundary. Pictures in these reports are linked to GeneSifter's Gene Viewer reports that provide even greater detail about the data with respect to each read’s alignment orientation and observed variation.

An important point about secondary analysis, however, is that it focuses on single sample analyses. As more samples are added to the project, the data from each sample must be processed through an assay-specific pipeline. This point is often missed in the NGS data analysis discussion. Moreover, systems supporting this work must not only automate hundreds of secondary analysis steps, they must also provide tools to organize the input and output data in project-based ways for comparative analysis.

Section 4: Tertiary Analysis in GeneSifter Compares Data Between Samples
The science happens in NGS when data are compared between samples in statistically rigorous ways. RNA sequencing makes it possible to compare gene expression, exon expression, and sequence variation between samples to identify differentially expressed genes, their isoforms, and whether certain alleles are differentially expressed. Additional insights are gained when gene lists can be examined in pathways and by ontologies. GeneSifter performs these activities in a user-friendly web environment.

The poster's examples show how gene expression can be globally analyzed for all 24 samples, how a splicing index can distinguish gene isoforms occurring in tumor, but not normal, cells, and how sequence variation can be viewed across all samples. Principal component analysis shows that genes in tumor cells are differentially expressed relative to normal cells. Genes highly expressed in tumor cells include those related to cell cycle and other pathways associated with unregulated cell growth. While these observations are not novel, they do confirm our expectations about the samples, and being able to make such a confirmation with just a few clicks helps avoid costly work on misleading observations. For genes showing differential exon expression, GeneSifter provides ways to identify those genes and navigate to the alignment details. Similarly, reports that show differential variation between samples can be filtered by multiple criteria and link to additional annotation details and read alignments.
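
For readers unfamiliar with the idea of a splicing index, here is a minimal sketch using one commonly used formulation: each exon's expression is normalized by its gene's total expression, and the normalized values are compared between conditions as a log2 ratio. This is an assumption for illustration only. The poster does not spell out the exact calculation GeneSifter uses, and the function names, the pseudocount, and the read counts below are hypothetical.

```python
from math import log2
from typing import Dict


def splicing_index(
    exon_tumor: float,
    gene_tumor: float,
    exon_normal: float,
    gene_normal: float,
    pseudocount: float = 1.0,
) -> float:
    """log2 ratio of gene-normalized exon expression, tumor versus normal.

    Assumed formulation: SI = log2((exon_t / gene_t) / (exon_n / gene_n)).
    The pseudocount avoids division by zero and log of zero for low counts.
    """
    normalized_tumor = (exon_tumor + pseudocount) / (gene_tumor + pseudocount)
    normalized_normal = (exon_normal + pseudocount) / (gene_normal + pseudocount)
    return log2(normalized_tumor / normalized_normal)


def exon_splicing_indices(
    tumor_exons: Dict[str, float],
    normal_exons: Dict[str, float],
    pseudocount: float = 1.0,
) -> Dict[str, float]:
    """Splicing index for every exon of one gene, using summed exon counts as gene expression."""
    gene_tumor = sum(tumor_exons.values())
    gene_normal = sum(normal_exons.values())
    return {
        exon: splicing_index(
            tumor_exons[exon], gene_tumor, normal_exons[exon], gene_normal, pseudocount
        )
        for exon in tumor_exons
    }


# Hypothetical read counts for a three-exon gene; exon3 is mostly skipped in tumor.
tumor = {"exon1": 100, "exon2": 100, "exon3": 5}
normal = {"exon1": 100, "exon2": 100, "exon3": 100}
indices = exon_splicing_indices(tumor, normal)
for exon, value in indices.items():
    print(f"{exon}: {value:+.2f}")
```

An index near zero suggests the exon tracks the gene's overall expression in both conditions, while a strongly negative or positive value flags an exon that may be used differently in tumor cells and is worth inspecting in the read alignments.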

Section 5: Practical Considerations
Complete NGS data analysis systems seamlessly integrate secondary and tertiary analysis. Presently, no other systems are as complete as GeneSifter. There are several reasons why this is the case. First, a significant amount of software must be produced and tested to create such a system. From complex data processing automation, to advanced data queries, to user interfaces that provide interactive visualizations and easy data access, to security, software systems must employ advanced technologies and take years to develop with experienced teams. Second, meeting NGS data processing requirements demands that computer systems be designed with distributable architectures that can support cloud environments in local and hosted configurations. Finally, scientific data systems must support both predefined and ad hoc query capabilities. The scale of NGS applications means that non-traditional approaches must be used to develop data persistence layers that can support a variety of data access methods and, for bioinformatics, this is a new problem.

Because Geospiza has been doing this kind of work for over a decade and could see the coming challenges, we’ve focused our research and development in the right ways to deliver a feature-rich product that truly enables researchers to do high-quality science with NGS.

Enjoy the poster.

Wednesday, July 14, 2010

Increasing the Scale of Deep Sequencing Data Analysis with BioHDF

Last month, at the Department of Energy's Sequencing, Finishing and Analysis in the Future meeting, I presented Geospiza's product development work and how BioHDF is contributing to scalable infrastructures. The abstract, presentation, and link to the presentation are posted below.

Abstract

Next Generation DNA Sequencing (NGS) technologies are powerful tools for rapidly sequencing genomes and studying functional genomics. Presently, the value of NGS technology has been largely demonstrated on individual sample analyses. The full potential of NGS will be realized when it can be used in multisample experiments that involve different measurements and include replicates and controls to make valid statistical comparisons. Arguably, improvements in current technology, and soon-to-be-available “third” generation systems, will make it possible to simultaneously measure 100s to 1000s of individual samples in single experiments to study transcription, alternative splicing, and how sequences vary between individuals and within expressed genes. However, several bioinformatics systems challenges must be overcome to effectively manage both the volumes of data being produced and the complexity of processing the numerous datasets that will be generated.

Future bioinformatics applications need to be developed on common standard infrastructures that can reduce overall data storage, increase data processing performance, integrate information from multiple sources, and are self-describing. HDF technologies meet all of these requirements, have a long history, and are widely used in data-intensive science communities. They consist of general data file formats, software libraries, and tools for manipulating the data. Compared to emerging standards such as the SAM/BAM formats, HDF5-based systems demonstrate improved I/O performance and improved methods to reduce data storage. HDF5 is also more extensible and can support multiple data indexes and store multiple data types. For these reasons, HDF5 and its BioHDF implementation are well qualified as standards for implementing data models in binary formats to support the next generation of bioinformatics applications. Through this presentation we will demonstrate BioHDF's latest features in NGS applications that target transcription analysis and resequencing.

SciVee Video

Acknowledgments

Contributing Authors: Todd Smith (1), Christopher E. Mason (2), Paul Zumbo (2), Mike Folk (3), Dana Robinson (3), Mark Welsh (1), Eric Smith (1), N. Eric Olson (1)

1. Geospiza, Inc., 100 West Harrison, N. Tower 330, Seattle WA 98119
2. Department of Physiology and Biophysics, Weill Cornell Medical College, 1305 York Ave., New York NY 10021
3. The HDF Group, 1901 S. First St., Champaign IL 61820

Funding: NIH: STTR HG003792