Sunday, November 9, 2008

Next Gen-Omics

Advances in Next Gen technologies have led to a number of significant papers in recent months, highlighting their potential to advance our understanding of cancer and human genetics (1-3). These and hundreds of other papers demonstrate the value of Next Gen sequencing. The work completed thus far has been significant, but much more needs to be done to make these new technologies useful for a broad range of applications. Experiments will get harder.

While much of the discussion in the press focuses on rapidly sequencing human genomes at low cost as part of the grail of personalized genomics (4), a vast amount of research must be performed at the systems level to fully understand the relationship between biochemical processes in a cell and how the instructions for those processes are encoded in the genome. Systems biology and a plethora of "omics" have emerged to measure multiple aspects of cell biology as DNA is transcribed into RNA, RNA is translated into protein, and proteins interact with other molecules to carry out biochemistry.

As noted in the last post, we are developing proposals to further advance the state of the art in working with Next Gen data sets. In one of those proposals, Geospiza will develop novel approaches to work with data from applications of Next Gen sequencing technologies that are being developed to study the omics of DNA transcription and gene expression.

Toward furthering our understanding of gene expression, Next Gen DNA sequencing is being used to perform quantitative assays where DNA sequences are used as highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data, and quantitative information is obtained by tabulating the frequency, positional information, and variation of the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or population are compared in different ways to make discoveries and draw experimental conclusions. Recall the three phases of data analysis.
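To make the tabulation step concrete, here is a minimal sketch in Python. It assumes alignments have already been reduced to simple records of read ID, reference name, start position, and mismatch count. The record layout, the sample values, and the `tabulate` function are hypothetical and for illustration only; they are not FinchLab's data model or any aligner's output format.

```python
from collections import Counter, defaultdict

# Hypothetical alignment records (assumed layout, not a real aligner format):
# (read_id, reference_name, start_position, mismatches)
alignments = [
    ("read1", "geneA", 10, 0),
    ("read2", "geneA", 12, 1),
    ("read3", "geneB", 5, 0),
    ("read4", "geneA", 10, 0),
]


def tabulate(records):
    """Summarize alignments per reference.

    For each reference, collect the number of aligned reads (frequency),
    a count of reads per start position (positional information), and
    the total number of mismatches (a crude stand-in for variation).
    """
    table = defaultdict(lambda: {"reads": 0, "starts": Counter(), "mismatches": 0})
    for _read_id, ref, start, mismatches in records:
        row = table[ref]
        row["reads"] += 1
        row["starts"][start] += 1
        row["mismatches"] += mismatches
    return dict(table)


table = tabulate(alignments)
for ref in sorted(table):
    row = table[ref]
    print(ref, row["reads"], dict(row["starts"]), row["mismatches"])
```

One such table per sample is the kind of data table that would then be compared across treatments, environments, or populations.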

However, to be useful these data sets need to come from experiments that measure what we think they should measure. The data must be high quality and free of artifacts. In order to compare quantitative information between samples, the data sets must be refined and normalized so that biases introduced through sample processing are accounted for. Thus, a fundamental challenge to performing these kinds of experiments is working with the data sets that are produced. In this regard numerous challenges exist.
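As one simple illustration of why normalization matters, the sketch below scales raw read counts to counts per million aligned reads, so that two samples sequenced to different depths can be compared. This is only one basic normalization, chosen here as an example; it corrects for sequencing depth alone and not for other biases introduced during sample processing. The sample counts and the `counts_per_million` function are hypothetical.

```python
def counts_per_million(counts):
    """Scale raw per-reference read counts to counts per million aligned reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize a sample with no aligned reads")
    return {ref: count * 1_000_000 / total for ref, count in counts.items()}


# Hypothetical raw counts: sample_b was sequenced twice as deeply as sample_a,
# but the relative abundance of each reference is the same in both.
sample_a = {"geneA": 300, "geneB": 100, "geneC": 600}
sample_b = {"geneA": 600, "geneB": 200, "geneC": 1200}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)

for ref in sorted(sample_a):
    print(ref, sample_a[ref], sample_b[ref], cpm_a[ref], cpm_b[ref])
```

The raw counts suggest a twofold difference between the samples for every reference, while the scaled values agree. That difference came from sequencing depth, not biology.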

The obvious ones, relating to data storage and bioinformatics, are being identified in both the press and the scientific literature (5,6). Other, less publicized, issues include a lack of:
  • standard methods and controls to verify datasets in the context of their experiments,
  • standardized ways to describe experimental information and
  • standardized quality metrics to compare measurements between experiments.
Moreover, data visualization tools and other user interfaces, if available, are primitive and significantly slow the pace at which a researcher can work with the data. Finally, information technology (IT) infrastructures that can integrate the system parts dealing with sample tracking, experimental data entry, data management, data processing, and result presentation are incomplete.

We will tackle the above challenges by working with the community to develop new data analysis methods that can run independently and within Geospiza's FinchLab. FinchLab handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible order interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. Geospiza's hosted (Software as a Service [SaaS]) delivery models remove additional IT barriers.

FinchLab's data management and analysis servers make the system scalable through a distributed architecture. The current implementation of the analysis server creates a complete platform to rapidly prototype new data analysis workflows and will allow us to quickly devise and execute feasibility tests, experiment with new data representations, and iteratively develop the needed data models to integrate results with experimental details.


1. Ley, T. J., Mardis, E. R., Ding, L., Fulton, B., et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66-72 (2008).

2. Wang, J., Wang, W., Li, R., Li, Y., et al. The diploid genome sequence of an Asian individual. Nature 456, 60-65 (2008).

3. Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53-59 (2008).

4. My genome. So what? Nature 456, 1 (2008).

5. Prepare for the deluge. Nature Biotechnology 26, 1099 (2008).

6. Byte-ing off more than you can chew. Nature Methods 5, 577 (2008).
