Thursday, November 20, 2008

Introducing GeneSifter

Today, Geospiza announced the acquisition of the award-winning GeneSifter microarray data analysis product. This news has significant implications for Geospiza’s current and new customers. With GeneSifter and FinchLab, Geospiza will deliver complete end-to-end systems for data-intensive genetic analysis applications such as Next Gen sequencing and microarrays.

As an example, let's consider transcriptomics or gene expression. One goal of such experiments is to compare the relative gene expression between cells to see how different genes are up or down regulated as the cells change over time or respond to some sort of treatment.

The general process, whether it involves microarrays or Next Gen sequencing, is to measure the number of RNA molecules for a given gene, either over a period of time or after different treatments. Laboratory processes create the molecules to assay, the molecules are measured, data are collected, and we process the data to produce tables of information. These tables are then compared with one another to identify genes that are differentially expressed. With the gene expression results in hand, one can delve deeper by utilizing other databases like Entrez Gene or pathway sites to learn about gene function and gain insights.
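To make the table comparison step concrete, here is a minimal sketch in Python. It assumes each sample's table has already been reduced to a mapping of gene name to count, and it flags genes by log2 fold change. The function names, the pseudocount, and the two-fold threshold are illustrative assumptions, not a description of how GeneSifter or FinchLab performs the comparison.

```python
import math


def log2_fold_changes(control, treated, pseudocount=1.0):
    """Return {gene: log2(treated / control)} for every gene in either table.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for genes that are absent from one of the tables.
    """
    genes = set(control) | set(treated)
    return {
        gene: math.log2(
            (treated.get(gene, 0) + pseudocount)
            / (control.get(gene, 0) + pseudocount)
        )
        for gene in genes
    }


def differentially_expressed(control, treated, threshold=1.0):
    """List genes whose absolute log2 fold change meets the threshold."""
    changes = log2_fold_changes(control, treated)
    return sorted(gene for gene, lfc in changes.items() if abs(lfc) >= threshold)


# Hypothetical count tables for two samples.
control = {"geneA": 100, "geneB": 50, "geneC": 10}
treated = {"geneA": 105, "geneB": 210, "geneC": 1}

print(differentially_expressed(control, treated))
```

A real analysis would also use replicates and a statistical test; fold change alone only illustrates the idea of comparing two tables.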

From a systems perspective, you need a LIMS to define sample information and keep track of workflow steps and the data generated at the bench. You will also need to track which samples are on a slide, or lane, or well when the data are collected. You will need to store and organize the data by sample. Then, you will need to analyze the data through multiple programs in a pipelined process (filter, align ...) to produce information, like gene lists, that can be compared for each sample. You may want to review this information to see that your experiments are on track and then, if they are, you will want to compare the gene lists from different experiments to tell a story.

FinchLab, combined with Geospiza’s hosted Software as a Service (SaaS) delivery, solves challenges related to IT, LIMS, and the core data analysis. GeneSifter completes the process by delivering a software solution that lets you compare your gene lists. GeneSifter provides information about the relative gene expression between samples and links gene information to key public resources to uncover additional details.

It's an exciting time for those in the genetic analysis and genomics fields. New high-throughput data collection technologies are giving scientists the ability to interrogate systems and understand biology in a whole new way. As we come to the end of 2008 and look ahead to 2009, Geospiza is excited about integrating and extending our products to further develop end-to-end systems for a wide variety of genomics applications. These applications target basic and clinical research to help us improve human health and well-being.



Sunday, November 9, 2008

Next Gen-Omics

Advances in Next Gen technologies have led to a number of significant papers in recent months, highlighting their potential to advance our understanding of cancer and human genetics (1-3). These and the hundreds of other papers demonstrate the value of Next Gen sequencing. The work completed thus far has been significant, but much more needs to be done to make these new technologies useful for a broad range of applications. Experiments will get harder.

While much of the discussion in the press focuses on rapidly sequencing human genomes at low cost as part of the grail of personalized genomics (4), a vast amount of research must be performed at the systems level to fully understand the relationship between biochemical processes in a cell and how the instructions for the processes are encoded in the genome. Systems biology and a plethora of "omics" have emerged to measure multiple aspects of cell biology as DNA is transcribed into RNA, RNA is translated into protein, and proteins interact with molecules to carry out biochemistry.

As noted in the last post, we are developing proposals to further advance the state of the art in working with Next Gen data sets. In one of those proposals, Geospiza will develop novel approaches to work with data from applications of Next Gen sequencing technologies that are being developed to study the omics of DNA transcription and gene expression.

Toward furthering our understanding of gene expression, Next Gen DNA sequencing is being used to perform quantitative assays where DNA sequences are used as highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data to obtain quantitative information by tabulating the frequency, positional information, and variation from the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or in populations, are compared in different ways to make discoveries and draw experimental conclusions. Recall the three phases of data analysis.
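As a rough illustration of the tabulation step, the sketch below summarizes aligned reads per reference. It assumes the alignments have already been parsed into (reference name, start position, mismatch count) tuples; that input shape and the function name are hypothetical, chosen only to keep the example self-contained.

```python
from collections import Counter, defaultdict


def tabulate_alignments(alignments):
    """Summarize reads per reference: frequency, position range, variation.

    `alignments` is an iterable of (reference, start, mismatches) tuples,
    an assumed input shape for this sketch.
    """
    counts = Counter()
    starts = defaultdict(list)
    mismatches = Counter()
    for reference, start, mismatch_count in alignments:
        counts[reference] += 1
        starts[reference].append(start)
        mismatches[reference] += mismatch_count
    return {
        reference: {
            "reads": counts[reference],
            "first_start": min(starts[reference]),
            "last_start": max(starts[reference]),
            "mismatches": mismatches[reference],
        }
        for reference in counts
    }


# Hypothetical alignments for three reads.
example_alignments = [("geneA", 120, 0), ("geneA", 95, 1), ("geneB", 300, 2)]
for reference, row in sorted(tabulate_alignments(example_alignments).items()):
    print(reference, row)
```

Each row of the resulting table corresponds to one reference sequence, which is the kind of per-sample table that can then be compared across treatments.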

However, to be useful these data sets need to come from experiments that measure what we think they should measure. The data must be high quality and free of artifacts. In order to compare quantitative information between samples, the data sets must be refined and normalized so that biases introduced through sample processing are accounted for. Thus, a fundamental challenge to performing these kinds of experiments is working with the data sets that are produced. In this regard numerous challenges exist.
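One simple example of such normalization is scaling each sample's counts to a common total, so that differences in sequencing depth are not mistaken for differences in expression. The sketch below uses counts per million for this; it is one common convention chosen for illustration, not a method the post prescribes, and the function name and sample data are assumptions.

```python
def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize a sample with no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Hypothetical samples: the second was sequenced ten times deeper,
# but the relative abundance of each gene is identical.
shallow = {"geneA": 200, "geneB": 800}
deep = {"geneA": 2000, "geneB": 8000}

print(counts_per_million(shallow))
print(counts_per_million(deep))
```

Depth scaling addresses only one source of bias. Biases introduced by sample preparation generally need their own controls and corrections.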

The obvious ones relating to data storage and bioinformatics are being identified in both the press and scientific literature (5,6). Other, less publicized, issues include a lack of:
  • standard methods and controls to verify datasets in the context of their experiments,
  • standardized ways to describe experimental information, and
  • standardized quality metrics to compare measurements between experiments.
Moreover, data visualization tools and other user interfaces, if available, are primitive and significantly slow the pace at which a researcher can work with the data. Finally, information technology (IT) infrastructures that can integrate the system parts dealing with sample tracking, experimental data entry, data management, data processing, and result presentation are incomplete.

We will tackle the above challenges by working with the community to develop new data analysis methods that can run independently and within Geospiza's FinchLab. FinchLab handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible order interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. Geospiza's hosted (Software as a Service [SaaS]) delivery models remove additional IT barriers.

FinchLab's data management and analysis server make the system scalable through a distributed architecture. The current implementation of the analysis server creates a complete platform to rapidly prototype new data analysis workflows and will allow us to quickly devise and execute feasibility tests, experiment with new data representations, and iteratively develop the needed data models to integrate results with experimental details.

References

1. Ley, T. J., Mardis, E. R., Ding, L., Fulton, B., et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66-72 (2008).

2. Wang, J., Wang, W., Li, R., Li, Y., et al. The diploid genome sequence of an Asian individual. Nature 456, 60-65 (2008).

3. Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53-59 (2008).

4. My genome. So what? Nature 456, 1 (2008).

5. Prepare for the deluge. Nature Biotechnology 26, 1099 (2008).

6. Byte-ing off more than you can chew. Nature Methods 5, 577 (2008).

Tuesday, November 4, 2008

November is about elections and grants

Halloween is over, and today is election day. Go vote! Then come back and read about what we are planning for our next steps in Next Gen sequencing.

This month we are preparing SBIR proposals to target some of the real challenges researchers face when working with next-generation (Next Gen) sequence data.

The first will deal with issues related to detecting rare variants in cancer. With our collaborators we plan to develop control samples to detect different kinds of mutations and use the samples and data produced to develop new software tools and interfaces for measuring results. While the work will focus on cancer research, detecting rare variants in large datasets is a common problem with many applications.

The second proposal will deal with improving tools and methods to validate datasets from quantitative assays that utilize Next Gen data. When you run an RNA-Seq, ChIP-Seq, or Other-Seq experiment, where you collect numerous molecular tags from RNA or DNA in your research samples, how do you know that your data represent interesting biology and are free of artifacts? Pulling the relevant features that distinguish biological reality from experimental artifact out of datasets composed of millions and millions of reads can be a real problem.

The above projects will leverage the experience of Geospiza, its customers, and the community to develop novel features and integrated resources that will be added to FinchLab, new products, and open-source contributions to make your work easier. If you are interested in learning more about these projects or in working with us to tackle the next generation of hard problems, contact us.