Thursday, February 19, 2009

Three Themes from AGBT and ABRF Part II: The Bioinformatics Bottleneck

In my last post, I summarized the presentations and conversations regarding Next Gen Sequencing (NGS) challenges in terms of three themes: You have to pay attention to details in the laboratory, bioinformatics is a bottleneck, and the IT burden is significant. In that post, I discussed the issues related to the laboratory and how GeneSifter Lab Edition overcomes those challenges.

This post tackles the second theme: the bioinformatics bottleneck.

In the Sanger days, bioinformatics was really a challenge for only the highest-throughput facilities, such as genome centers. In these labs, streamlined workflows (pipelines) were developed for the different kinds of sequencing (genomes, ESTs [expressed sequence tags, 1], SAGE [serial analysis of gene expression, 2]). Because Sanger sequencing was high cost and low throughput compared to NGS, the cost of developing the bioinformatics pipelines was low relative to the cost of collecting the data. Thus, large-scale projects such as whole genome shotgun sequencing, ESTs, or resequencing studies could be supported by a handful of pipelines that changed infrequently. In addition, small-scale Sanger projects could be handled well by desktop software and services like NCBI BLAST.

NGS breaks the Sanger paradigm. A single NGS instrument has the same throughput as an entire warehouse of Sanger instruments. To illustrate this point in a striking way, we can look at dbEST - NCBI’s database of ESTs. Today, there are approximately 59 million reads in this database, representing the total accumulation of sequencing projects over a 10 year period. Considering that one run of an Illumina GA or SOLiD can produce between 80 and 180 million reads in a week or two, a single run can now yield up to three times as many reads as have been deposited in dbEST over the past 10 years. These numbers also dwarf the amount of data collected from other gene expression analysis systems like microarrays and sequencing techniques like SAGE.
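The comparison above is simple arithmetic, and a short Python sketch makes the ratio explicit. The three figures come from this post; the variable and function names are just illustrative.

```python
# Figures quoted in this post (approximate, as of early 2009).
DBEST_READS = 59_000_000      # total reads in dbEST after about 10 years
RUN_READS_LOW = 80_000_000    # low end of one Illumina GA or SOLiD run
RUN_READS_HIGH = 180_000_000  # high end of one run


def ratio_to_dbest(run_reads, dbest_reads=DBEST_READS):
    """Return how many dbEST-sized collections one run represents."""
    return run_reads / dbest_reads


low = ratio_to_dbest(RUN_READS_LOW)
high = ratio_to_dbest(RUN_READS_HIGH)
print(f"One run yields {low:.1f}x to {high:.1f}x the reads in dbEST")
# prints: One run yields 1.4x to 3.1x the reads in dbEST
```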

The emergence of the bioinformatics bottleneck

The bioinformatics bottleneck is related to the fact that NGS platforms are general purpose; they collect sequence data. That’s it. Because they collect a lot of data very quickly, we can use sequences as data points for many different kinds of measurements. When we think this way, an extremely wide range of experiments can be conceived.

From sequencing complete genomes, to sampling genes in the environment, to measuring mutations in cancer, to understanding epigenomics, to measuring gene expression and the transcriptome, NGS applications are proliferating at a rapid pace. However, each experiment requires a specialized bioinformatics pipeline, and the algorithms used within a pipeline must be tuned for the data produced by the different sequencing platforms and for the questions being asked. When these considerations are combined with other issues, like what reference data to use for sequence comparisons, the number of bioinformatics pipelines can grow in a combinatorial fashion.
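To see how quickly the combinations add up, here is a minimal Python sketch that enumerates one pipeline per (application, platform, reference) combination. The applications and platforms are the ones named in this post. The three reference data choices are hypothetical placeholders, and the assumption that every combination needs its own pipeline is a worst case for illustration.

```python
from itertools import product

# Applications and platforms mentioned in this post.
applications = [
    "genome sequencing",
    "environmental sampling",
    "cancer mutation detection",
    "epigenomics",
    "gene expression",
]
platforms = ["Illumina GA", "SOLiD"]

# Hypothetical reference data choices, for illustration only.
references = ["reference genome", "transcript database", "custom sequence set"]

# Worst case: each combination needs its own tuned pipeline.
pipelines = list(product(applications, platforms, references))

print(len(pipelines))  # 5 * 2 * 3 = 30
```

Adding one more platform or one more reference set multiplies the total rather than adding to it, which is the point of the paragraph above.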

The early recommendation is that each lab wanting to do NGS work needs to have a dedicated bioinformatics professional. In more than one talk, presenters even quantified bioinformatics support in terms of FTEs (full-time equivalents) per instrument. Bioinformatics is needed both in the sequencing laboratory, to develop and maintain quality control pipelines, and in the research environment, to process (align) the data, mine the output for interesting features, and perform comparative analyses between datasets.

But this won’t work

It is clear that bioinformatics is critical to understanding the data being produced. However, the current recommendation that any group planning NGS experiments should also have a dedicated bioinformatician is impractical for several reasons.

First, the model of a bioinformatician for every lab is simply not scalable. Fundamentally, there are not enough people who understand the science, programming, statistics, and other resources, such as the different forms of reference data, algorithms, and data types, needed to make sense of NGS data. We see plenty of evidence, in the literature and presentations, that there are many outstanding people doing this work and contributing to the community. The problem is that they already have jobs!

Second, even if we assume the above model is workable, hiring people takes significant time and is expensive, and the ongoing costs are going to be high. These time and cost investments only become reasonable when a significant number of experiments are planned. One or two instruments will produce between 25 and 50 runs' worth of data per year. If you calculate instrument costs, reagents, salary, and overhead costs, you are quickly into many thousands of dollars per sample. Indeed, a theme expressed in discussions of the bioinformatics bottleneck is that bioinformatics is becoming the single largest ongoing cost of NGS. Add in the IT and computer support (next post) and you had better have a plan for running a lot more than 50 runs per year. Remember the first issue: good bioinformaticians with NGS analysis experience have jobs.
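The cost argument can be sketched as a small Python calculation. The function below spreads fixed yearly costs (instrument and staff) over the number of runs and adds per-run reagents. Every dollar amount in the example is a hypothetical placeholder chosen only to exercise the function, not a figure from the meetings or an estimate of real prices, and the function works per run rather than per sample.

```python
def cost_per_run(instrument_cost, instrument_lifetime_years,
                 staff_cost_per_year, reagents_per_run, runs_per_year):
    """Spread fixed yearly costs over the runs and add per-run reagents."""
    if runs_per_year <= 0:
        raise ValueError("runs_per_year must be positive")
    fixed_per_year = (instrument_cost / instrument_lifetime_years
                      + staff_cost_per_year)
    return fixed_per_year / runs_per_year + reagents_per_run


# Hypothetical placeholder inputs, NOT real prices.
example = dict(
    instrument_cost=500_000,
    instrument_lifetime_years=5,
    staff_cost_per_year=150_000,
    reagents_per_run=5_000,
)

for runs in (25, 50, 100):
    total = cost_per_run(runs_per_year=runs, **example)
    print(f"{runs} runs/year: {total:,.0f} per run")
```

With these placeholder inputs, doubling the number of runs from 25 to 50 cuts the fixed share of each run in half, which is why the staffing model only makes sense at higher volumes.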

Third, if you have access to bioinformatics support, or can hire an individual, that person will quickly become overwhelmed with work. The biggest reason is that the software infrastructure needed to quickly develop new pipelines, automate them, and deliver data in ways that can be consumed by non-programming scientists is typically lacking. The result is that scientific programming efforts generally turn into lengthy software development projects because, without an infrastructure, the numbers and kinds of experiments quickly grow beyond the capacity of a single individual.

So, what can be done?

Geospiza solves the bioinformatics challenge in multiple ways. GeneSifter Lab and Analysis editions provide a platform that delivers the complete infrastructure needed to deploy NGS data processing pipelines and deliver results through web-based interfaces. These systems include pipelines for many of the common NGS applications, such as transcription analysis, small RNA detection, ChIP-Seq, and other assays. The system architecture and accompanying API create a framework for quickly adding new pipelines and making the results available to the biologists running the experiments.

For those with access to bioinformatics help, GeneSifter will make your team more productive because developers will be freed from the burden of creating the automation and delivery infrastructure, enabling them to focus on new scientific programming problems. For those without access to such resources, we have many pipelines ready to go. Moreover, because we have a platform and the infrastructure already built, as well as deep bioinformatics experience, we can create and deliver new analysis pipelines quickly. Finally, our product development roadmap is well aligned with the most common NGS assays, which means you can probably do your bioinformatics analysis today!


1. Adams M.D., Soares M.B., Kerlavage A.R., Fields C., Venter J.C., 1993. Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4, 373-380.

2. Velculescu V.E., Zhang L., Vogelstein B., Kinzler K.W., 1995. Serial analysis of gene expression. Science 270, 484-487.


Martin said...

Great post, Todd. Are you going to the NGS meeting in San Diego in March?

Todd Smith said...

Yes, Geospiza plans to attend.