Bioinformatics is a big big issue
“If the data problem is not addressed, ABI’s SOLiD, 454’s GS FLX, Illumina’s GAII or any of the other deep sequencing platforms will be destined to sit in their air-conditioned rooms like a Stradivarius without a bow” was the closing statement in the lead editorial “Prepare for the deluge.”
Reminds me of something I said a few months back.
In the editorial, Nature Biotechnology (NBT) makes a number of important points, starting with how the Roche/454 pyrosequencer, launched in 2005, could generate as much data as more than 50 ABI capillary sequencers. Since that launch, new instruments have emerged that increase data output by orders of magnitude. Or, as NBT put it, “The overwhelming amounts of data being produced are the equivalent of taking a drink from a fire hose.”
It's like they read our web site (we ran the image below at the beginning of the year).
The volume of data, and the new ways in which the data must be handled, create many challenges. To begin, there is the conundrum of what to keep: do you keep raw images and processed reads, or just the reads? If you keep raw images, the storage costs are significant. Those costs must be weighed against the likelihood that you will ever need to go back to these data. We call this the data life cycle.
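To make the trade-off concrete, here is a minimal sketch in Python that compares the cost of storing raw images against the expected cost of having to regenerate them. It is only an illustration: the function names are hypothetical, and every number in the example is a made-up placeholder, not a figure from NBT or from our own measurements.

```python
def storage_cost(size_tb: float, cost_per_tb_year: float, years: int) -> float:
    """Total cost of keeping a data set on disk for a number of years."""
    return size_tb * cost_per_tb_year * years


def expected_regeneration_cost(p_needed: float, regeneration_cost: float) -> float:
    """Expected cost of discarding the data: the chance you will need it
    again multiplied by what it would cost to produce it again."""
    return p_needed * regeneration_cost


def should_keep_raw_images(size_tb: float, cost_per_tb_year: float, years: int,
                           p_needed: float, regeneration_cost: float) -> bool:
    """Keep the raw images only if storing them costs less than the
    expected cost of regenerating them later."""
    keep = storage_cost(size_tb, cost_per_tb_year, years)
    discard = expected_regeneration_cost(p_needed, regeneration_cost)
    return keep < discard


if __name__ == "__main__":
    # Placeholder inputs (assumptions for illustration only).
    decision = should_keep_raw_images(
        size_tb=2.0,              # raw images from one run, in terabytes
        cost_per_tb_year=500.0,   # storage cost per terabyte per year
        years=3,                  # how long the images would be retained
        p_needed=0.25,            # chance the images are ever revisited
        regeneration_cost=8000.0  # cost to rerun the experiment
    )
    print("Keep raw images?", decision)
```

The model is deliberately simple. A real data life cycle policy would also account for things like backup copies, falling storage prices, and samples that cannot be sequenced again at any price.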
From raw images, the next challenge is the computational infrastructure needed to process reads and obtain meaningful information. This is a complex process that involves many steps and high-performance computers. NBT made the accurate and important point that instrument manufacturers provide software only to analyze what comes off the machine for common applications. A great deal of bioinformatics support is needed for downstream analysis once the initial alignments or assemblies are completed. Standards for comparing data between instrument platforms are also lacking, which makes it difficult to compare results from different instruments.
While more is needed in terms of bioinformatics support, being able to get tools for alignment and assembly is a good starting point, and NBT lauded ABI’s SOLiD community program as a step in the right direction. The other instrument vendors need to take this kind of approach too. Presently, Illumina and Roche include their tools with an instrument purchase. This is fine for the laboratory that owns the instrument, but it makes a hard problem harder for researchers who receive data sets from different labs. This could lead to threads of frustration.
As the editorial continued, its tone escalated from "overwhelmed" to dire.
NBT stated:
“What all of this means is that for the foreseeable future, next-generation sequencing platforms may remain out of the hands of labs lacking the deep pockets needed for bioinformatics support.”
They also added,
“Thus, if the next-generation platforms are to truly democratize sequencing—bringing genomics out of the main sequencing centers and into the laboratories of single investigators or small academic consortia—much more effort needs to be expended in developing cost-effective software and data management solutions.”
NBT offered some solutions, including getting the instrument vendors to develop community-based solutions and encouraging grant funding organizations to fund bioinformatics as much as they fund sequencing.
Is Next Gen for everyone?
The NBT editors made a lot of great points, but we do not see the world in terms as dire as they do. Yes, getting up and running with Next Gen equipment means preparing for the informatics challenges that await. Next Gen is not Sanger. You cannot look at every read to figure out what your data mean, and you will need a serious computational infrastructure to store, organize, and work with the data. Also, though the editorial did not mention it, you will need a laboratory information management system to organize your experimental information and track the many steps needed to prepare good DNA libraries for sequencing. This is incredibly important.
And, there are solutions.
Geospiza’s FinchLab, combined with our Software as a Service (SaaS) delivery, provides immediate access to the software and hardware infrastructure needed to run these new instruments.
FinchLab delivers the software infrastructure to support laboratory workflows for all the platforms, links the resulting data to samples, and - through a growing list of data analysis pipelines and visualization interfaces - provides the necessary bioinformatics for a wide range of sequencing applications. Further, our bioinformatics approach is community-based. We are working with the best tools as they emerge and are collaborating with multiple groups to advance additional research and development.
SaaS delivers the computing infrastructure on demand. With our SaaS model, the computing infrastructure is always available and grows with your needs. You do not have to set up a large computer system, construct a new building, or risk over- or under-investing to deal with the data.
With FinchLab, the vision of next-generation platforms truly democratizing sequencing can be realized.