Next Generation Sequencing (NGS) is a hot topic. As we kick off 2010, many themes continue. Data throughput is increasing, sequencing costs are decreasing, and NGS still requires extensive informatics support.
Throughput up, costs down
As sequencing throughput increases, the costs for collecting sequencing data decrease. Illumina is setting the pace for 2010 by announcing its latest sequencing instrument, the HiSeq2000. Illumina’s press release, news reports, and the blogosphere enthusiastically report on the instrument’s fivefold increase in data throughput and its ability to sequence an entire human genome in about one week for about $10,000.
What about the informatics?
This month’s editorial in Nature Reviews Genetics (NRG) and review in Nature Biotechnology (NBT) both claim that the most significant NGS challenge continues to be dealing with the data. As pointed out in the NRG editorial, it is quite possible that the community will produce more sequence data this year than has been cumulatively produced in the past 10 years. The HiSeq, developments that will be announced by Applied Biosystems in February, and the coming single molecule sequencers support this. The editorial further makes the point that genome centers have the computing infrastructure to deal with the data, but the larger community of researchers who could benefit from these technologies does not. A similar observation was made at the end of the NBT review, which pointed out that the costs associated with downstream handling and processing of the data will possibly equal or exceed data collection costs.
The significance of the informatics challenge is that wide adoption of NGS technologies assumes that we have usable solutions for working with the data. These solutions go beyond simply getting a computer cluster with a sequencing instrument. To be useful, that cluster needs to reside in an adequately air conditioned room and be operated by people who know how to work with cluster hardware and software and who can optimize networks to manage the flow of data. Other individuals are needed who can write programs and scripts to process the data, work with multiple database technologies, and develop scalable user interfaces to visualize and navigate through the results and compare information between multiple samples and experiments.
The conversation about the informatics problem began with the introduction of NGS technologies. In 2008, Nature Methods (July) and NBT (October) published editorials speaking to the coming challenges. Later, in 2009, Science published a new article about data intensive science. Previous FinchTalks have discussed the articles and their significance, and the theme has remained the same: both the access to computing technologies and the skills needed to use the data are unavailable to the large numbers of researchers who need to use these technologies to remain competitive.
There is a solution
One solution to the informatics challenge created by NGS, and other data intensive technologies, is to make use of the immense Internet-based computing infrastructure that has been created by companies like Amazon, Google, Yahoo, and others. Often called Cloud Computing, these Internet-based services remove many of the hardware and infrastructure barriers to using high performance computing and storage technology. This message was delivered by the 2010 NBT kickoff editorial and accompanying news feature, along with the next important message: software solutions also need to be adapted to cloud environments. Here the editorial, like many other descriptions of NGS informatics needs, falls short in that it focuses only on alignment programs. Simply adapting alignment algorithms, using technologies like Hadoop, to employ Cloud-based high performance computing clusters is not a sufficient solution.
Aligning billions of reads to reference data quickly and accurately is clearly important. However, it is just the first step of a complex analysis process. The subsequent steps of analyzing the billions of alignments to filter artifacts, identify true and new variation between sequences, discover alternative splice forms in transcripts, and compare data between samples are even more challenging.
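To make the point concrete, here is a toy sketch in Python of two of those post-alignment steps: filtering out low-confidence alignments, then looking for positions where the remaining reads agree on a base that differs from the reference, sample by sample. Everything in it is a simplifying assumption made for illustration (the `Alignment` record, the function names, the thresholds, and ungapped reads on a single short reference). It is not how any production pipeline, including ours, works.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical, simplified alignment record (ungapped, one reference)."""
    sample: str
    position: int  # 0-based start of the read on the reference
    read: str      # read bases
    mapq: int      # mapping quality reported by the aligner


def call_variants(reference, alignments, min_mapq=20, min_depth=3, min_fraction=0.8):
    """Return {position: base} where filtered reads agree on a non-reference base."""
    pileup = defaultdict(Counter)
    for aln in alignments:
        if aln.mapq < min_mapq:
            continue  # treat low-quality alignments as likely artifacts
        for offset, base in enumerate(aln.read):
            pos = aln.position + offset
            if pos < len(reference):
                pileup[pos][base] += 1

    variants = {}
    for pos, counts in pileup.items():
        depth = sum(counts.values())
        base, count = counts.most_common(1)[0]
        if depth >= min_depth and base != reference[pos] and count / depth >= min_fraction:
            variants[pos] = base
    return variants


def compare_samples(reference, alignments, **thresholds):
    """Call variants separately for each sample so results can be compared."""
    by_sample = defaultdict(list)
    for aln in alignments:
        by_sample[aln.sample].append(aln)
    return {name: call_variants(reference, alns, **thresholds)
            for name, alns in by_sample.items()}
```

Even this toy version needs decisions about quality cutoffs, read depth, and how to group results by sample. At real scale, with billions of alignments, gapped reads, and many experiments, each of those decisions becomes a data management and computing problem of its own.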
Fortunately, Geospiza understands the problem well. As our tag line, From Samples to Results™, suggests, our lab and analysis systems focus on solving a complete set of problems that need to be addressed in order to do good science with NGS and other genetic analysis technologies.
Perhaps this is why we were the only software provider discussed in the NBT news feature, “Up in a cloud.”