Thursday, June 11, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part II: Background

Next Generation DNA Sequencing (NGS) technologies will continue to produce ever-increasing amounts of complex data. In the old world of electrophoresis-based sequencing (a.k.a. Sanger), we could effectively manage projects and data growth by relying largely on improvements in computer hardware. NGS breaks this paradigm; we now must think more deeply about software and hardware systems as a whole solution. Bringing technologies like HDF5 into the bioinformatics world provides a path to such solutions. In this installment of the bloginar (started with the previous post), the power of NGS and the challenges associated with NGS data analysis are discussed.

The Power of Next Generation DNA Sequencing

Over the past couple of years, hundreds of journal articles demonstrating the power of NGS have been published. These articles have been accompanied by numerous press reports communicating the community’s excitement, and selected quotes from them convey the transformational nature of these technologies. Experiments that, only a short time ago, required a genome center-styled approach can now be done by a single lab.

Transcriptome profiling and digital gene expression are two examples that illustrate the power of NGS. For a few thousand dollars, a single run of a SOLiD or Illumina sequencer can produce more ESTs (Expressed Sequence Tags) than exist in all of dbEST. Even more striking is that the data in dbEST were collected over many years, with hundreds of millions of dollars invested in getting the sequences. It is also important to understand that the data in dbEST represent only a relatively small number of molecules from a broad range of organisms and tissues. While dbEST has been useful for providing a glimpse into the complex nature of gene expression and, in the case of human, mouse, and a few other organisms, for mapping a reasonable number of genes, the resource cannot provide the deeper details and insights into alternative splicing, antisense transcription, small RNA expression, and allelic variation that can only be achieved when a cell’s complement of RNA is sampled deeply. As the recent literature demonstrates, each time an NGS experiment is conducted, we seem to get a new view of the complexity of gene expression and how genomes are transcribed. A similar story repeats when NGS is compared to other gene expression measurement systems, such as microarrays.

Of course, with the excitement about NGS come frightening reports about how to handle the data. Just as one can go to the news and scientific literature for great quotes about the power of NGS, one can find similar quotes describing the challenges associated with the increase in data production. Metaphors of fire hoses spraying data often accompany graphs showing exponential growth in data production and accumulation. When these growth curves are compared to Moore’s law, we learn that the data are accumulating at faster rates than improvements in hardware performance. So, those who plan to wait for data management and analysis problems to be solved by better computers will have to wait a long time.

Analyzing data is harder than it looks

A significant portion of the NGS data problem discussion focuses on how to store the large amounts of data. While storage is a major concern, there are even greater challenges with analyzing the data; this is often referred to as the “bioinformatics bottleneck.” To understand the reason for the bottleneck, we need to understand NGS data analysis. The high-level NGS analysis workflow, which converts images into bases and bases into results, is commonly divided into three phases. In the first phase (primary analysis), millions of individual images are analyzed to create files containing millions of sequence reads. The resulting reads are then converted to biologically meaningful information, either by aligning the reads to reference data or by assembling the reads into contigs (de novo sequencing). This second phase (secondary analysis) is application specific; each kind of sequencing experiment or assay requires a specific workflow in which reads are compared to different reference sources in multi-step, iterative processes. The third phase (tertiary analysis) consists of analyzing multiple datasets derived from the second phase in order to annotate genomes, measure gene expression, observe linked variants, or uncover epigenetic patterns.
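To make the three phases concrete, here is a toy sketch in Python. The data and steps are invented placeholders (no real base caller or aligner appears here); the point is only to show how the output of one phase becomes the input of the next.

    # Toy sketch of the three NGS analysis phases. The data and steps are
    # hypothetical placeholders, not a real instrument or vendor pipeline.

    def primary_analysis(raw_signal):
        """Phase 1: turn raw instrument signal into base-called reads."""
        # Stand-in: each 'signal' is already a read string in this toy example.
        return [s.upper() for s in raw_signal]

    def secondary_analysis(reads, reference):
        """Phase 2: align each read to a reference; report (read, position)."""
        return [(read, reference.find(read)) for read in reads]

    def tertiary_analysis(alignments):
        """Phase 3: combine secondary results, e.g. count reads per position."""
        counts = {}
        for _, pos in alignments:
            counts[pos] = counts.get(pos, 0) + 1
        return counts

    reference = "ACGTACGTTTGACCA"
    reads = primary_analysis(["acgt", "ttga", "acca"])
    alignments = secondary_analysis(reads, reference)
    print(tertiary_analysis(alignments))   # one read at each aligned position

In a real pipeline each of these functions is replaced by one or more large programs, and the arrows between them become files on disk, which is where the trouble begins.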

The bioinformatics bottleneck stems primarily from the complexity of secondary analysis. Secondary analyses use several different alignment programs to compare reads to reference data. Each alignment program produces multiple output files, each containing a subset of the information that must be interpreted to understand what the data mean. The output files are large in their own right; a few million reads, for example, can produce many tables of alignment data consisting of many millions of rows. It is not possible to extract meaning from such files without additional software to “read” the data, and no two alignment programs produce data in a common format. To summarize, secondary analysis utilizes multiple programs, each program produces a varied number of files with data in different formats, and the files are too large and complex to be understood by eye. More software is needed.
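As a simplified illustration of the format problem, the sketch below imagines two aligners that report the same alignment with different column layouts. The formats are invented for this example, but they show why extra parsing software is needed before any biology can be done.

    # Two hypothetical aligners report the same alignment in different
    # layouts, so software is needed just to normalize them into one record.
    import csv, io

    def parse_aligner_a(text):
        """Hypothetical aligner A: read_id, chrom, pos, strand (tab-separated)."""
        for read_id, chrom, pos, strand in csv.reader(io.StringIO(text), delimiter="\t"):
            yield {"read": read_id, "chrom": chrom, "pos": int(pos), "strand": strand}

    def parse_aligner_b(text):
        """Hypothetical aligner B: chrom:pos first, then strand and read_id."""
        for line in text.splitlines():
            locus, strand, read_id = line.split()
            chrom, pos = locus.split(":")
            yield {"read": read_id, "chrom": chrom, "pos": int(pos), "strand": strand}

    a_output = "read_001\tchr1\t1045\t+\n"
    b_output = "chr1:1045 + read_001\n"
    records = list(parse_aligner_a(a_output)) + list(parse_aligner_b(b_output))
    print(records)   # the same alignment, recovered from two different formats

Multiply this by every aligner, every assay-specific workflow, and every multi-million-row output table, and the scale of the parsing and bookkeeping burden becomes clear.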

From a computational and software engineering standpoint, there is room for improvement. When the common secondary analysis processes are examined further, we see that the many large, hard-to-read files are only the tip of the iceberg. The lack of well-organized data models and of an infrastructure for storing structured information results in significant levels of redundant data processing and data copying, because common information must be represented in different contexts. The problem is that, with data throughput scaling faster than Moore’s law, maintaining systems with this kind of complexity and redundancy becomes unpredictably costly in terms of system performance, storage needs, and software development time. Ultimately, it limits what we can do with the data and the questions that can be asked.
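As a rough, hypothetical illustration of what better-structured storage could look like, the sketch below writes alignment records into an HDF5 file using the generic h5py library. The layout is invented for this example and is not the BioHDF schema; it simply shows typed, binary, self-describing storage in place of flat text files.

    # Rough illustration only: an invented HDF5 layout for alignment records,
    # written with the generic h5py library (not the actual BioHDF schema).
    import numpy as np
    import h5py

    # A typed record: fixed-width read id, chromosome, position, strand.
    alignment_dtype = np.dtype([("read_id", "S32"),
                                ("chrom", "S16"),
                                ("pos", np.int64),
                                ("strand", "S1")])

    records = np.array([(b"read_001", b"chr1", 1045, b"+"),
                        (b"read_002", b"chr1", 20988, b"-")],
                       dtype=alignment_dtype)

    with h5py.File("alignments.h5", "w") as f:
        grp = f.create_group("experiment_001")
        dset = grp.create_dataset("alignments", data=records,
                                  compression="gzip")   # compact binary storage
        dset.attrs["reference"] = "hg18"                # metadata lives with the data

    # Reading back: slice the table directly, with no text parsing.
    with h5py.File("alignments.h5", "r") as f:
        table = f["experiment_001/alignments"][:]
        print(table["chrom"], table["pos"])

Keeping the data in one structured, typed container like this is the general direction the rest of this series explores.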

The next post discusses how this problem can be addressed.
