In this third installment of the bloginar, results from our initial work with HDF5 are presented. Previous posts have provided an introduction to the series and background on NGS.
Working with NGS data
With the exception of de novo sequencing, in which novel genomes or transcriptomes are assembled and analyzed, many NGS applications can be thought of as quantitative assays in which DNA sequences serve as highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data, and quantitative information is obtained by tabulating the frequency, positions, and base-level variation of the sequences in the alignments. Data tables from samples that differ by experimental treatment, environment, or population are then compared in different ways to make additional discoveries and draw conclusions. Whether the assay measures gene expression, studies small RNAs, probes gene regulation, or quantifies genetic variation, a similar process is followed:
- A large dataset of sequence reads is collected in a massively parallel format.
- The reads are aligned to reference data.
- Alignment data, stored in multiple kinds of output files, are parsed, reformatted, and computationally organized to create assay-specific reports.
- The reports are reviewed and decisions are made for how to work with the next sample, experiment, or assay.
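The tabulation at the heart of this process can be sketched in a few lines of Python. This is a deliberately simplified illustration, not production alignment code: the record layout (a start coordinate plus an ungapped read sequence) is a hypothetical stand-in for real alignment formats.

```python
from collections import Counter, defaultdict

def tabulate_alignments(alignments):
    """Tabulate base frequencies at each reference position.

    `alignments` is a list of (ref_start, read_sequence) pairs --
    a simplified stand-in for real alignment records (no gaps,
    no clipping, no mismatch annotation).
    """
    per_position = defaultdict(Counter)
    for ref_start, read in alignments:
        for offset, base in enumerate(read):
            per_position[ref_start + offset][base] += 1
    return per_position

# Toy alignments against a single reference sequence
alignments = [(0, "ACGT"), (2, "GTTA"), (2, "GTGA")]
freqs = tabulate_alignments(alignments)
print(dict(freqs[4]))  # bases observed at reference position 4
```

Position 4 is covered by the second and third reads, which disagree (T versus G), so the tally at that position exposes a candidate variant.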
Creating software that lets one view NGS results in single- and multi-sample contexts, drill into multiple levels of detail, and operate quickly and smoothly, while making it possible for IT administrators and PIs to predict development and research costs, requires storing raw data and the corresponding analysis results in structures that support the computational needs of the problems being addressed. To accomplish this goal, we can either develop a brand-new infrastructure to support our technical requirements and build new software on top of it, or build our software on an existing infrastructure and benefit from the experience gained in solving similar problems in other scientific fields.
Geospiza is following the latter path, using an open-source technology, HDF5 (Hierarchical Data Format, version 5), to develop highly scalable bioinformatics applications. Moreover, as we have examined past practices and considered our present and future challenges, we have concluded that technologies like HDF5 offer great benefits for the bioinformatics field. Toward this goal, Geospiza has initiated a research program in collaboration with The HDF Group to develop extensions to HDF5 that meet the specific requirements of genetic analysis.
Introducing a new technology into an infrastructure requires work. Existing tools need to be refactored and new development practices must be learned. Switching carries direct development costs for refactoring tools and learning new environments, as well as a time lag between learning the system and producing new features. Justifying such a commitment demands a return on investment: the new technology must offer clear advantages over current practice, such as improved system performance and/or new capabilities that are not easily achieved with existing approaches. HDF5 offers both.
With HDF5 technology, we will be able to create better-performing NGS data storage and high-performance data-processing systems, and to approach data analysis problems differently.
We'll consider system performance first. Current NGS systems store reads and associated data primarily in text-based flat files, and the vast majority of alignment programs also write their output as text-based flat files, creating the myriad challenges described earlier. When these data are instead stored in HDF5, a number of improvements can be achieved. Because the HDF5 software library and file format store data as compressed “chunks,” we can reduce storage requirements and access subsets of data more efficiently. For example, reads can be stored in arrays, making it possible to quickly compute values such as nucleotide frequency statistics at each base position across an entire multimillion-read dataset.
In the example presented, 9.3 million Illumina GA reads were stored in HDF5 as a compressed two-dimensional array, yielding a fourfold reduction in size compared to the original FASTA-formatted file. When the reads were aligned to a human genome reference, the flat-file system grew from 609 MB to 1033 MB. The HDF5-based system grew by 230 MB, to a total of 374 MB for all data and indices combined. In this simple example, the storage benefits of HDF5 are clear.
We can also demonstrate the benefits of more efficient data access. A common bioinformatics scenario is to align a set of sequences (queries) to a set of reference sequences (subjects) and then examine how the queries compare to a subject within a specific range. Software routines accomplish this by taking the name (or ID) of a subject sequence along with the beginning and ending positions of the desired range(s). The set of alignments is first searched for query sequences whose alignment coordinates fall within that range on the matching subject; the dataset of query sequences is then searched to retrieve the matching reads. When the data are stored in a non-indexed flat file, the entire file must be read to find the matching sequences, which takes, on average, half the time needed to read the whole file. In contrast, indexed data can be accessed in significantly less time, for two reasons: (1) far less data needs to be read to conduct the search, and (2) structured indices make the searches themselves more efficient.
In our example, the 9.3 million reads produced many millions of alignments when compared to the human genome. We tested the performance of retrieving read alignment information from the different file systems by accessing the alignments from successively smaller regions of chromosome 5. The entire chromosome contained roughly one million alignments. Retrieving the reads for the whole chromosome was slightly more efficient in HDF5 than in the flat-file system. However, as fewer reads were retrieved from smaller regions, the HDF5-based system performed significantly better. For HDF5, retrieval time decreased with the amount of data being retrieved, down to 15 ms, the minimum overhead of running the program that accesses the HDF5 file. Compared to the minimum access time for the flat file (735 ms), this is a ~50-fold improvement. As datasets continue to grow, the overhead of the HDF5 system will remain at 15 ms, whereas the overhead of flat-file systems will continue to increase.
The demonstrated performance advantages are not unique to HDF5. Similar results could be achieved by designing a data model for the reads and alignments and implementing it in a binary file format with indices for random access. A significant advantage of HDF5 is that the software to implement data models, build multiple kinds of indices, compress data in chunks, and read and write the file has already been built, debugged, and refined through more than 20 years of development. Hence, one of the most significant performance advantages of the HDF platform is the savings in development time. Reproducing a similar, perhaps more specialized, system would require many months (even years) of developing, testing, documenting, and refining the low-level software needed to make it well-performing, highly scalable, and broadly usable. In our experience with HDF5, we were able to learn the system, implement our data models, and develop the application code in a matter of weeks.
Consequently, we are spending more of our time solving the interesting challenges of analyzing millions of NGS reads from hundreds or thousands of samples (measuring gene expression, identifying alternatively spliced transcripts and small RNAs, studying regulation, calculating sequence variation, and linking summarized data to its underlying details), and a much smaller fraction of our time optimizing low-level infrastructure.
Additional examples of how HDF5 is changing our thinking will be presented next.