Sunday, July 12, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part V: Why HDF5?

Through the course of this BioHDF bloginar series, we have demonstrated how the HDF5 (Hierarchical Data Format) platform can successfully meet current and future data management challenges posed by Next Generation Sequencing (NGS) technologies. We now close the series by discussing the reasons why we chose HDF5.

For previous posts, see:

  1. The introduction
  2. Project background
  3. Challenges of working with NGS data
  4. HDF5 benefits for working with NGS data

Why HDF5?

As previously discussed, HDF technology is designed for working with large amounts of complex data that naturally organize into multidimensional arrays. These data are composed of discrete numeric values, strings of characters, images, documents, and other kinds of data that must be compared in different ways to extract scientific information and meaning. Software applications that work with such data must meet a variety of organization, computation, and performance requirements to support the communities of researchers where they are used.

When software developers build applications for the scientific community, they must decide between creating new file formats and software tools for working with the data, or adapting existing solutions that already meet a general set of requirements. The advantage of developing software that's specific to an application domain is that highly optimized systems can be created. However, this advantage can disappear when significant amounts of development time are needed to deal with the "low-level" functions of structuring files, indexing data, tracking bits and bytes, making the system portable across different computer architectures, and creating a basic set of tools to work with the data. Moreover, such a system would be unique, with only a small set of users and developers able to understand and share knowledge concerning its use.

The alternative to building a highly optimized domain-specific application system is to find and adapt existing technologies, with a preference for those that are widely used. Such systems benefit from the insights and perspective of many users and will often have features in place before one even realizes they are needed. If a technology has widespread adoption, there will likely be a support group and knowledge base to learn from. Finally, it is best to choose a solution that has been tested by time. Longevity is a good measure of the robustness of the various parts and tools in the system.

HDF: 20 Years in Physical Sciences

Our requirements for a high-performance data management and computation system are these:

  1. Different kinds of data need to be stored and accessed.
  2. The system must be able to organize data in different ways.
  3. Data will be stored in different combinations.
  4. Visualization and computational tools will access data quickly and randomly.
  5. Data storage must be scalable, efficient, and portable across computer platforms.
  6. The data model must be self describing and accessible to software tools.
  7. Software used to work with the data must be robust and widely used.

HDF5 is a natural fit. The file format and software libraries are used in some of the largest data management projects known to date. Because of its strengths, HDF5 is independently finding its way into other bioinformatics applications and is a good choice for developing software to support NGS.

HDF5 software provides a common infrastructure that allows different scientific communities to build specific tools and applications. Applications using HDF5 typically contain three parts: one or more HDF5 files to store data, a library of software routines to access the data, and the tools, applications, and additional libraries to carry out functions that are specific to a particular domain. To implement an HDF5-based application, a data model must be developed along with application-specific tools such as user interfaces and unique visualizations. While implementation can be a lot of work in its own right, the tools to implement the model and provide scalable, high-performance programmatic access to the data have already been developed, debugged, and delivered through the HDF I/O (input/output) library.

In earlier posts, we presented examples where we needed to write software to parse FASTA-formatted sequence files and output files from alignment programs. These parsers then called routines in the HDF I/O library to add data to the HDF5 file. During the import phase, we could set different compression levels and define the chunk size to compress our data and optimize access times. In these cases, we developed a simple data model based on the alignment output from programs like BWA, Bowtie, and MapReads. Most importantly, we were able to work with NGS data from multiple platforms efficiently, with software that required weeks of development rather than the months and years that would be needed if the system were built from scratch.
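To make the role of chunk size and compression level more concrete, here is a minimal sketch of the general idea, written with only the Python standard library. It is not the HDF I/O library and not the code used in BioHDF; the function names (write_chunks, read_range), the chunk size of 4096 bytes, and the sample data are all assumptions made for illustration. The point it shows is that when data are compressed one chunk at a time, a reader only has to decompress the chunks that overlap the region it wants.

```python
import zlib


def write_chunks(data, chunk_size, level=6):
    """Split a byte string into fixed-size chunks and compress each one
    independently (hypothetical helper, not an HDF5 call)."""
    return [
        zlib.compress(data[i:i + chunk_size], level)
        for i in range(0, len(data), chunk_size)
    ]


def read_range(chunks, chunk_size, start, stop):
    """Return bytes [start, stop) by decompressing only the chunks
    that overlap the requested range."""
    if start >= stop:
        return b""
    first = start // chunk_size          # index of first chunk touched
    last = (stop - 1) // chunk_size      # index of last chunk touched
    data = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    offset = start - first * chunk_size  # position of start inside `data`
    return data[offset:offset + (stop - start)]


# Made-up sequence data standing in for imported reads.
reads = b"ACGTTGCAAGCT" * 1000

# Chunk size and compression level are the two knobs set at import time.
chunks = write_chunks(reads, chunk_size=4096, level=6)

# Random access: fetch 20 bases without decompressing the whole data set.
fragment = read_range(chunks, 4096, 5000, 5020)
```

Smaller chunks mean less data to decompress per random read but usually a poorer compression ratio, while larger chunks do the opposite. That trade-off is why chunk size and compression level are worth tuning for a given access pattern.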

While HDF5 technology is powerful "out-of-the-box," a number of features can still be added to make it better for bioinformatics applications. The BioHDF project is about making such domain-specific extensions. These are expected to include modifications to the general file format to better support variable data like DNA sequences. I/O library extensions will be created to help HDF5 "speak" bioinformatics by creating APIs (Application Programming Interfaces) that understand our data. Finally, sets of command line programs and other tools will be created to help bioinformatics groups get started quickly with using the technology.

To summarize, the HDF5 platform is well-suited for supporting NGS data management and analysis applications. Using this technology, groups will be able to make their data more portable for sharing because the data model and data storage are separated from the implementation of the model in the application system. HDF5's flexibility in the kinds of data it can store makes it easier to integrate data from a wide variety of sources. Integrated compression utilities and data chunking help HDF5-based systems scale to large data sets. Finally, because the HDF5 I/O library is extensive and robust, and the HDF5 toolkit includes basic command-line and GUI tools, the platform allows for rapid prototyping and reduced development time, making it easier to create new approaches for NGS data management and analysis.

For more information, or if you are interested in collaborating on the BioHDF project, please feel free to contact me (todd at geospiza.com).
