Over the next few posts I will share the slides from the presentation. This post begins with the abstract.
“If the data problem is not addressed, ABI’s SOLiD, 454’s GS FLX, Illumina’s GAII or any of the other deep sequencing platforms will be destined to sit in their air-conditioned rooms like a Stradivarius without a bow” was the closing statement in the lead Nature Biotechnology editorial “Prepare for the deluge” (Oct. 2008). The oft-stated challenges focus on the obvious problems of storing and analyzing data. However, the problems run much deeper than these short descriptions portray. True, researchers are ill-prepared to confront the challenges of inadequate IT infrastructures, but the greater challenge is the lack of easy-to-use, well-performing software systems and interfaces that would allow researchers to work with data in multiple ways to summarize information and drill down into supporting details.
Meeting the above challenge requires well-performing software frameworks and underlying data management tools that store and organize data in better ways than complex mixtures of flat files and relational databases. Geospiza and The HDF Group are collaborating to develop open-source, portable, scalable bioinformatics technologies based on HDF5 (Hierarchical Data Format – http://www.hdfgroup.org). We call these extensible domain-specific data technologies “BioHDF.” BioHDF will implement a data model that supports primary DNA sequence information (reads, quality values, metadata) and the results from sequence alignment and variation detection algorithms. BioHDF will extend HDF5 data structures and library routines with new features (indexes, additional compression, graph layouts) to support the high-performance data storage and computation requirements of Next Gen Sequencing.
For close to 20 years, HDF data formats and software infrastructure have been used to manage and access high-volume, complex data in hundreds of applications, from flight testing to global climate research. The BioHDF effort is leveraging these strengths. We will show data from small RNA and gene expression analyses that demonstrate HDF5’s value for reducing the space, time, bandwidth, and development costs associated with working with Next Gen Sequence data.
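To make the idea in the abstract more concrete, here is a minimal sketch of how reads, quality values, and metadata might sit together in a single HDF5 file. It uses the h5py Python bindings and NumPy. The group and dataset names (`sample1`, `reads`, `quality`), the `platform` attribute, and the `write_reads` helper are assumptions made up for illustration; this is not the BioHDF data model or API, which is not described here.

```python
import h5py
import numpy as np


def write_reads(h5file, sample, reads, quals):
    """Store equal-length reads and per-base qualities under /<sample>/.

    Hypothetical layout for illustration only, not the BioHDF schema.
    """
    grp = h5file.create_group(sample)
    # Reads as fixed-length byte strings, one element per read.
    grp.create_dataset("reads", data=np.array(reads, dtype="S"),
                       compression="gzip")
    # Quality values as a (reads x bases) matrix of small integers.
    grp.create_dataset("quality", data=np.array(quals, dtype=np.uint8),
                       compression="gzip")
    # Metadata travels with the data as an HDF5 attribute.
    grp.attrs["platform"] = "hypothetical"
    return grp


# An in-memory file (core driver, no backing store) keeps the sketch
# free of side effects on disk.
with h5py.File("example.h5", "w", driver="core", backing_store=False) as f:
    write_reads(
        f,
        "sample1",
        [b"ACGT", b"TTGA"],
        [[30, 31, 32, 33], [20, 21, 22, 23]],
    )
    # Slicing pulls out a single read without loading the whole dataset.
    second_read = f["sample1/reads"][1]
    second_qual = f["sample1/quality"][1].tolist()
```

The point of the sketch is the access pattern: the file is organized hierarchically, and one read and its quality row can be retrieved by index rather than by scanning a flat file.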
The next posts will cover:
- Why NGS is exciting and the challenges that can be overcome with HDF5
- What the BioHDF project is and some examples of what we are doing with HDF5
- Some background on HDF5 (Hierarchical Data Format)