FinchTalk: Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part I: Introduction

Sunday, June 7, 2009

At the end of May, the DOE's Los Alamos National Laboratory hosted its 4th annual Sequencing, Finishing and Analysis meeting in Santa Fe, New Mexico. We participated in the conference by presenting our work on using HDF5 to develop scalable software for Next Generation DNA Sequencing (NGS) analysis.

Over the next few posts I will share the slides from the presentation. This post begins with the abstract.

Abstract

“If the data problem is not addressed, ABI’s SOLiD, 454’s GS FLX, Illumina’s GAII or any of the other deep sequencing platforms will be destined to sit in their air-conditioned rooms like a Stradivarius without a bow” was the closing statement in the lead Nature Biotechnology editorial “Prepare for the deluge” (Oct. 2008). The oft-stated challenges focus on the obvious problems of storing and analyzing data. However, the problems run much deeper than these short descriptions suggest. True, researchers are ill-prepared to confront the challenges of inadequate IT infrastructures, but the greater challenge is the lack of easy-to-use, well-performing software systems and interfaces that would allow researchers to work with data in multiple ways, both to summarize information and to drill down into supporting details.

Meeting the above challenge requires well-performing software frameworks and underlying data management tools that store and organize data in better ways than complex mixtures of flat files and relational databases. Geospiza and The HDF Group are collaborating to develop open-source, portable, scalable bioinformatics technologies based on HDF5 (Hierarchical Data Format, http://www.hdfgroup.org). We call these extensible domain-specific data technologies “BioHDF.” BioHDF will implement a data model that supports primary DNA sequence information (reads, quality values, metadata) and the results from sequence alignment and variation detection algorithms. BioHDF will extend HDF5 data structures and library routines with new features (indexes, additional compression, graph layouts) to support the high-performance data storage and computation requirements of Next Gen Sequencing.

For close to 20 years, HDF data formats and software infrastructure have been used to manage and access high-volume, complex data in hundreds of applications, from flight testing to global climate research. The BioHDF effort is leveraging these strengths. We will show data from small RNA and gene expression analyses that demonstrate HDF5’s value for reducing the space, time, bandwidth, and development costs associated with working with Next Gen Sequence data.

The next posts will cover:

Why NGS is exciting and challenges that can be overcome with HDF5
What the BioHDF project is and some examples of what we are doing with HDF5
Some background on HDF5 (Hierarchical Data Format)

2 comments:

kevin said...: I am just beginning to look for specs for a cluster computer to analyze NGS data, and I can verify that this is going to be a big problem. Basically everyone says more is better, but not everyone has the expertise and budget for MORE. I am interested to find out how NGS data crunching can be made more accessible to all.; November 30, 2009 at 11:07 PM
Todd Smith said...: One way to make NGS data crunching available to all is to use the power of computers in the cloud. We are helping a lot of people with their NGS data analysis this way. I encourage you to visit our info page: http://www.geospiza.com/Contact/moreinfo.shtml to learn more.; December 2, 2009 at 7:53 PM