Tuesday, March 17, 2009

Introducing BioHDF

Today we released news of our funding for the BioHDF project. This project, in collaboration with The HDF Group, is focused on developing scalable bioinformatics technologies to support the many current and emerging Next Generation Sequencing (NGS) applications such as Transcription Profiling, Digital Gene Expression, Small RNA Analysis, Copy Number Variation and Resequencing.

You might ask, isn’t Geospiza already doing everything listed above? The answer is yes, but there is more to NGS work than meets the eye.

When we work with NGS data today, we do so with software and systems that are inefficient in three important areas: Space, Time, and Bandwidth. Although people are managing to get by, as throughput continues to increase, space, time, and bandwidth issues are going to become more acute and will lead to disproportionately higher software development and data management costs. BioHDF will help solve the space, time, and bandwidth issues at the infrastructure level.

Space issues are related to storing the data. As we know, NGS systems create a lot of data. Less well understood is that when the data are processed, alignment programs produce outputs equal to or larger than the original input files. Thus, one challenge groups face is that despite planning for a certain amount of storage, after they've run a few programs they are surprised to find they’ve run out of space, sometimes even in the middle of a one-day program run :-(

Current practices for computing alignments also create time inefficiencies in dealing with NGS data. Today’s algorithms are improving, but they still largely require that the data be held in computer memory (RAM) during processing. When a problem gets too large to fit in RAM, the system falls back to using disk in its place. This process, known as swapping, kills performance, and jobs often have to be terminated. In many cases the problem is handled by breaking it into smaller units for computation, writing scripts to track the pieces and steps, and putting the output files together at the end, as sketched below.
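As a rough illustration of that split-align-merge pattern, the sketch below chunks a FASTQ file, runs an aligner on each piece, and concatenates the per-chunk results. The align_chunk command line and file names are placeholders for whatever aligner is actually in use, not a real tool's interface.

```python
# Sketch of the split/align/merge workflow described above.
# "align_chunk" and its flags are hypothetical placeholders.
import subprocess
from pathlib import Path

def split_fastq(fastq_path, chunk_dir, reads_per_chunk=1_000_000):
    """Split a FASTQ file (4 lines per read) into numbered chunk files."""
    chunk_dir = Path(chunk_dir)
    chunk_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths, lines, index = [], [], 0
    with open(fastq_path) as src:
        for line in src:
            lines.append(line)
            if len(lines) == 4 * reads_per_chunk:
                chunk_paths.append(_write_chunk(chunk_dir, index, lines))
                lines, index = [], index + 1
        if lines:
            chunk_paths.append(_write_chunk(chunk_dir, index, lines))
    return chunk_paths

def _write_chunk(chunk_dir, index, lines):
    path = chunk_dir / f"reads_{index:04d}.fastq"
    path.write_text("".join(lines))
    return path

def align_chunks(chunk_paths, reference, out_path):
    """Align each chunk separately, then concatenate the results."""
    with open(out_path, "w") as merged:
        for chunk in chunk_paths:
            result = chunk.with_suffix(".aln")
            # Placeholder command; substitute the aligner actually in use.
            subprocess.run(["align_chunk", "-r", reference,
                            "-i", str(chunk), "-o", str(result)], check=True)
            merged.write(result.read_text())

if __name__ == "__main__":
    chunks = split_fastq("run.fastq", "chunks")
    align_chunks(chunks, "genome.fa", "run_alignments.txt")
```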

Additionally, many programs require that data be in a certain format prior to processing. As files get ever larger, the time needed to reformat data adds significantly to the computational burden.
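A simple example of this kind of reformatting step: stripping a FASTQ file down to FASTA because a downstream tool accepts only FASTA. On a multi-gigabyte file, even this single pass over the data takes real time. The file names are illustrative.

```python
# Convert FASTQ to FASTA by keeping only the header and sequence lines.
def fastq_to_fasta(fastq_path, fasta_path):
    with open(fastq_path) as src, open(fasta_path, "w") as dst:
        for i, line in enumerate(src):
            if i % 4 == 0:      # header line: @read_id -> >read_id
                dst.write(">" + line[1:])
            elif i % 4 == 1:    # sequence line is kept
                dst.write(line)
            # the "+" line and the quality values are discarded

fastq_to_fasta("run.fastq", "run.fasta")
```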

Bandwidth is a measure of the data transfer rate. Small files transfer quickly, but when they get larger, there is a noticeable lag between the start and finish. With NGS, the data transfer rate becomes a significant factor in systems planning. Some groups have gone to great lengths to improve their networks when setting up their labs for NGS. In our cloud computing services, we use specialized software to improve data transfer rates and ensure transfers are complete because tools like ftp are not robust enough to reliably handle NGS data volumes.
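One generic way to confirm that a large file arrived intact is to compare checksums computed on the sending and receiving ends. The sketch below shows the idea; it is not the transfer software we use, just a minimal illustration.

```python
# Compute an MD5 checksum of a large file in blocks, so the whole file
# never needs to be held in memory. Compare the value before and after
# transfer to confirm the copy is complete.
import hashlib

def md5sum(path, block_size=8 * 1024 * 1024):
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Record md5sum("run.fastq") on the sending side, recompute it on the
# receiving side, and compare before deleting the source copy.
```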

BioHDF will help us work within space, time, and bandwidth constraints.

Meeting these challenges requires well-performing software frameworks and underlying data management tools that store and organize data in better ways than complex mixtures of flat files and relational databases. Geospiza and The HDF Group are collaborating to develop open-source, portable, scalable bioinformatics technologies based on HDF5 (Hierarchical Data Format). We call these extensible, domain-specific data technologies “BioHDF.” BioHDF will implement a data model that supports primary DNA sequence information (reads, quality values, metadata) and the results from sequence alignment and variation detection algorithms. BioHDF will extend HDF5 data structures and library routines with new features (indexes, additional compression, graph layouts) to support the high-performance data storage and computation requirements of Next Generation Sequencing.
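To give a feel for the general approach, the sketch below uses the h5py library to group reads, quality values, run metadata, and a derived alignment result in one HDF5 file. The group and dataset names are invented for the example; they are not the BioHDF data model or API.

```python
# Illustration of storing sequence data hierarchically in HDF5 with h5py.
# Names like "sequences" and "alignments" are example choices only.
import h5py
import numpy as np

reads = ["ACGTACGTAC", "TTGCAGGCAA"]
quals = [[30, 31, 32, 33, 30, 29, 28, 30, 31, 32],
         [28, 29, 30, 31, 32, 33, 30, 29, 28, 27]]

with h5py.File("example_run.h5", "w") as h5:
    seq_group = h5.create_group("sequences")
    # Fixed-length reads stored as a compressed 2-D byte array.
    bases = np.array([list(r.encode()) for r in reads], dtype=np.uint8)
    seq_group.create_dataset("bases", data=bases, compression="gzip")
    seq_group.create_dataset("quality",
                             data=np.array(quals, dtype=np.uint8),
                             compression="gzip")
    # Run-level metadata kept as attributes on the group.
    seq_group.attrs["instrument"] = "example_instrument"
    seq_group.attrs["read_length"] = 10

    # Derived results (here, one alignment position per read) can live
    # alongside the primary data in the same file.
    align_group = h5.create_group("alignments")
    align_group.create_dataset("position",
                               data=np.array([10234, 48710]),
                               compression="gzip")
```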
