Tuesday, February 26, 2008

The case for HDF

As we contemplate working with Next Gen data, we need to think about how we are going to store data and information efficiently. You might ask, what does that mean? Come on, a file is a file, isn't it?

Wrong

When one considers ways to work with data on a computer, two problems must be solved. The first is how to structure the data, also known as defining the data model or format. The second is how to implement the data model. The data model describes data entities, the attributes of an entity (time, length, type), and the relationships between entities. The implementation is how software programs and users will interact with the data (text files, binary files, relational databases, object databases, Excel, ...) to access data, perform calculations, and create information. Almost any kind of implementation can be used to solve any kind of problem, but in general, each type of problem has a limited set of optimal implementations. The factors that affect implementation choices include ease of use, scalability (time, space, and complexity), and application requirements (reads, writes, data persistence, updates ...). The rapid increase in the volume of data being collected with current and future Next Gen sequencing technologies raises significant scalability and complexity issues, and that is where solutions like HDF become attractive. To understand why, let's first look at some of the common data handling methods and discuss their advantages and disadvantages for working with data.
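Before surveying those methods, a small sketch may make the data model versus implementation distinction concrete. The Python below is purely illustrative; the Read and Sample names are hypothetical and not taken from any particular format.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Read:                       # entity: a single sequencing read
        name: str                     # attribute
        bases: str                    # attribute
        qualities: List[int]          # attribute

    @dataclass
    class Sample:                     # entity: the sample the reads came from
        sample_id: str
        reads: List[Read] = field(default_factory=list)  # relationship: one sample, many reads

    # The implementation question is separate: these same objects could be
    # serialized to text, packed into a binary file, stored in a relational
    # database, or written to an HDF5 file.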

Text is easy
The most common implementation is text. Text formats dominate bioinformatics applications, for good reason. Text is human readable, can represent numbers as accurately as desired, and is compatible with common utilities (e.g., grep), text editors, and languages like Perl. However, using text for high volume, complex data is inefficient and problematic. Computations must first convert data from text to binary, and, compared to binary data, it takes nearly three times as much space to represent an integer as text. Further, text files cannot represent complex data structures and relationships in ways that are easy to navigate computationally. Since scientific data are hierarchical in nature, systems that rely on text-based files often have multiple redundant methods for storing information. Text files also lack random access; an object in a text file can be found only by reading the file from the beginning. Hence, almost all text file applications require that the entire file be read into memory, which seriously limits the practical size of a text file. Finally, the ease of creating new text-based formats leads to obscure formats that proliferate rapidly, resulting in an interoperability nightmare in which nearly every application must be accompanied by translators that can import data from other applications.
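A quick Python sketch illustrates the size and conversion overhead mentioned above. The value chosen is arbitrary; the exact ratio depends on the numbers being stored.

    import struct

    value = 1048576                         # e.g., a base position or a count

    as_binary = struct.pack('<i', value)    # fixed 4 bytes for a 32-bit integer
    as_text = str(value).encode('ascii')    # 7 bytes as decimal text, plus any delimiter

    print(len(as_binary), len(as_text))     # 4 vs 7

    # Reading the value back from text also requires a parsing step,
    # whereas the binary form can be unpacked directly.
    parsed = int(as_text)
    unpacked, = struct.unpack('<i', as_binary)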

XML must be better
The next step up from plain text is XML. XML is gaining popularity as a powerful data description language that uses standard tags to define the structure and content of a file. XML can be used to store data, is very good at describing data and relationships, and works well when the amount of data is small. However, since XML is text-based, it has the same shortcomings as text files and is unsuitable for large data volumes.

How about a database?
For many bioinformatics applications, commercial and open-source database management systems (DBMSs) provide easy and efficient access to data; scientific tools submit declarative queries to the DBMS, which optimizes and executes the queries. Although DBMSs are excellent for many uses, they are often not sufficient for applications involving complex, high-volume data. Traditional DBMSs require significant performance tuning for scientific applications, a task that scientists are neither well prepared for, nor eager to address. Since most scientific data are write-once and read-only thereafter, traditional relational concurrency control and logging mechanisms can add substantially to computing overhead. High-end database software, such as software for parallel systems, is especially expensive, and is not available for the leading-edge parallel machines popular in scientific computing centers. Most importantly, the complex data structures and data types needed to represent the data and relationships are not supported well in most databases.

Let's go Bi
Binary formats are efficient for random data reading and writing, computation, storing numeric values, and data organization. These formats are used to store a significant amount of the data generated by analytical instruments and the data created by desktop applications. Many of these formats, however, are either proprietary or not publicly documented, limiting access to the data. This problem is often addressed, unsatisfactorily, through time-consuming and error-prone reverse engineering efforts that can violate license agreements.

Next Gen needs a different approach
Indeed, the Next Gen community is arriving at the need to use binary file formats to improve data storage, access, and computational efficiency. The Short (Sequence) Read Format (SRF) is an example where a data format and a binary file implementation are being used to develop efficient methods for storing reads. Popular tools such as Maq also convert text files to binary files prior to computation to improve data access efficiency.

Wouldn't it be nice if we had a common binary format to implement our data models?

Why HDF?
For the reasons outlined above, Geospiza felt it would be worthwhile to explore general-purpose, open-source, binary file storage technologies, and looked to other scientific communities to learn how similar problems were being addressed. That search identified HDF (hierarchical data format) as a candidate technology. Initially developed in 1988 for storing scientific data, HDF is well established in many scientific fields, and bioinformatics applications that utilize HDF can benefit from its long history and infrastructure of existing tools.

Geospiza, together with The HDF Group, conducted a feasibility study to examine whether or not HDF would be helpful for addressing the data management problems of large volumes and complexity in biological data. The first test case looked at both issues by working with a large-volume, highly complex system for DNA sequencing-based SNP discovery. Through this study, HDF's strengths and data organization features (groups, sets, multidimensional arrays, transformations, linking objects, and general data storage for other binary data types and images) were evaluated to determine how well they would handle SNP data. In addition to the proposed feasibility project with SNP discovery, other test cases were added, in collaboration with the NCBI, to test the ability of HDF to handle extremely large datasets. These addressed working with HapMap data and performing chromosome-scale LD (linkage disequilibrium) calculations.
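To give a flavor of what groups, datasets, and attributes look like in practice, here is a minimal sketch using the h5py Python bindings for HDF5. The file, group, and dataset names are hypothetical and are not the schema used in the feasibility study.

    import h5py
    import numpy as np

    with h5py.File('snp_example.h5', 'w') as f:
        run = f.create_group('run_001')                  # groups provide hierarchy
        run.attrs['instrument'] = 'example sequencer'    # attributes hold metadata

        # A multidimensional array of per-cycle quality scores (reads x cycles),
        # stored with chunking and built-in compression.
        quals = run.create_dataset(
            'quality_scores',
            data=np.random.randint(0, 40, size=(1000, 36)),
            chunks=True,
            compression='gzip')

        # A table-like dataset of candidate SNPs using a compound data type.
        snp_dtype = np.dtype([('chrom', 'S8'), ('pos', 'i8'), ('score', 'f4')])
        snps = run.create_dataset('snps', shape=(100,), dtype=snp_dtype)

        # Datasets support random access; a slice can be read without
        # loading the whole file into memory.
        first_read = quals[0, :]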

That story is next ...

4 comments:

Unknown said...

Todd, any new info on this effort? From what I've explored with HDF5 this seems like a great fit for a number of data types we're seeing.

Todd Smith said...

John,

Right now a concerted effort on the HDF project is waiting on NIH funding. It did not make the budget for the 2008 fiscal year, but has high priority for the next cycle. That is, of course, after Congress passes a budget, which historically has happened in October. As this is an election year, we expect it later, perhaps the end of the year or Jan. 2009. That said, we are open to conversations to gather requirements and also to see what can be done now with HDF5.

Anonymous said...

Do you mind if I quote a few of your articles as long as I provide credit and sources back to your webpage? My website is in the exact same niche as yours and my users would really benefit from a lot of the information you present here. Please let me know if this is alright with you. Many thanks!



Todd Smith said...

Please do use the material. Glad you like it.

Todd
