Organizing Data - A resequencing project begins with a region of a genome, a gene or set of genes that will be studied. A researcher will have sets of samples from patient populations from which they will isolate DNA. PCR is used to amplify specific regions of DNA so that both chromosomal copies can be sequenced. The read data, obtained from chromatograms, are aligned to other reads and also to a reference sequence. Quality and trace information are used to predict whether the differences observed between reads and reference data are meaningful. The general data organization can be broken into a main part called the Gene Model and within it, two sub organizations: the reference and experimental data. The reference consists of the known state of information. Resequencing, by definition, focuses on comparing new data with a reference model. The reference model organizes all of the reference data including the reference sequence and annotations.
The sub organizations of data can be stored in HDF5 using its standard features. The two primary objects in HDF5 are "groups" and "datasets." The HDF5 group object is akin to a UNIX directory or Windows folder – it is a container for other objects. An HDF5 dataset is an array structure for storing data and has a rich set of types available for defining the elements in an HDF dataset array. They include simple scalars (e.g., integers and floats), fixed- and variable-length structures (e.g., strings), and "compound datatypes." A compound datatype is a record structure, whose fields can be any other datatype, including other compound types. Datasets and groups can contain attributes, which are simple name/value pairs for storing metadata. Finally, groups can be linked to any dataset or group in an HDF file, making it possible to show relationships among objects. These HDF objects allowed us to create an HDF file whose structure matched the structure and content of the Gene Model structure. Although the content of a Gene Model file is quite extensive and complex, the grouping and dataset structure of HDF makes it very easy to see the overall organization of the experiment. Since all the pieces are in one place, an application, or someone browsing the file, can easily find and access the particular data of interest. The figure to the left shows the HDF5 file structure we used. The ovals represent groups and the rectangles represent datasets. The grayed out groups (Genotyping, Expression, Proteomics) were not implemented.
Accessing the Data - HDF5's feasibility, and several advantageous features, are demonstrated by a screen shot obtained using HDFView, a cross platform Java-based application, that can be used view data stored in HDF5. This image below highlights the ability of HDF5, and HDF5-supporting technologies to meet the following requirements:
- Support combining a large number of files
- Provide simple navigation and access to data objects
- Support data analysis
- Integrate diverse data types
The left most panel of the screen (below) presents an "explorer" view of four HDF5 files (HapMap, LD, ADRB2, and FVIII), with their accompanying groups and datasets. Today, researchers store these data in separate files scattered throughout file systems. To share results with a colleague, they e-mail multiple spreadsheets or tab-delimited text files for each table of data. When all of the sequence data, basecall tables, assemblies, and genotype information are considered, the number of files becomes significant. For ADRB2 we combined the data from 309 individual files into a single HDF5 file. For FVIII, a genotyping study involving 39 amplicons and 137 patient samples, this number grows to more than 60,000 primary and versioned copied files.
With HDF5 these data are encapsulated in a single file thus simplifying data management in increasing data portability.
Example screen from the prototype demo. HDFView, a JAVA viewer for HDF5 files can display multiple HDF5 files and for each file, the structure of the data in the file. Datasets can be shown as tables, line plots, histograms and images. This example shows a HapMap dataset, LD calculations for a region of chromosome 22 and the data from two resequencing projects. The HapMap dataset (upper left) is a 52,636-row table of alleles from chromosome 22. Below it is an LD plot from the same chromosome. The resequencing projects, adrb2 and factor 8, show reference data and sequencing data. The table (middle) is a subsection of the poly table obtained from Phred. Using the line plot feature in HDFView, sub sections of the table were graphed. The upper right graph compares the called base peak areas (black line, top) to the uncalled peak areas (red, bottom) for the entire trace. The middle right graph highlights the region between 250 and 300 bases. The large peak at position 36 (position 286 in the center table, and top graph) represents a mixed base. The lower right graph is a "SNPshot" showing the trace data surrounding the variation.
In addition to reducing file management complexity, HDF5 and HDFView have a number of data analysis features that make it possible to deliver research-quality applications quickly. In the ADRB2 case, the middle table in the screen shot is a section of one of the basecall tables produced by Phred using its "-d" option. This table was opened by selecting the parent table and defining the columns and region to display. As this is done via HDF5's API, it is easy to envision a program "pulling" relevant slices of data from a data object, performing calculations from the data slices and storing the information back as a table that can be viewed from the interface. Also important, the data in this example are accessible and not "locked" away in a proprietary system. Not only is HDF an open format, HDFView allows one to export data as HDF subsets, images, and tab delimited tables. HDFView's copy functions allow one to easily copy data into other programs like Excel.
HDFView can produce basic line graphs that can be used immediately for data analysis, such as the two that are shown here for ADRB2. The two plots, in the upper right corner of the screen show the areas of the peaks for called (black, upper line) and uncalled (red, lower line) bases. The polymorphic base can be seen in the top plot as a spike in the secondary peak area. The lower graph contains the same data, plotted from the region between base 250 and 300 of the read. This plot shows a high secondary peak with a concomitant reduction in the primary peak area. One of PolyPhred's criteria for identifying a heterozygous base, that primary and secondary peak areas are similar, is easily observed. The significance of this demonstration is that HDF5 and HDFView have significant potential in algorithm development, because they can provide rapid access to different views of the data.
More significantly HDFView was used without any modifications demonstrating the benefit of a standard implementation system like HDF5.
Combining Diverse Data - The screen shot also shows the ability of HDF5 to combine and present diverse data types. Data from a single file containing both SNP discovery projects are shown, in addition to HapMap data (chromosome 22) and an LD plot consisting of a 1000 x 1000 array of LD values from a region of chromosome 22.
As we worked on this project, we became more aware of the technology limitations that hinder work with the HapMap and very large LD datasets and concluded that the HapMap data would provide an excellent test case for evaluating the ability of HDF5 to handle extremely large datasets.
For this test, a chromosome 22 genotype dataset was obtained from HapMap.org. Uncompressed, this is a large file (~24MB), consisting of a row of header information followed by approximately 52,000 rows of genotyped data. As a text file, it is virtually indecipherable and needs to be parsed and converted to a tabular data structure before the data can be made useful. When one considers that even for the smallest chromosome, the dataset is close to Microsoft Excel's (2003, 2004) row limit (65,535), the barrier to entry for the average biologist wishing to the use this data is quite high.
To put the data into HDF5 we made an XML schema of the structure to understand the model, built a parser, and loaded the data. As can be seen in HDFView (Fig. 6), the data were successfully converted from a difficult-to-read text form to a well-structured table where all of
the data can be displayed. At this point, HDFView can be used to extract and analyze subsets of information.
The LD test demonstrates the power of HDF5's collection of sophisticated storage structures. We have seen that we can compress datasets inside an HDF5 file, but we also see that compressing an entire dataset creates access problems. HDF5 deals with this problem through a storage feature called "chunking." Whereas most file formats store an array as a contiguous stream of bytes, chunking in HDF5 involves dividing a dataset into equal-sized "chunks" that are stored and accessed separately.
LD plot for chromosome 22. A 1000 x 1000 point array of LD calculations is shown. The table in the upper right shows the LD data for a very small region of the plot, demonstrating HDF's ability to allow one to easily select "slices" of datasets.
Chunking has many important benefits, two of which apply particularly to large LD arrays. Chunking makes it possible to achieve good performance when accessing subsets of the datasets, even when the chosen subset is orthogonal to the normal storage order of the dataset. If a very large LD array is stored contiguously, most sub-setting operations require a large number of accesses, as data elements with spatial locality can be stored a long way from one another on a disk, requiring many seeks and reads. With chunking, spatial locality is preserved in the storage structure, resulting in faster access to subsets. When a data subset is accessed, only the chunks that contain specific portions of the data need to be un-compressed. The extra memory and time required to uncompress the entire LD array are both avoided.
Using both chunking and compression, HDF5 compressed the data in the chromosome 22 LD array from 5.2 gigabytes to 300 megabytes, a 17-fold decrease in storage space. Furthermore, the LD array was then immediately available for viewing in HDFView, where it was converted to an image with color intensity used to show higher linkage. The display also shows a table of LD values corresponding to a subset of the larger LD plot. In HDFView, one can "box" select a region of pixels from an image and use it to create a subset of data. This is an important feature, as it will be impossible to view an entire chromosome LD plot at single pixel resolution. Thus, a matrix of lower resolution regions will need to be created and viewed in HDFView. The lower resolution image can highlight regions of high LD and, using a tool like HDFView, one can then select those regions and drill down into the underlying data.
HDF5 has many practical benefits for bioinformatics. As a standardized file technology data models can be implemented and tools like HDFView can be used to quickly visualize the organization of the data and results of computation. Computational scientists can develop new algorithms faster because they do not have to invest time developing new formats and new GUIs to view their data. The community can benefit because data become more portable. Finally, HDF is well suited for enhancing application performance through data compression, chunking, and memory mapping. Many of these features will become extremely valuable as Next Gen technologies push the volumes of data to higher and higher levels.