Last month, at the Department of Energy's Sequencing, Finishing and Analysis in the Future meeting, I presented Geospiza's product development work and how BioHDF is contributing to scalable infrastructures. The abstract, presentation, and link to the presentation are posted below.
Abstract
Next Generation DNA Sequencing (NGS) technologies are powerful tools for rapidly sequencing genomes and studying functional genomics. Presently, the value of NGS technology has been largely demonstrated on individual sample analyses. The full potential of NGS will be realized when it can be used in multisample experiments that involve different measurements and include replicates and controls to make valid statistical comparisons. Arguably, improvements in current technology, and soon-to-be-available “third” generation systems, will make it possible to simultaneously measure hundreds to thousands of individual samples in single experiments to study transcription, alternative splicing, and how sequences vary between individuals and within expressed genes. However, several bioinformatics systems challenges must be overcome to effectively manage both the volumes of data being produced and the complexity of processing the numerous datasets that will be generated.
Future bioinformatics applications need to be developed on common standard infrastructures that are self-describing and can reduce overall data storage, increase data processing performance, and integrate information from multiple sources. HDF technologies meet all of these requirements, have a long history, and are widely used in data-intensive science communities. They consist of general data file formats, software libraries, and tools for manipulating the data. Compared to emerging standards such as the SAM/BAM formats, HDF5-based systems demonstrate improved I/O performance and improved methods to reduce data storage. HDF5 is also more extensible and can support multiple data indexes and store multiple data types. For these reasons, HDF5 and its BioHDF implementation are well qualified as standards for implementing data models in binary formats to support the next generation of bioinformatics applications. Through this presentation we will demonstrate BioHDF's latest features in NGS applications that target transcription analysis and resequencing.
Acknowledgments
Contributing Authors: Todd Smith (1), Christopher E Mason (2), Paul Zumbo (2), Mike Folk (3), Dana Robinson (3), Mark Welsh (1), Eric Smith (1), N. Eric Olson (1)
1. Geospiza, Inc., 100 West Harrison N. Tower 330, Seattle WA 98119
2. Department of Physiology and Biophysics, Weill Cornell Medical College, 1305 York Ave., New York NY 10021
3. The HDF Group, 1901 S. First St., Champaign IL 61820
Funding: NIH: STTR HG003792