Saturday, July 24, 2010

A Programmer's Perspective

Many analogies describe the human genome as the program that runs a computer. Jim Stalker, the Senior Scientific Manager in charge of vertebrate resequencing informatics at the Sanger center, eloquently described what that really means in his “Obligatory Tenuous Coding Analogy” during the Perl lightning talks at this year's OSCON.

The genome is the source of a program to build and run a human
But: the author is not available for comment
It’s 3GB in size
In a single line
Due to constant forking, there are about 7 billion different versions
It’s full of copy-and-paste and cruft
And it’s completely undocumented
Q: How do you debug it?


I thank Jim for sharing his slides. Post a comment if you think you know the answer. Jim’s slides will be posted at the O’Reilly OSCON slide site.

Wednesday, July 14, 2010

Increasing the Scale of Deep Sequencing Data Analysis with BioHDF

Last month, at the Department of Energy's Sequencing, Finishing and Analysis in the Future meeting, I presented Geospiza's product development work and how BioHDF is contributing to scalable infrastructures. The abstract, presentation, and link to the presentation are posted below.


Next Generation DNA Sequencing (NGS) technologies are powerful tools for rapidly sequencing genomes and studying functional genomics. Presently, the value of NGS technology has been largely demonstrated on individual sample analyses. The full potential of NGS will be realized when it can be used in multisample experiments that involve different measurements and include replicates and controls to make valid statistical comparisons. Arguably, improvements in current technology, and soon-to-be-available “third” generation systems, will make it possible to simultaneously measure hundreds to thousands of individual samples in single experiments to study transcription, alternative splicing, and how sequences vary between individuals and within expressed genes. However, several bioinformatics systems challenges must be overcome to effectively manage both the volumes of data being produced and the complexity of processing the numerous datasets that will be generated.

Future bioinformatics applications need to be developed on common standard infrastructures that can reduce overall data storage, increase data processing performance, and integrate information from multiple sources, and that are self-describing. HDF technologies meet all of these requirements, have a long history, and are widely used in data-intensive science communities. They consist of general data file formats, software libraries, and tools for manipulating the data. Compared to emerging standards such as the SAM/BAM formats, HDF5-based systems demonstrate improved I/O performance and improved methods to reduce data storage. HDF5 is also more extensible and can support multiple data indexes and store multiple data types. For these reasons, HDF5 and its BioHDF implementation are well qualified as standards for implementing data models in binary formats to support the next generation of bioinformatics applications. Through this presentation we will demonstrate BioHDF's latest features in NGS applications that target transcription analysis and resequencing.
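To make "self-describing" and "multiple data types" a little more concrete, here is a minimal sketch using the general-purpose h5py and NumPy Python packages. It is not the BioHDF API or schema: the group layout (`samples/<id>/reads`, `samples/<id>/qualities`), the `platform` attribute, and the function names are all hypothetical, chosen only for illustration.

```python
import h5py
import numpy as np


def write_sample(path, sample_id, reads, qualities, platform="unspecified"):
    """Append one sample to an HDF5 file (hypothetical layout, not BioHDF's)."""
    with h5py.File(path, "a") as f:
        # Intermediate groups ("samples") are created automatically.
        grp = f.create_group(f"samples/{sample_id}")
        # Attributes travel with the data, which is what makes the file
        # self-describing.
        grp.attrs["platform"] = platform
        # Two different data types side by side: fixed-length byte strings
        # for the reads and a 2-D uint8 array for the quality scores.
        grp.create_dataset(
            "reads", data=np.array(reads, dtype="S"), compression="gzip"
        )
        grp.create_dataset(
            "qualities",
            data=np.asarray(qualities, dtype=np.uint8),
            compression="gzip",
        )


def list_samples(path):
    """Return the sample identifiers stored in the file."""
    with h5py.File(path, "r") as f:
        return sorted(f["samples"].keys())


def read_sample(path, sample_id):
    """Read one sample back without touching the others."""
    with h5py.File(path, "r") as f:
        grp = f[f"samples/{sample_id}"]
        return {
            "reads": [r.decode("ascii") for r in grp["reads"][:]],
            "qualities": grp["qualities"][:],
            "attrs": dict(grp.attrs),
        }
```

Calling `write_sample("run.h5", "s1", ["ACGT", "TTGA"], [[30, 31, 32, 33], [20, 21, 22, 23]])` and then `read_sample("run.h5", "s1")` returns that one sample's reads, quality scores, and metadata. Adding hundreds more samples to the same file only adds sibling groups, so the multisample case keeps a single container with per-dataset compression.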

SciVee Video


Contributing Authors: Todd Smith (1), Christopher E Mason (2), Paul Zumbo (2), Mike Folk (3), Dana Robinson (3), Mark Welsh (1), Eric Smith (1), N. Eric Olson (1)

1. Geospiza, Inc. 100 West Harrison N. Tower 330, Seattle, WA 98119 2. Department of Physiology and Biophysics, Weill Cornell Medical College, 1305 York Ave., New York, NY 10021 3. The HDF Group, 1901 S. First St., Champaign, IL 61820

Funding: NIH: STTR HG003792