In my last post, I noted that the next generation (Next Gen) of DNA sequencers was creating a lot of excitement in the DNA sequencing community. In the next couple of posts, I want to share some of our plans for supporting Next Gen by discussing the poster we presented at the AGBT and ABRF conferences.
The general goal of the poster was to share our thoughts on how Next Gen data will have to be handled in order to develop more scalable and interoperable data processing software. It presented our work with HDF (Hierarchical Data Format) technology and how that fits into Geospiza's plans for meeting Next Gen data management challenges. The first phase, now complete, provides our customers with a solution that links samples to their data and puts in place the foundation for the second phase, which focuses on developing and integrating the scientific data processing applications that will make sense of the data.
Once a lab is up and running with Next Gen technology, it quickly faces the data management problem. Basic file system technology and file servers allow groups to store their data in nested directory structures. After a few runs, however, people realize that it gets really hard to know which data go with which run or with which sample - the Excel file storing that information gets lost, or the README file never gets written. The situation becomes even worse when Next Gen instruments are run in the context of a core lab. Now the problem is exacerbated because you need to make the data available to your customers. Do you set up an FTP site? Do you make Unix accounts on the file server for your end users? Do you deliver data on FireWire drives or multi-gigabyte flash drives? Or do you just do the work for your client and hope that they do not want to reanalyze their data?
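The bookkeeping problem described above can be made concrete with a short sketch. The manifest below stands in for the Excel or README file that so easily gets lost; all run, sample, and file names are hypothetical, invented for illustration:

```python
import json

# Hypothetical sketch of the ad hoc bookkeeping labs improvise before
# adopting a tracking system: a manifest mapping each run to its samples
# and their data files. The names here are illustrative only.
manifest = {
    "run_2008_03_14": {
        "platform": "Illumina GA",
        "samples": {
            "lib_A12": ["run_2008_03_14/lane1/s_1_sequence.txt"],
            "lib_B07": ["run_2008_03_14/lane2/s_2_sequence.txt"],
        },
    }
}

def files_for_sample(manifest, sample_id):
    """Return every data file recorded for a sample, across all runs."""
    return [path
            for run in manifest.values()
            for sid, paths in run["samples"].items() if sid == sample_id
            for path in paths]

# As long as the manifest survives, the sample-to-data link is answerable.
print(files_for_sample(manifest, "lib_A12"))
```

The fragility is obvious: the moment this file is misplaced or falls out of date, the directory tree alone cannot answer "which data belong to which sample" - which is exactly the gap a tracking system has to close.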
Geospiza has solved the first part of the problem. Our new product, FinchLab Next Gen Edition, allows labs to track sample preparation (DNA library construction, reagent kit tracking, and workflow organization) and link data to runs and samples. FinchLab Next Gen Edition also provides interfaces with which core labs can create a variety of order forms for any kind of service, linking data to runs, samples, and orders and making the data accessible to outside parties (customers) through the FinchLab web browser interface. And all of this can be done without any custom programming services. Over the next few weeks, I'll fill in the details on how we do that. For now, I'll focus on the poster, but make one final important point: FinchLab Next Gen Edition not only interoperates with all of the current Next Gen instruments, it also allows labs to integrate these data with current Sanger sequencing technologies in a common web interface.
With sample tracking and data management under control, the next challenge becomes what to do with the data that are collected. The scientific community is just at the beginning of this journey. In the past, bioinformatics efforts have emphasized algorithm (tool) development over the implementation details associated with data management, organization, and intuitive user interfaces. The result is software systems, built from point solutions, that do not adequately address problems outside of “expert” organizations. If scientists want to work with sequence data to understand a biological research problem, they must overcome the challenges of moving data between complex software programs, reformatting unstructured text files, traversing web sites, and writing programs and scripts to mine new output files before they can even begin to gain insights into their problem of interest. While formats and standards have always been discussed and debated, many people working with Next Gen data understand that the "single point" solution approaches of the past will not scale to today's problems.
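The "writing scripts to mine output files" burden mentioned above is familiar to anyone who has worked with sequence data. A typical example is the throwaway parser every lab ends up rewriting to pull records out of an unstructured text format such as FASTA; the sketch below is illustrative, not tied to any particular pipeline:

```python
# Hypothetical example of the glue code scientists write again and again:
# a minimal parser that turns unstructured FASTA text into (header, sequence)
# records that a downstream tool can consume.
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

records = list(parse_fasta(">read_1\nACGT\nACGT\n>read_2\nTTGCA\n"))
```

Every tool that emits its own flat-text format forces this kind of one-off parsing and reformatting, which is precisely the scaling problem a structured, standardized container is meant to remove.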
That's where HDF fits in. It is clear to the community that new software systems with scalable data models, APIs, software tools, and viewers are needed to support the very large datasets being created by new technology. Geospiza is working with The HDF Group (THG, www.hdfgroup.org) to deliver these capabilities by building on a recognized technology that has proven its ability to meet similar scalability demands in other areas of science. We call the extensible, domain-specific data technologies that will be built "BioHDF." BioHDF will provide the DNA sequencing community with a standardized infrastructure to support high-throughput data collection and analysis, and it engages an informatics group (THG) that is highly experienced in large-scale data management issues. The technology will make it possible to overcome current computational barriers that hinder analyses. Computer scientists will not have to “reinvent” file formats when developing new computational tools or interfaces to meet scalability demands, and biologists will have new programs with the improved performance needed to work with larger datasets.
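To give a flavor of the hierarchical model HDF provides, here is a small sketch using the h5py Python bindings for HDF5. The group and dataset layout below is invented for illustration and is not the BioHDF schema; real layouts would use variable-length types and chunked, compressed datasets rather than tiny fixed-length arrays:

```python
import h5py
import numpy as np

# Hypothetical layout: a run is a group, with typed datasets for reads and
# quality scores and metadata stored as attributes - all self-describing,
# in one portable binary file instead of a tangle of flat text files.
with h5py.File("reads.h5", "w") as f:
    run = f.create_group("runs/run_001")
    run.attrs["platform"] = "Illumina GA"
    run.create_dataset("reads", data=np.array([b"ACGT", b"TTGC"]))
    run.create_dataset("qualities",
                       data=np.array([[30, 31, 32, 33],
                                      [28, 29, 30, 31]], dtype=np.uint8))

# Any HDF5-aware tool can navigate the same file by path, like a file system.
with h5py.File("reads.h5", "r") as f:
    platform = f["runs/run_001"].attrs["platform"]
    first_read = f["runs/run_001/reads"][0]
```

Because the format is self-describing, a new tool can discover the groups, datasets, and attributes at runtime instead of depending on out-of-band documentation of a flat-file layout.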
In the next post, I'll present the case for HDF.
Reference: HDF-EOS Tools and Information Center, http://hdfeos.org
Get the poster: NextGenHDF.pdf