FinchTalk: February 2008

Tuesday, February 26, 2008

The case for HDF

As we contemplate working with Next Gen data, we need to think about how we are going to store data and information efficiently. You might ask, what does that mean? Come on, a file is a file, isn't it?

Wrong

When one considers ways to work with data on a computer, two problems must be solved. The first is how to structure the data. This is also known as defining the data model or format. The second is how to implement the data model. The data model describes data entities, the attributes of an entity (time, length, type), and the relationships between entities. The implementation is how the software programs and users will interact with the data (text files, binary files, relational databases, object databases, Excel, ...) to access data, perform calculations, and create information. Almost any kind of implementation can be used to solve any kind of problem, but in general, each type of problem has a limited number of optimal implementations. The factors that affect implementation choices include ease of use, scalability (time, space, and complexity), and application requirements (reads, writes, data persistence, updates ...). The rapid increases in the volumes of data being collected with current and future Next Gen sequencing technologies create significant data scalability and complexity issues, and solutions like HDF become attractive. To understand why, let's first look at some of the common data handling methods and discuss their advantages and disadvantages for working with data.

Text is easy
The most common implementation is text. Text formats dominate bioinformatics applications, for good reason. Text is human readable, can represent numbers as accurately as may be desired, and is compatible with common utilities (e.g., grep), text editors, and languages like Perl. However, using text for high volume, complex data is inefficient and problematic. Computations must first convert data from text to binary, and, compared to binary data, it takes nearly three times as much space to represent an integer using text. Further, text files cannot represent complex data structures and relationships in ways that are easy to navigate computationally. Since scientific data are hierarchical in nature, systems that rely on text-based files often have multiple redundant methods for storing information. Text files lack random access; an object in a text file can be found only by reading the file from the beginning. Hence, almost all text file applications require that the entire file be read into memory, which seriously limits the practical size that a text file can be. Finally, the ease of creating new text-based formats leads to a number of obscure formats that proliferate rapidly. The result is an interoperability nightmare in which translators that can import data from other applications must accompany almost every application.
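To make the space comparison concrete, here is a minimal Python sketch. The values are made up for illustration (ten-digit integers stored one per line), and the exact ratio depends on how many digits your numbers have and how they are delimited.

```python
import struct

# Hypothetical example values: ten-digit integers (not taken from real data).
values = [1234567890, 2000000000, 1999999999]

# Text: one decimal number per line, as in a typical flat file.
text_bytes = "".join(f"{v}\n" for v in values).encode("ascii")

# Binary: each value packed as a fixed-width 4-byte little-endian integer.
binary_bytes = struct.pack(f"<{len(values)}i", *values)

print(len(text_bytes))    # 33 bytes: 10 digits plus a newline per value
print(len(binary_bytes))  # 12 bytes: 4 bytes per value
print(len(text_bytes) / len(binary_bytes))  # 2.75

# A program must also parse the text back into numbers before computing.
parsed = [int(line) for line in text_bytes.decode("ascii").split()]
unpacked = list(struct.unpack(f"<{len(values)}i", binary_bytes))
```

Smaller numbers narrow the gap, so treat the ratio as a rough guide and not a constant.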

XML must be better
The next step up in text is XML. XML is gaining popularity as a powerful data description language that uses standard tags to define the structure and the content of a file. XML can be used to store data, is very good at describing data and relationships, and works well if the amount of data is small. However, since XML is text-based, it has the same shortcomings as text files and is unsuitable for large data volumes.

How about a database?
For many bioinformatics applications, commercial and open-source database management systems (DBMSs) provide easy and efficient access to data; scientific tools submit declarative queries to the DBMS, which optimizes and executes the queries. Although DBMSs are excellent for many uses, they are often not sufficient for applications involving complex high volume data. Traditional DBMSs require significant performance tuning for scientific applications, a task that scientists are neither well prepared for, nor eager to address. Since most scientific data are write-once, and read-only thereafter, traditional relational concurrency control and logging mechanisms can add substantially to computing overhead. High-end database software, such as software for parallel systems, is especially expensive, and is not available for the leading-edge parallel machines popular in scientific computing centers. Most importantly, the complex data structures and data types needed to represent the data and relationships are not supported well in most databases.

Let's go Bi
Binary formats are efficient for random data reading and writing, computation, storing numeric values, and data organization. These formats are used to store a significant amount of the data generated by analytical instruments and data created by desktop applications. Many of these formats, however, are either proprietary or not publicly documented, limiting access to the data. This problem is often addressed in an unsatisfactory way through time-consuming and error-prone reverse engineering efforts that can violate license agreements.
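As a rough illustration of random access, the sketch below stores fixed-width binary records and jumps directly to one of them. The record layout (a single 4-byte integer) and the helper names are assumptions made for the example; real instrument formats are far more involved.

```python
import io
import struct

# Assumed layout for illustration: one 4-byte little-endian integer per record.
RECORD = struct.Struct("<i")

def write_records(stream, values):
    """Append each value to the stream as one fixed-width record."""
    for v in values:
        stream.write(RECORD.pack(v))

def read_record(stream, index):
    """Jump straight to record `index` without reading earlier records."""
    stream.seek(index * RECORD.size)
    return RECORD.unpack(stream.read(RECORD.size))[0]

# io.BytesIO stands in for a file opened in binary mode.
buf = io.BytesIO()
write_records(buf, [10, 20, 30, 40, 50])
print(read_record(buf, 3))  # 40
```

Because every record has the same size, the position of record N is simple arithmetic. A text file with variable-length lines offers no such shortcut unless a separate index is built.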

Next Gen needs a different approach
Indeed, the Next Gen community is arriving at the need to use binary file formats to improve data storage, access, and computational efficiency. The Short (Sequence) Read Format (SRF) is an example where a data format and binary file implementation are being used to develop efficient read storing methods. Popular algorithms such as Maq are also converting text files to binary files prior to computation to improve data access efficiency.

Wouldn't it be nice if we had a common binary format to implement our data models?

Why HDF?
For the reasons outlined above, Geospiza felt it would be worthwhile to explore general purpose, open source, binary file storage technologies, and looked to other scientific communities to learn how similar problems were being addressed. That search identified HDF (hierarchical data format) as a candidate technology. Initially developed in 1988 for storing scientific data, HDF is well established in many scientific fields and bioinformatics applications utilizing HDF can benefit from its long history and an infrastructure of existing tools.

Geospiza, together with The HDF Group, conducted a feasibility study to examine whether or not HDF would be helpful for addressing the data management problems posed by large volumes of complex biological data. The first test case looked at both volume and complexity by working with a large-volume, highly complex system for DNA sequencing-based SNP discovery. Through this study, HDF's strengths and data organization features (groups, sets, multidimensional arrays, transformations, linking objects, and general data storage for other binary data types and images) were evaluated to determine how well these features would handle SNP data. In addition to the proposed feasibility project with SNP discovery, other test cases were added, in collaboration with the NCBI, to test the ability of HDF to handle extremely large datasets. These addressed working with HapMap data and performing chromosomal scale LD (Linkage Disequilibrium) calculations.

That story is next ...

Sunday, February 24, 2008

Next Gen Sequencing Software

In my last post, I indicated that the next generation (Next Gen) of DNA sequencers was creating a lot of excitement in DNA sequencing. In the next couple of posts I want to share some of our plans for supporting Next Gen by discussing the poster that we presented at the AGBT and ABRF conferences.

The general goal of the poster was to share our thoughts on how Next Gen data will have to be dealt with in order to develop more scalable and interoperable data processing software. It presented our work with HDF (hierarchical data format) technology and how that fits into Geospiza's plans for meeting Next Gen data management challenges. The first phase, now complete, provides our customers with a solution that links samples to their data and puts in place the foundation needed for the second phase, which focuses on developing and integrating the scientific data processing applications that will make sense of the data.

Once a lab is up and running with Next Gen technology, they quickly face the data management problem. Basic file system technology and file servers allow groups to store their data in nested directory structures. After a few runs, however, people realize that it gets really hard to know what data go with which run or with which sample - the Excel file storing that information gets lost, or the README file didn't get written. The situation becomes even worse when Next Gen instruments are run in the context of a core lab. Now the problem is exacerbated because you need to make the data available to your customers. Do you set up an FTP site? Or, do you make unix accounts on the file server for your end users? Or, do you deliver data on firewire drives or multigigabyte flash drives? Or do you just do the work for your client and hope that they do not want to reanalyze their data?

Geospiza has solved the first part of the problem. Our new product FinchLab Next Gen Edition allows labs to track sample preparation (DNA library construction, reagent kit tracking, and workflow organization) and link data to runs and samples. FinchLab Next Gen Edition also provides interfaces so that core labs can create a variety of order forms for any kind of service to link data to runs, samples, and orders making data accessible to outside parties (customers) through the FinchLab web browser user interface. And, all of this can be done without any custom programming services. Over the next few weeks, I'll fill in the details on how we can do that. For now, I'll focus on the poster, but make a final important point that FinchLab Next Gen Edition not only interoperates with all of the current Next Gen instruments, it also allows labs to integrate these data with current Sanger sequencing technologies in a common web interface.

With sample tracking and data management under control, the next challenge becomes what to do with the data that are collected. The scientific community is just at the beginning of this journey. In the past, bioinformatics efforts have emphasized algorithm (tools) development over the implementation details associated with data management, organization, and intuitive user interfaces. The result is software systems, built from point solutions, that do not adequately address problems outside of “expert” organizations. If a scientist wants to work with sequence data to understand a biological research problem, they must overcome the challenges of moving data between complex software programs, reformatting unstructured text files, traversing web sites, and writing programs and scripts to mine new output files before they can even begin to gain insights into their problem of interest. While formats and standards have always been discussed and debated, many people working with Next Gen data understand that the "single point" solution approaches of the past will not scale to today's problems.

That's where HDF fits in. It is clear to the community that new software systems with scalable data models, APIs, software tools, and viewers are needed to support the very large datasets being created by new technology. Geospiza is working with The HDF Group (THG, www.hdfgroup.org) to deliver these capabilities by building on a recognized technology that has proven its ability to meet similar scalability demands in other areas of science [1]. We call the extensible domain-specific data technologies that will be built "BioHDF." BioHDF will provide the DNA sequencing community with a standardized infrastructure to support high-throughput data collection and analysis, and will engage an informatics group (THG) that is highly experienced in large-scale data management issues. The technology will make it possible to overcome current computational barriers that hinder analyses. Computer scientists will not have to “reinvent” file formats when developing new computational tools or interfaces to meet scalability demands, and biologists will have new programs with the improved performance needed to work with larger datasets.

In the next post, I'll present the case for HDF.

Reference: [1] HDF-EOS "HDF-EOS Tools and Information Center," http://hdfeos.org

Get the poster: NextGenHDF.pdf

Thursday, February 21, 2008

New Finch Best Practices Guides Available!

The Geospiza Support Team has released 2 more Best Practice Guides!

These guides are created for the IT professionals and system administrators responsible for your Finch server.

The 2 new guides are:

1) Stopping and Starting your Finch Server's Services. This guide reviews the proper way to start and stop your Finch server's services and discusses some known problems that you might encounter. It's a great reference tool if your server isn't responding to http, or if you're writing documentation for backup procedures.

2) The Finch Suite Installation and Configuration Guide: This document is aimed at fresh installations of Finch; however, it should serve as an excellent reference guide for your existing server as well.

You can download these documents from the Online Support Documentation website under the 'Best Practice Guides' section, located at the bottom of the page.

Saturday, February 16, 2008

Entering information in iFinch via FinchTV, part I

Teaching is a hard habit to break so I teach short courses now and then.

This year, I've been having my students use FinchTV to enter their blast results into iFinch. This works with FinchLab and other Finch systems, too.

This has been pretty helpful. The data get stored for each chromatogram and we can all view the results (I'll address this part in a later post.)

How does this work?

1. Log in to your Finch account. Open a chromatogram in FinchTV either by clicking the FinchTV icon or the link from the Chromat Read page that says Open in FinchTV.

2. When you're ready to enter information, click the Commit button (outlined below).

3. You'll see a message appear asking if you're sure. Say "yes."

4. Enter the information that you want to store. Since we were using FinchTV to connect to NCBI blast and identify our bacteria, I'm entering the conclusion from my blast results.

5. Then I click the "OK" button.

6. If I refresh my web browser page, I can see that the version number for my read is now at "2", and I can see that my information has been stored in the database.

In a later post, I'll show how we get that information out.

Stay tuned...

Thursday, February 14, 2008

Conference Blogging from Marco Island and ABRF

What a week! I've just returned from the AGBT and ABRF meetings in Marco Island, Florida and Salt Lake City, Utah. I can say the next generation of DNA sequencers is driving a renaissance in DNA sequencing. In the past, people would ask me if DNA sequencing was going to be replaced by another technology. I always thought, what a silly thing to say; many of the other technologies are used because sequencing was expensive, and I heard more than one microarray talk start with the "if I could have sequenced, I would have" apology. After all, the first thing someone wants when they have some sequence data is more sequence data.

Next Gen is the evidence that this statement is true.

At AGBT, the big three (Applied Biosystems, Illumina, and Roche) had huge presences. Helicos, The Polonator, and Pacific Biosciences were there to provide alternatives and demonstrate that the next "next gen" technologies are just around the corner. At ABRF, the theme continued with more AB, Illumina, and Roche presentations, as well as Visigen and other groups and companies presenting sequencing, sample preparation, and other technologies to make systems more sensitive and keep us on the path to the $1000 genome. All of the talks had this common message: large sequence data sets are extremely powerful, but only if you can deal with the informatics.

That's where we can help!

At ABRF we demonstrated the third version of our Finch software platform with the first integrated and complete LIMS product for all Next Gen sequencing platforms. We call this FinchLab Next Gen Edition. Everyone who came by our booth was blown away. At AGBT we presented our work with BioHDF. In the coming weeks we will write articles about FinchLab, the FinchLab Next Gen Edition, and BioHDF.

Stay tuned...

Wednesday, February 13, 2008

Geospiza and Isilon Systems find room for Next Gen sequencing data

Some of you were probably wondering where we were going to put all of that Next Generation Sequencing data. We found a place. By working with Isilon Systems, we get to combine their strength in data storage with our expertise in data management. It's a wonderful match!

From the press release:

Geospiza Announces OEM Agreement with Isilon Systems to Integrate Isilon X-Series Clustered Storage Systems with FinchLab Next Gen Edition Software Product

New Relationship Will Deliver High Performance, Integrated Product for Genetic Analysis

Seattle, WA February 11, 2008 — Geospiza, a leading provider of fully integrated laboratory information management systems (LIMS) and data management infrastructure solutions for genetic analysis, today announced it has entered into an OEM agreement with Isilon® Systems, the leader in clustered storage, to link the Isilon X-Series clustered storage system with FinchLab Next Gen Edition software product. The new relationship will lead to the availability of a complete IT Infrastructure Solution for Genetic Analysis.

The integrated FinchLab Next Gen Edition product delivers a comprehensive solution for laboratory operators and end users of Genetic Analysis information that includes data management systems to define experiments, and the ability to track data through production, and process genetic analysis platforms in a scaleable high capacity storage system.

“Our customers are rapidly adopting Next Generation Sequencing systems,” said Rob Arnold, President, Geospiza. “With a 1500 fold increase in data production over CE per run, new tools are required to manage the enormous volume of data produced by Next Gen Sequencing. We are delighted to be working with Isilon to deliver a platform specifically designed to scale with the lab as their needs grow.”

“Isilon is very excited to be working with an industry leader like Geospiza,” said Tony Regier, Vice President of Global Sales Partners, Isilon Systems. “We believe our combined offering will deliver productivity improvements to laboratory operators so end users can focus more time on the science and less on the complexities of IT administration.”

The companies have already started working with early access customers and plan to release the combined product in April 2008.

Monday, February 11, 2008

iFinch in education: metagenomics with JHU, part I.

iFinch is the perfect bioinformatics tool to accompany a class. I used it Fall quarter in a class that I teach at Shoreline Community College (Washington) and I'm using it right now in an on-line class that I teach at Austin Community College (Texas).

We cover several different topics in the class, but I have a fondness for long projects where we can use multiple techniques and tie everything to a common theme.

This semester we're working with bacterial sequences that were obtained from students at Johns Hopkins University. I've been collaborating with an instructor there for several years and now we have four years of data to sink our teeth into.

This video describes the first part of the project that we're working on.

JHU bacterial metagenomics project from Sandra Porter on Vimeo.

Using the Finch Q >20 plots to evaluate your data

All of the Finch systems (Solutions Finch, FinchLab, and iFinch) have a folder report with visual snapshots that summarize the quality of data in that folder. The Q20 histogram plot is one of those tools, and in these next two posts, I'll describe what we can learn from these plots.

First, we'll talk about the values on the x axis. When we use the term "Q > 20 bases," we're referring to the number of bases in a read that have a quality value greater than 20. If a base has a quality value of 20, there is a 1 in 100 chance that the base has been misidentified. We use the Q20 value to mark a threshold point where a base has an acceptable quality value.
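For readers who like to see the arithmetic, here is a small Python sketch. It assumes the quality values follow the standard Phred scale, where Q = -10 * log10(P) and P is the chance that the base call is wrong. The function names and the quality values in the example read are made up for illustration; this is not Finch's code.

```python
def error_probability(q):
    """Chance that a base call is wrong, assuming a Phred-scaled quality value."""
    return 10 ** (-q / 10)

def count_good_bases(qualities, threshold=20):
    """Count the bases whose quality value is greater than the threshold."""
    return sum(1 for q in qualities if q > threshold)

# Made-up quality values for one short read.
read_qualities = [8, 15, 20, 21, 35, 40, 52, 19]

print(error_probability(20))             # about 0.01, i.e. 1 in 100
print(error_probability(30))             # about 0.001, i.e. 1 in 1000
print(count_good_bases(read_qualities))  # 4
```

Note that a base with a quality value of exactly 20 is not counted here, since the plot's label refers to values greater than 20.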

Histogram plots work by consolidating data that fit into a certain range. In the graph above, you can see that on the x axis, we show groups of reads. The first group contains reads that have fewer than 50 good (Q > 20) bases. The next group contains reads that have between 50 and 99 good bases, the next 100 to 149, and so on.

On the y axis, we show the number of reads that fall into each group. In this graph, we have almost 30 reads that have over 950 good quality bases.
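The grouping behind a plot like this can be sketched in a few lines of Python. This is only an illustration with made-up numbers and a hypothetical function name, assuming bins that are 50 bases wide as described above; it is not how Finch builds its reports.

```python
from collections import Counter

def q20_histogram(good_base_counts, width=50):
    """Map the start of each bin to the number of reads that fall in it."""
    return Counter((n // width) * width for n in good_base_counts)

# Made-up counts of good (Q > 20) bases, one number per read.
counts = [12, 49, 50, 99, 100, 962, 975]

hist = q20_histogram(counts)
for start in sorted(hist):
    print(f"{start}-{start + 49}: {hist[start]} reads")
```

With these example numbers, two reads land in the 0-49 group, two in 50-99, one in 100-149, and two in 950-999.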

Uhmm, uhmm, uhhmmm, good sequence data, just the stuff I like to see.

Thursday, February 7, 2008

Geospiza, ABI, and Next Generation Sequencing

One of the chief complaints we hear in articles on Next Generation sequencing technologies concerns the lack of bioinformatics support on the part of the Next Gen vendors. Lab directors have described feeling like they've been left on their own when it comes to solutions for handling the enormous amount of data produced by Next Gen instruments. I suspect that the people at Applied Biosystems must read the same articles.

This morning, Applied Biosystems issued a response to those complaints. Today's press release describes how ABI and Geospiza are working together to address the Next Generation problem of managing Next Generation sequencing data.

That's right. We've got a Next Generation FinchLab system that's ready to go AND it supports ABI's SOLiD platform. You can even try it out if you're attending the ABRF conference in Utah next week. (See us in booth 210, pass it on!)

Now, I'm going to quote the best parts from the press release and interject just a little bit of commentary:

Geospiza Offers IT Infrastructure Solution
As part of Geospiza’s participation in Applied Biosystems’ Software Development Community, Geospiza developed a software system designed to automate sequencing workflows for capillary electrophoresis (CE) instrumentation. Applied Biosystems’ agreement with Geospiza is expected to extend Geospiza’s Finch Suite® software and its knowledge of 3130 Series Genetic Analyzers and 3730 Series DNA Analyzers to SOLiD System-specific IT infrastructure support, software and tools with its new FinchLab™ Next Gen Edition product.

As life scientists expand their research capabilities by using the SOLiD System, Geospiza expects to support these customers by processing both CE and SOLiD data through a single data processing pipeline, which will enable them to integrate and visualize the two data sets from both technologies. This integrated solution is expected to enable researchers to utilize both the SOLiD System for discovery applications and CE systems for validation, connecting the research continuum.

[snip]

Yes, that's right. The Next Gen Edition of FinchLab handles both Next Gen and Sanger sequencing. You can use the same system to manage both kinds of data. Cool, huh?

[snip]

This development is expected to benefit many organizations, including the University of Washington, which is a user of the FinchLab software and an early adopter of the SOLiD System. Scientists in the university’s High Throughput Sequencing Solutions Laboratory recognize the challenges associated with the vast amount of data being generated from next-generation sequencing systems.

“As a Geospiza FinchLab customer and a laboratory that has acquired a SOLiD System sequencer, any collaborative effort between Geospiza and Applied Biosystems to help laboratory directors meet the coming next-generation sequencer data management and analytics challenges would be a welcome relationship for the research community,” said Dr. Michael Dorschner, the director of the University of Washington’s High Throughput Sequencing Solutions Laboratory.

Geospiza plans to deepen the integration between its FinchLab Next Gen Edition software product and the SOLiD System to provide a comprehensive solution that includes data management systems to define experiments, and the ability to track data through production, and process genetic analysis platforms in a scaleable high capacity storage system.

“By expanding our long-standing relationship with Applied Biosystems, we are answering the call for bioinformatics solutions to help accelerate research on next-generation genomics analysis platforms,” said Rob Arnold, President, Geospiza. “Laboratory directors and end users should benefit equally from the more natural workflow as they switch between the software and the instrument for data generation, management and analysis. Our platform is specifically designed to scale with the lab as their needs grow.”

If you want to read the rest, you can find it here. If you want to learn more about surviving the Next Gen data onslaught, send us an e-mail or give us a call.