Monday, March 31, 2008

Next Gen, Next Step

Congratulations! You just got approval to purchase your next generation sequencer! What are you going to do next?

Today, there is a lot being written about the data deluge accompanying Next Gen sequencers. It's true, they produce a lot of data. But even more important are the questions about how you plan to set up the lab and data workflows to turn those precious samples into meaningful information. The IT problems, while significant, are only the tip of the iceberg. If you operate a single lab, you will need to think about your experiments, how to track your samples, how to prepare DNA for analysis, how to move the data around for analysis, and how to do your analyses to get meaningful information out of the data. If you operate a core lab, you have all the same problems, but you're providing that service for a whole community of scientists. You'll need to keep their samples and data separated and secure. You also have to figure out how to get the data to your customers and how you might help them with their analyses.

Never mind that you need multi terabytes of storage and a computer cluster. Without a plan and strategy for running your lab, organizing the data, running multistep analysis procedures, and sifting through 100's of thousands of alignments, you'll just end up with a piece of lab art: a Next Gen sequencer, a big storage system and a computer cluster. (By the way, have you found a place for this yet?) It may look nice, but that's probably not what you had in mind.

To get the most of out of your investment, you'll need to think about workflows, and how to manage those workflows.

The cool thing about Next Gen technology are the kinds of questions that can be asked with the data. This requires both novel ways to work with DNA and RNA and novel ways to work with the data. We call those procedures "workflows." Simply put, a workflow describes a multistep procedure and its decision points. In each step, we work with materials and the materials may be "transformed" in the step. You can also describe a workflow as a series of steps that have inputs and outputs. Workflows are run both in the lab and on the computer.

In a protocol for isolating DNA , we can take tissue (the input) lyse the cells with detergent, bind the DNA to a resin, wash away junk, and elute purified DNA (the output). The purified DNA may then become an input to a next step, like PCR, to create an output, like a collection of amplicons. Similar processes can be used with RNA. In a Next Gen lab workflow, you fragment the DNA, ligate adaptors, and use the adaptors to attach DNA to beads or regions of a slide. From a few basic lab workflows, we can prepare genetic material for whole genome analysis, expression analysis, variation analysis, gene regulation, and other experiments in both discovery, and diagnostic assays.

In a software workflow, data are the material. Input data, typically packaged in files, are processed by programs to create output data. These data or information can also be packaged in files or even stored in databases. Software programs execute the steps and scripts often automate series of steps. Digital photography, multimedia production, and business processes all have workflows. So does bioinformatics. The difference is that bioinformatics workflows lack standards so many people work harder than needed and spend a lot of time debugging things.

As the scale increases, the lab and analysis workflows must be managed together.

A common laboratory practice has been to collect the data, and then analyze the data in separate independent steps. Lab work is often tracked on paper, in Excel spreadsheets, or in a LIMS (Laboratory Information Management System). The linkage between lab processes, raw data, and final results, is typically poor. In small projects, this is manageable. File naming conventions can track details and computer directories (folders) can be used to organize data files. But as the scale grows, the file names get longer and longer, people spend considerable time moving and renaming data, the data start to get mixed up, become harder to find, and for some reason files start to replicate themselves. Now, the lab investigates tracking problems and lost data, instead of doing experiments.

Why? Because the lab and data analysis systems are disconnected.

The good news is that Geospiza Finch products can link your lab procedures and the data handling procedures to create complete workflows for genetic analysis.

No comments: