FinchTalk: May 2008

Monday, May 26, 2008

Finch 3: Managing Workflows

Genetic analysis workflows begin with RNA or DNA samples and end with results. In between, multiple lab procedures and steps are used to transform materials, move samples between containers, and collect the data. Each kind of data collected and each data collection platform requires that different laboratory procedures are followed. When we analyze the procedures, we can identify common elements. A large number of unique workflows can be created by assembling these elements in different ways.

In the last post, we learned about the FinchLab order form builder and some of its features for developing different kinds of interfaces for entering sample information. Three factors contribute to the power of Finch orders. First, labs can create unique entry forms by selecting items like pull-down menus, check boxes, radio buttons, and text entry fields for numbers or text from a web page. No programming is needed. Second, for core labs with business needs, the form fields can be linked to diverse price lists. Third, and the subject of this post, the forms are linked to different kinds of workflows.

What are Workflows?

A workflow is a series of steps that must be performed to complete a task. In genetic analysis, there are two kinds of workflows: those that involve laboratory work, and those that involve data processing and analysis. The laboratory workflows prepare sample materials so that data can be collected. For example, in gene expression studies, RNA is extracted from a source material (cells, tissue, bacteria) and converted to cDNA for sequencing. The workflow steps may involve purification, quality analysis on agarose gels, concentration measurements, and reactions where materials are further prepared for additional steps.

The data workflows encompass all the steps involved in tracking, processing, managing, and analyzing data. Sequence data are processed by programs to create assemblies and alignments that are edited or interrogated to create genomic sequences, discover variation, understand gene expression, or perform other activities. Other kinds of data workflows, such as microarray analysis or genotyping, involve developing and comparing data sets to gain insights. Data workflows involve file manipulations, program control, and databases. The challenge for the scientist today, and the focus of Geospiza's software development, is to bring the laboratory and data workflows together.

Workflow Systems

Workflows can be managed or unmanaged. Whether you work at the bench or work with files and software, you use a workflow any time you carry out a procedure with more than one step. Perhaps you write the steps in your notebook, check them off as you go, and tape in additional data like spectrophotometer readings or photos. Perhaps you write papers in Word and format the bibliography with Endnote, or resize photos with Photoshop before adding them to a blog post. In all these cases you are performing unmanaged workflows.

Managing and tracking workflows becomes important as the number of activities and number of individuals performing them increase in scale. Imagine your lab bench procedures performed multiple times a day with different individuals operating particular steps. This scenario occurs in core labs that perform the same set of processes over and over again. You can still track steps on paper, but it's not long before the system becomes difficult to manage. It takes too much time to write and compile all of the notes, and it's hard to know which materials have reached which step. Once a system goes beyond the work of a single person, paper notes quit providing the right kinds of overviews. You now need to manage your workflows and track them with a software system.

A good workflow system allows you to define the steps in your protocols. It will provide interfaces to move samples through the steps and also provide ways to add information to the system as steps are completed. If the system is well-designed, it will not allow you to do things at inappropriate times or require too much "thinking" as the system is operated. A well-designed system will also reduce complexity and allow you to build workflows through software interfaces. Good systems give scientists the ability to manage their work; they do not require their users to learn arcane programming tools or resort to custom programming. Finally, the system will be flexible enough to let you create as many workflows as you need for different kinds of experiments and link those workflows to data entry forms so that the right kind of information is available to the right process.
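To make the "not at inappropriate times" idea concrete, here is a minimal Python sketch of a tracked sample that only accepts the next step in its protocol. This is an illustration, not how FinchLab is implemented; the class name, method names, and the example steps are all made up for the purpose.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrackedSample:
    """A sample moving through an ordered list of lab steps (hypothetical model)."""
    name: str
    steps: List[str]
    completed: List[str] = field(default_factory=list)
    notes: Dict[str, str] = field(default_factory=dict)

    def next_step(self) -> Optional[str]:
        """Return the next step to perform, or None when the protocol is finished."""
        if len(self.completed) < len(self.steps):
            return self.steps[len(self.completed)]
        return None

    def complete(self, step: str, note: str = "") -> None:
        """Record a finished step and refuse any step that is out of order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"cannot do {step!r} now; expected {expected!r}")
        self.completed.append(step)
        if note:
            # Extra information (a reading, a gel result) is stored with the step.
            self.notes[step] = note


# Example steps are invented for illustration.
sample = TrackedSample("RNA-001", ["extract RNA", "check quality", "make cDNA"])
sample.complete("extract RNA", note="spectrophotometer reading recorded")
print(sample.next_step())
```

Because the sample itself knows what comes next, the person at the bench does not have to remember, and a skipped step is caught when it happens rather than weeks later.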

FinchLab Workflows

The Geospiza FinchLab workflow system meets the above requirements. The system has a high-level workflow that understands that some processes require little tracking (a quick test) and others require more significant tracking ("I want to store and reuse DNA samples"). More detailed processes are assigned workflows that consist of three parts: a name, a "State," and a "Status." The "State" controls the software interfaces and determines which information is presented and accessed at different parts of a process. A sequencing or genotyping reaction, for example, cannot be added to a data collection instrument until it is "ready." The "Status" part specifies the steps of the process. These steps (Statuses) are defined by the lab and added to a workflow using the web interfaces. When a workflow is created, it is given a name, as many steps as needed, and a State. The workflows are then assigned to different kinds of items so that the system always knows what to do next with the samples that enter.

A workflow management system like FinchLab makes it just as easy to track the steps of Sanger DNA sequencing as it is to track the steps of a Solexa, SOLiD, or 454 sequencing process. You can also, in the same system, run genotyping assays and other kinds of genetic analysis like microarrays and bead assays.
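The name, State, and Status idea can be sketched in a few lines of Python. This is only one way to read the description above, not FinchLab's actual data model: the class names, the "in preparation" State, and the example Statuses are assumptions. Only the "ready" State comes from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Workflow:
    name: str
    state: str                  # controls what the interface allows, e.g. "ready"
    statuses: Tuple[str, ...]   # lab-defined steps, in order


@dataclass
class Item:
    label: str
    workflow: Workflow
    status_index: int = 0

    @property
    def status(self) -> str:
        """The lab-defined step this item is currently at."""
        return self.workflow.statuses[self.status_index]

    def advance(self) -> None:
        """Move to the next Status within the current workflow."""
        if self.status_index + 1 >= len(self.workflow.statuses):
            raise ValueError(f"{self.label} is already at the last status")
        self.status_index += 1

    def assign(self, workflow: Workflow) -> None:
        """Hand the item to another workflow, starting at its first Status."""
        self.workflow = workflow
        self.status_index = 0


def add_to_run(run: List[str], item: Item) -> None:
    """Only items whose workflow State is "ready" may join an instrument run."""
    if item.workflow.state != "ready":
        raise ValueError(f"{item.label} is {item.workflow.state!r}, not 'ready'")
    run.append(item.label)


# Hypothetical workflows; the names and Statuses are invented for illustration.
prep = Workflow("Library prep", "in preparation", ("received", "sheared", "quantified"))
sequencing = Workflow("Sequencing", "ready", ("queued", "loaded"))

item = Item("sample-17", prep)
item.advance()                  # received -> sheared
item.assign(sequencing)         # now in a "ready" workflow
run: List[str] = []
add_to_run(run, item)
```

The point of separating the two is that the lab can rename or add Statuses freely, while the State stays a small fixed vocabulary that the software can act on.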

Next time, we'll talk about what happens in the lab.

Tuesday, May 20, 2008

Finch 3: Defining the Experimental Information

In today's genetic analysis laboratory, multiple instruments are used to collect a variety of data ranging from DNA sequences to individual values that measure DNA (or RNA) hybridization, nucleotide incorporations, or other binding events. Next Gen sequencing adds to this complexity and offers additional challenges with the amount of data that can be produced for a given experiment.

In the last post, I defined basic requirements for a complete laboratory and data management system in the context of setting up a Next Gen sequencing lab. To review, I stated that laboratory workflow systems need to perform the following basic functions:

Allow you to set up different interfaces to collect experimental information
Assign specific workflows to experiments
Track the workflow steps in the laboratory
Prepare samples for data collection runs
Link data from the runs back to the original samples
Process data according to the needs of the experiment

I also added that if you operate a core lab, you'll want to bill for your services and get paid.

In this post I'm going to focus on the first step, collecting experimental information. For this exercise let's say we work in a lab that has:

One Illumina Solexa Genome Analyzer
One Applied Biosystems SOLiD System
One Illumina Bead Array station
Two Applied Biosystems 3730 Genetic Analyzers, used for both sequencing and fragment analysis

This image shows our laboratory home page. We run our lab as a service lab. For each data collection platform we need to collect different kinds of sample information. One kind of information is the sample container. Our customers' samples will be sent to the lab in many different kinds of containers depending on the kind of experiment. Next Gen sequencing platforms like SOLiD, Solexa, and 454 are low throughput with respect to sample preparation, so samples will be sent to us in tubes. Instruments like the Bead Array and the 3730 DNA sequencing instrument usually involve sets of samples in 96- or 384-well plates. In some cases, samples start in tubes and end up in plates, so you'll need to determine which procedures use tubes and which use plates and how the samples will enter the lab.

Once the samples have reached the lab and been checked, you are also going to do different things to them to prepare them for the different data collection platforms. You'll want to know which samples should go to which platforms and have the workflows for different processes defined so that they are easy to follow and track. You might even want to track and reuse certain custom reagents like DNA primers, probes, and reagent kits. In some cases you'll want to know physical information, like sample type (DNA or RNA) or concentration, up front. In other cases you'll determine this information later.

Finally, let's say you work at an institution that focuses on a specific area of research, like cancer, or mouse genetics, or plant research. In these settings you might also want to track information about the sample source. Such information could include species, strain, tissue, treatment, or many other kinds of things. If you want to explore this information later, you'll probably want to define a vocabulary that can be "read" by computer programs. To ensure that the vocabulary is followed, interfaces will be needed to enter this information without typing, or else you'll end up with spelling variants like pseudomonas, psuedomonas, or psudomonas.
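The spelling problem is easy to show in code. The sketch below is a small Python illustration of a controlled vocabulary: a value is accepted only if it is one of the defined terms, which is what a pull-down menu enforces on the entry side. The vocabulary contents and function names are invented for the example. The standard library's `difflib.get_close_matches` is used to suggest what the person probably meant.

```python
import difflib
from typing import List, Sequence

# Hypothetical vocabulary; a real lab would define its own terms.
SPECIES_VOCABULARY = ("Arabidopsis", "Escherichia", "Mus", "Pseudomonas")


def suggest(term: str, vocabulary: Sequence[str] = SPECIES_VOCABULARY) -> List[str]:
    """Return vocabulary terms that look similar to a free-typed term."""
    return difflib.get_close_matches(term, vocabulary, n=3, cutoff=0.6)


def check_term(term: str, vocabulary: Sequence[str] = SPECIES_VOCABULARY) -> str:
    """Accept a term only if it is in the vocabulary; otherwise explain why not."""
    if term in vocabulary:
        return term
    raise ValueError(
        f"{term!r} is not in the vocabulary; close matches: {suggest(term, vocabulary)}"
    )


check_term("Pseudomonas")           # accepted as-is
try:
    check_term("psuedomonas")       # rejected, with a suggestion in the message
except ValueError as error:
    print(error)
```

A check like this catches bad values after the fact. Offering only the vocabulary terms in a menu avoids the problem before anyone types anything.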

Information systems that support the above scenarios have to deal with a lot of "sometimes this" and "sometimes that" kinds of information. If one path is taken, say Sanger sequencing on a 3730, the sample information and physical configurations needed differ from those needed for Next Gen sequencing. Next Gen platforms have different sample requirements too. SOLiD and 454 require emulsion PCR to prepare sequencing samples, whereas Solexa amplifies DNA molecules on slides in clusters. Additionally, the information entry system has to deal with "I care" and "I don't care" kinds of data, like information about sample sources or experimental conditions. These kinds of information are needed later to understand the data in the context of the experiment, but do not have much impact on the data collection processes.

How would you create a system to support these diverse and changing requirements?

One way to do this would be to build a form with many fields and rules for filling it out. You know those kinds of forms. They say things like "ignore this section if you've filled out this other section." That would be a bad way to do this, because no one would really get things right, and the people tasked with doing the work would spend a lot of time either asking questions about what they are supposed to be doing with the samples or answering questions about how to fill out the form.

Another way would be to tell people that their work is too complex and they need custom solutions for everything they do. That's expensive.

A better way to do this would be to build a system for creating forms. In this system, different forms are created by the people who develop the different services. The forms are linked to workflows (lab procedures) that can understand sample configurations (plates, tubes, premixed ingredients, and required information). If the system is really good, you can easily create new forms and add fields to them to collect physical information (sample type, concentration) or experimental information (tissue, species, strain, treatment, your mother's maiden name, favorite vacation spot, ...) without having to develop requirements with programmers and have them build forms. If your system is exceptionally good, smart, and clever, it will let you create different kinds of forms and fields and prevent you from doing things that are in direct conflict with one another. If your system is modern, it will be 100% web-based and have cool web 2.0 features like automated fill downs, column highlighting, and multi-selection devices so that entering data is easy, intuitive, and even a bit fun.
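To picture what "a system for creating forms" means, here is a rough Python sketch in which a form is just data: a title, the workflow it feeds, a container type, and a list of fields. None of this is FinchLab code. The class names, field kinds, and the two conflict checks (duplicate field names, a choice field with no choices) are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Field:
    name: str
    kind: str                        # "text", "number", or "choice"
    choices: Tuple[str, ...] = ()    # only used when kind == "choice"
    required: bool = False


@dataclass
class OrderForm:
    title: str
    workflow: str                    # name of the lab workflow this form feeds
    container: str                   # "tube" or "plate"
    fields: List[Field] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.container not in ("tube", "plate"):
            raise ValueError(f"unknown container {self.container!r}")

    def add_field(self, new_field: Field) -> None:
        """Add a field, refusing definitions that conflict with the form."""
        if any(existing.name == new_field.name for existing in self.fields):
            raise ValueError(f"form already has a field named {new_field.name!r}")
        if new_field.kind == "choice" and not new_field.choices:
            raise ValueError("a choice field needs at least one choice")
        self.fields.append(new_field)

    def validate(self, entry: Dict[str, str]) -> List[str]:
        """Return a list of problems with one submitted entry (empty if none)."""
        problems = []
        for f in self.fields:
            value = entry.get(f.name, "")
            if value == "":
                if f.required:
                    problems.append(f"{f.name}: required")
                continue
            if f.kind == "number":
                try:
                    float(value)
                except ValueError:
                    problems.append(f"{f.name}: not a number")
            elif f.kind == "choice" and value not in f.choices:
                problems.append(f"{f.name}: not an allowed choice")
        return problems


# A hypothetical service form; names and choices are invented.
form = OrderForm("Solexa sequencing", workflow="Solexa library prep", container="tube")
form.add_field(Field("species", "choice", choices=("mouse", "human"), required=True))
form.add_field(Field("concentration", "number"))
print(form.validate({"species": "mouse", "concentration": "12.5"}))
```

A new service then means a new `OrderForm` definition, not a new round of requirements and programming.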

FinchLab, built on the Finch 3 platform, is such a system.

Tuesday, May 6, 2008

The Next Gen Sequencing Lab

Illumina's "Genome Center in a mailroom" message really captures the impact of next generation sequencing technology. Each Illumina Genome Analyzer, AB SOLiD instrument, or Roche Genome Sequencer (454) has a per-run capacity equal to a genome center's daily output. More importantly, this is possible because you can do your DNA prep work on a single lab bench. Of course, you'll have to find someplace to put the data.

In the old days (last year), if you wanted to collect data on a genome center scale, you not only had to have a large warehouse with hundreds of capillary electrophoresis genetic analyzers, you also had to have multiple large rooms that were devoted to sample preparation. In the largest genome centers, one full room is used to prepare DNA libraries, another is used to purify DNA templates, and finally a large space is needed to run the sequencing reactions (we're not even talking about media, autoclaves, and other support). Multiple robots are required to pick bacterial colonies, transfer liquids between 384-well plates, and aliquot purified DNA, primers, and enzyme/nucleotide cocktails. To support these activities, a small army of technicians works to set up the materials, move plates through the process, and load the instruments. This is all tracked by a custom LIMS (Laboratory Information Management System) and a team of developers who keep it running and develop tools to process the data.

With Next Gen sequencing all of this is replaced by a mailroom, laboratory bench, and a couple of people.

While you can make do with less space, fewer people, less robotics, and less custom LIMS development, you do need to track what is happening at the bench. You are probably also going to want to know which of those many thousands of files go with which samples. Today's Next Gen systems allow you to partition your sequencing materials into slide chambers (also called lanes and sections) to give between eight and 32 separate data sets per run. To track samples and lab workflows, and link data and results together, you will need a software system that can perform the following basic functions:

Allow you to set up different interfaces to collect experimental information
Assign specific workflows to experiments
Track the workflow steps in the laboratory
Prepare samples for data collection runs
Link data from the runs back to the original samples
Process data according to the needs of the experiment

And if you are a core lab you'll likely want to set up experiments as services and create billing statements for the work.
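As a small illustration of the "link data back to the original samples" function, the Python sketch below maps output files to samples by the lane they came from. Everything specific here is an assumption: the file naming scheme (`..._lane<N>.<ext>`), the sample names, and the function name are made up, and real instruments name their files differently.

```python
import re
from typing import Dict, List

# Hypothetical naming scheme: the lane number appears as "_lane<N>." in the file name.
LANE_PATTERN = re.compile(r"_lane(\d+)\.")


def link_files_to_samples(
    lane_to_sample: Dict[int, str], filenames: List[str]
) -> Dict[str, List[str]]:
    """Group a run's output files under the sample loaded in each lane."""
    linked: Dict[str, List[str]] = {sample: [] for sample in lane_to_sample.values()}
    for filename in filenames:
        match = LANE_PATTERN.search(filename)
        if match is None:
            raise ValueError(f"no lane number found in {filename!r}")
        lane = int(match.group(1))
        if lane not in lane_to_sample:
            raise ValueError(f"{filename!r} is from lane {lane}, which has no sample")
        linked[lane_to_sample[lane]].append(filename)
    return linked


# Lane assignments recorded when the run was set up (invented sample names).
lanes = {1: "mouse-liver-A", 2: "mouse-liver-B", 3: "control"}
files = ["run42_lane1.txt", "run42_lane2.txt", "run42_lane3.txt", "run42_lane1.qual"]
print(link_files_to_samples(lanes, files))
```

The mapping only works if the lane assignments were recorded when the run was set up, which is exactly the bench tracking described above.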

Traditionally, this kind of system was only possible through custom software development: either you did it yourself or you worked with a company to build the features that were needed. Now you can get this support in a software product that is quick to deploy and can be configured to your needs. Over the coming weeks and months I'll show you how this can be done with the Geospiza FinchLab. If you want to know now, give us a call.