Thursday, March 13, 2008

What's a Bustard?

For that matter, what's a Firecrest? or a Gerald? Many with an Illumina Genome Analyzer are now learning these are the directories that have the data they may be interested in.

What's in those directories?

In this post, we explore some of the data in the directories, talk about what data might be important, and use FinchLab Next Gen Edition (FinchLab NG) to look at some of the files. In the Next Gen world we are also going to be learning about the data life cycle. When you are thinking about how to store three or four or ten terabytes (TB) of data for each run, and considering that you might run your instrument 40 or 50 times or more in the next year, you might stop and ask the question, "how much of that data is really important and for how long?" That's the data life cycle. It's going to be important.

To begin our understanding, let's look at the data being created in a run. When an Illumina Genome Analyzer (aka Solexa) collects data, many things happen. First, images are collected for each cycle in a run and tile in a lane on a slide. They're pretty small, but there are a lot, maybe 360,000 or so and they add up to the terabytes we talk about. These images are analyzed to create tens of thousands (about 100 gigabytes [GB] worth) of "raw intensity files" that go in the Firecrest directory. Next, a base-calling algorithm reads the raw intensity files to create sequence, quality and other files (about 80 GB worth) that go in the Bustard directory. The last step is the Eland program pipleline. It reads the Bustard files, aligns their data to reference sequences, makes more quality calculations, and creates more files. These data go in the Gerald directory to give about 25 or 30 GB of sequence and quality data.

So, what's the best data to work with? That depends on what problem you are trying to solve. Specialists developing new basecalling tools or alignment tools might focus on the data in Firecrest and Bustard. Most researchers, however, are going to work with data in the Gerald directory. That reduces our TB problem down to a tens of GB problem. That's a big difference!

FinchLab NG can help.

FinchLab NG gives you the LIMS capabilities to run your Next Gen laboratory workflows and track which samples go on which slides and where on the slide the samples go. We call this part the run. When a run is complete you can link the data on your filesystem to FinchLab NG and use the web interfaces to explore the data. You can also link specific data files to samples. So, if you are sharing data or operating a core lab your researchers can easily access their data through their Finch account.

The screen shot below gives an example of how the HTML quality files can be explored. It shows two windows, the one on the left is the FinchLab NG with data for a Solexa run. You can see that the directory has 3606 files and a number are htm summary files. You can find these 12 files in that directory of 3606 files entering "htm" in the Finch Finder.The window on the right was obtained by clicking the "Intensity Plots" link that is directly below the info table and just above the data list. In this example the intensity plots are shown for each tile of the 8 lanes on the slide. To see this better click the image and zoom in with your browser.

No comments: