Wednesday, October 22, 2008

Journal Club: Focus on Next Gen Sequencing

Yesterday I received my issue of Nature Biotechnology. This month it features Next-Generation (Next Gen) Sequencing. One editorial, one profile, three news features, a commentary, two perspectives, and two reviews discuss the origins, trials, tribulations and what’s coming next in Next Gen. For now, I'll focus on the editorial.

Bioinformatics is a big big issue

“If the data problem is not addressed, ABI’s SOLiD, 454’s GS FLX, Illumina’s GAII or any of the other deep sequencing platforms will be destined to sit in their air-conditioned rooms like a Stradivarius without a bow” was the closing statement in the lead editorial, “Prepare for the deluge.”

Reminds me of something I said a few months back.

In the editorial, Nature Biotechnology (NBT) makes a number of important points, starting with how the Roche/454 pyrosequencer, launched in 2005, could generate as much data as more than 50 ABI capillary sequencers. Since that launch, we have seen new instruments emerge that produce ever-increasing amounts of data by orders of magnitude. Or as NBT put it, “The overwhelming amounts of data being produced are the equivalent of taking a drink from a fire hose.”

It's like they read our web site (we ran the image below at the beginning of the year).

The volumes of data and new ways in which it must be worked with are creating many challenges. To begin, there is the conundrum of what to keep; do you keep raw images and processed reads? Or do you just keep the reads? If you keep raw images, the costs are significant. The cost of storing all that information must be considered in the context of the likelihood of whether you will ever need to go back to these data. We call this the data life cycle.

From raw images, the next challenge is the computational infrastructure needed to process reads and obtain meaningful information. This is a complex process that involves many steps and high performance computers. NBT made the accurate and important point that the instrument manufacturers provide software only to analyze what comes off of the machine for common applications. A great deal of bioinformatics support is needed for downstream analysis once the initial data alignments or assemblies are completed. Also, standards for comparing data between instrument platforms are lacking, making it difficult to compare results from different instruments.

While more is needed in terms of bioinformatics support, being able to get tools for alignment and assembly is a good starting point, and NBT lauded ABI’s SOLiD community program as a step in the right direction. This kind of approach is also needed from the other instrument vendors. Presently, Illumina and Roche include their tools with an instrument purchase. This is fine for the laboratory that owns the instrument, but it makes a hard problem harder for researchers who receive data sets from different labs. That could lead to a great deal of frustration.

As the article continued, the "overwhelmed" scale increased to dire.

NBT stated:
“What all of this means is that for the foreseeable future, next-generation sequencing platforms may remain out of the hands of labs lacking the deep pockets needed for bioinformatics support.”
They also added,
“Thus, if the next-generation platforms are to truly democratize sequencing—bringing genomics out of the main sequencing centers and into the laboratories of single investigators or small academic consortia—much more effort needs to be expended in developing cost-effective software and data management solutions.”

NBT offered some solutions, including getting the instrument vendors to develop community based solutions, and encouraging the grant funding organizations to fund bioinformatics as much as they fund sequencing.

Is Next Gen for everyone?

The NBT editors made a lot of great points, but we do not see the world in as dire terms as they do. Yes, a great part of the challenge of getting up and running with this equipment is preparing for the informatics hurdles that await. Next Gen is not Sanger. You cannot look at every read to figure out what your data mean, and you will need a serious computational infrastructure to store, organize and work with the data. Also, not mentioned in the article, but incredibly important, you will need a laboratory information management system to organize your experimental information and track the many steps needed to prepare good DNA libraries for sequencing.

And, there are solutions.

Geospiza’s FinchLab, combined with our Software as a Service (SaaS) delivery, provides immediate access to the software and hardware infrastructure needed to run these new instruments.

FinchLab delivers the software infrastructure to support laboratory workflows for all the platforms, links the resulting data to samples, and - through a growing list of data analysis pipelines and visualization interfaces - provides the necessary bioinformatics for a wide range of sequencing applications. Further, our bioinformatics approach is community-based. We are working with the best tools as they emerge and are collaborating with multiple groups to advance additional research and development.

SaaS delivers the computing infrastructure on demand. With our SaaS model, the computer infrastructure is always available and grows with your needs. You do not have to set up a large computer system, build a new building, or risk over- or under-investing to deal with the data.

With FinchLab, the vision of next-generation platforms truly democratizing sequencing can be realized.

Friday, October 17, 2008

Uploading your data to iFinch

iFinch is a scaled down version of our V2 Finch system for genetic analysis. 

Unlike our larger, industrial strength systems, iFinch is designed for individual researchers, small labs, or teachers who want a trouble-free system for managing and working with genetic data.  Currently, students and teachers are using iFinch as part of the Bio-Rad Explorer Cloning and Sequencing kit.

I call iFinch "bioinformatics in a box." I've used iFinch in two bioinformatics courses and it's been pretty helpful. iFinch and FinchTV play nicely together and the combination works well for students.

You don't even have to get a computer for storing data or learn how to manage a database. We do all that for you and you use the system through the web. It's nice and painless.

If you received an iFinch account from Bio-Rad, you will need to turn on your data processor before you begin uploading data.

Checking and starting your Finch data processor
1.  Log into your iFinch account.
2.  Find and select the Data Processor link in the System menu.

3. Look at the Data processor status.

4. If the Data processor has stopped, you will need to Restart it by selecting the Restart button.  If you are a student, you will need to have an instructor log in and do this.

Once your data processor has been started, you can go ahead and upload your data as shown in the movie below.

Uploading your data
The first thing we do with iFinch is to put our data into the iFinch database. In the movie, you can see how we upload chromatograms through the web interface.

iFinch can store any kind of file, but it really shines when it comes to working with chromatograms or genotyping data.

If you have lots of files (more than a few 96 well plates), we do have other systems for uploading data. But, that's another post and another movie.

Wednesday, October 8, 2008

Road Trip: AB SOLiD Users Meeting

Wow! That's the best way to summarize my impressions from the Applied Biosystems (AB) SOLiD users conference last week, when AB launched their V3 SOLiD platform. AB claims that this system will be capable of delivering a human genome's worth of data for about $10,000 US.

Last spring, the race to the $1000 genome leaped forward when AB announced that they sequenced a human genome at 12-fold coverage for $60,000. When the new system ships in early 2009, that same project can be completed for $10,000. Also, this week others have claimed progress towards a $5000 human genome.

That's all great, but what can you do with this technology besides human genomes?

That was the focus of the SOLiD users conference. For a day and a half, we were treated to presentations from scientists and product managers from AB as well as SOLiD customers who have been developing interesting applications. Highlights are described below.

Technology Improvements:

Increasing Data Throughput - Practically everyone is facing the challenge of dealing with large volumes of data, and now we've learned the new version of the SOLiD system will produce even more. A single instrument run will produce between 125 million and 400 million reads depending on the application. This scale up is achieved by increasing the bead density on a slide, dropping the overall cost per individual read. Read lengths are also increasing, making it possible to get between 30 and 40 gigabases of data from a run. And, the amount of time required for each run is shrinking; not only can you get all of these data, you can do it again more quickly.

Increasing Sample Scale - Many people like to say, yes, the data is a problem, but at least the sample numbers are low, so sample tracking is not that hard.

Maybe they spoke too soon.

AB and the other companies with Next Gen technologies are working to deliver "molecular barcodes" that allow researchers to combine multiple samples on a single slide. This is called "multiplexing." In multiplexing, the samples are distinguished by tagging each one with a unique sequence, the barcode. After the run, the software uses the sequence tags to sort the data into their respective data sets. The bottom line is that we will go from a system that generates a lot of data from a few samples, to a system that generates even more data from a lot of samples.
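To make the sorting step concrete, here is a minimal sketch of demultiplexing in Python. The barcode length, barcode sequences, and sample names are all illustrative assumptions, not from any vendor's kit, and real pipelines also tolerate sequencing errors in the barcode (e.g., one-mismatch rescue), which this sketch omits.

```python
# Minimal demultiplexing sketch: sort reads into per-sample bins by an
# exact-match barcode at the 5' end of each read. Barcodes and sample
# names below are made-up examples for illustration.

BARCODE_LEN = 6

# Hypothetical barcode-to-sample assignments; real kits define their own.
SAMPLES = {
    "ACGTAC": "sample_1",
    "TGCATG": "sample_2",
}

def demultiplex(reads):
    """Group reads by barcode; unrecognized barcodes go to 'unassigned'."""
    bins = {name: [] for name in SAMPLES.values()}
    bins["unassigned"] = []
    for read in reads:
        barcode, insert = read[:BARCODE_LEN], read[BARCODE_LEN:]
        bins[SAMPLES.get(barcode, "unassigned")].append(insert)
    return bins

bins = demultiplex(["ACGTACTTGGCC", "TGCATGAACCGG", "NNNNNNTTTTTT"])
print(bins["sample_1"])  # ['TTGGCC']
```

The key design point is the "unassigned" bin: reads whose barcodes cannot be recognized are kept rather than silently dropped, so the loss rate from barcoding itself can be monitored.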


What you can do with hundreds of millions of reads: On the science side, there were many good presentations that focused on RNA-Seq and variant detection using the SOLiD system. Of particular interest was Dr. Gail Payne's presentation on the work, recently published in Genome Research, entitled "Whole Genome Mutational Profiling Using Next Generation Sequencing Technology." In the paper, the 454, Illumina, and SOLiD sequencing platforms were compared for their abilities to accurately detect mutations in a common system. This is one of the first head to head to head comparisons to date. Like the presidential debates, I'm sure each platform will be claimed to be the best by its vendor.

From the presentation and paper, the SOLiD platform does offer a clear advantage in total throughput capacity. 454 showed the long-read advantage: approximately 1.5% more of the yeast genome studied was covered by 454 data than by the shorter-read technologies. And the SOLiD system, with its dibase (color space) encoding, seemed to provide higher sequence accuracy. When the reads were normalized to the same levels of coverage, a small advantage for SOLiD can be seen.

When false positive rates of mutation detection were compared:
  • SOLiD had zero false positives at all levels of coverage tested (6x, 8x, 10x, 20x, 30x, and 175x, a full run of two slides).
  • Illumina had two false positives at 6x and 13x coverage, and zero at 19x and 44x (a full run of one slide) coverage.
  • 454 had 17, six, and one false positive at 6x, 8x, and 11x (a full run) coverage, respectively.

In terms of false negative (missed) mutations, all platforms did a good job. At coverages above 10x, none of the platforms missed any mutations. The 454 platform missed a single mutation at 6x and 8x coverage and Illumina missed two mutations at 6x coverage. SOLiD, on the other hand, missed four and five at 8x and 6x coverage, respectively.

What was not clear from the paper and data was the reproducibility of these results. From what I can tell, single DNA libraries were prepared and sequenced; replicates were lacking. Would the results change if each library preparation and sequencing process were repeated?

Finally, the work demonstrates that it is very challenging to perform a clean "apples to apples" comparison. The 454 and Illumina data were aligned with Mosaik and the SOLiD data were aligned with MapReads. Since each system produces different error profiles, and the different software programs each make different assumptions about how to use those error profiles to align data and assess variation, the results should not be over-interpreted. I do, however, agree with the authors that these systems are well-suited for rapidly detecting mutations in a high throughput manner.

ChIP-Seq / RNA-Seq: On the second day, Dr. Jessie Gray presented work on combining ChIP-Seq and RNA-Seq to study gene expression. This is important work because it illustrates the power of Next Gen technology and creative ways in which experiments can be designed.

Dr. Gray's experiment was designed to look at this question: When we see that a transcription factor is bound to DNA, how do we know if that transcription factor is really involved in turning on gene expression?

ChIP-Seq allows us to determine where different transcription factors are bound to DNA at a given time, but it does not tell us whether that binding event turned on transcription. RNA-Seq tells us if transcription is turned on after a given treatment or point in time, but it doesn't tell us which transcription factors were involved. Thus, if we can combine ChIP-Seq and RNA-Seq measurements, we can elucidate a cause and effect model and find where a transcription factor is binding and which genes it potentially controls.
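The core of that combination is a set intersection, sketched below. All coordinates, peak positions, fold-change values, and the 5 kb promoter window are invented for illustration; real analyses use genome annotations, statistically called peaks, and proper differential-expression tests.

```python
# Sketch of combining ChIP-Seq and RNA-Seq: a gene is a candidate
# direct target if a binding peak lies near its transcription start
# site (TSS) AND its expression changes after treatment.
# All numbers below are made-up illustrative values.

WINDOW = 5_000  # call a gene "bound" if a peak is within 5 kb of its TSS

genes = {"geneA": 10_000, "geneB": 50_000, "geneC": 90_000}  # TSS positions
peaks = [9_200, 48_500]                                      # ChIP-Seq peak summits
fold_change = {"geneA": 4.0, "geneB": 1.1, "geneC": 3.5}     # RNA-Seq ratios

bound = {g for g, tss in genes.items()
         if any(abs(p - tss) <= WINDOW for p in peaks)}
induced = {g for g, fc in fold_change.items() if fc >= 2.0}

print(sorted(bound & induced))  # candidate direct targets: ['geneA']
```

In this toy example, geneB is bound but not induced (binding without a transcriptional effect) and geneC is induced but not bound (likely an indirect effect), which is exactly the cause-and-effect distinction the combined experiment is after.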

This might be harder than it sounds:

As I listened to this work, I was struck by two challenges. On the computational side, one has to not only think about how to organize and process the sequence data into alignments and reduce those aligned datasets into organized tables that can be compared, but also how to create the right kind of interfaces for combining and interactively exploring the data sets.

On the biochemistry side, the challenges presented with ChIP-Seq reminded me of the old adage about trying to purify disappearase - "the more you purify, the less there is." ChIP-Seq and other assays that involve multiple steps of chemical treatments and purification produce vanishingly small amounts of material for sampling. The latter challenge complicates the first challenge, because in systems where one works with "invisible" amounts of DNA, a lot of creative PCR, like "in gel PCR," is required to generate sufficient quantities of sample for measurement.

PCR is good for many things, including generating artifacts. So, the computation problem expands. A software system that generates alignments, reduces them to data sets that can be combined in different ways, and provides interactive user interfaces for data exploration, must also be able to understand common artifacts so that results can be quality controlled. Data visualizations must also be provided so that researchers can distinguish biological observations from experimental error.

These are exactly the kinds of problems that Geospiza solves.

Monday, October 6, 2008

Sneak Peek: Genetic Analysis From Capillary Electrophoresis to SOLiD

On October 7, 2008 Geospiza hosted a webinar featuring the FinchLab, the only software product to track the entire genetic analysis process, from sample preparation, through processing to analyzed results.

If you are as disappointed about missing it as we are about you missing it, no worries. You can get the presentation here.

If you are interested in:
  • Learning about Next Gen sequencing applications
  • Seeing what makes the Applied Biosystems SOLiD system powerful for transcriptome analysis, ChIP-Seq, resequencing experiments, and other applications
  • Understanding the flow of data and information as samples are converted into results
  • Overcoming the significant data management challenges that accompany Next Gen technologies
  • Setting up Next Gen sequencing in your core lab
  • Creating a new lab with Next Gen technologies
This webinar is for you!

In the webinar, we talked about the general applications of Next Gen sequencing and focused on using SOLiD to perform Digital Gene Expression experiments by highlighting mRNA Tag Profiling and whole transcriptome analysis. Throughout the talk we gave specific examples about collecting and analyzing SOLiD data and showed how the Geospiza FinchLab solves challenges related to laboratory setup and managing Next Gen data and analysis workflows.