Wednesday, October 8, 2008

Road Trip: AB SOLiD Users Meeting

Wow! That's the best way to summarize my impressions from the Applied Biosystems (AB) SOLiD users conference last week, when AB launched their V3 SOLiD platform. AB claims that this system will be capable of delivering a human genome's worth of data for about $10,000 US.

Last spring, the race to the $1000 genome leaped forward when AB announced that they sequenced a human genome at 12-fold coverage for $60,000. When the new system ships in early 2009, that same project can be completed for $10,000. Also, this week others have claimed progress towards a $5000 human genome.

That's all great, but what can you do with this technology besides human genomes?

That was the focus of the SOLiD users conference. For a day and a half, we were treated to presentations from scientists and product managers from AB as well as SOLiD customers who have been developing interesting applications. Highlights are described below.

Technology Improvements:

Increasing Data Throughput - Practically everyone is facing the challenge of dealing with large volumes of data, and now we've learned the new version of the SOLiD system will produce even more. A single instrument run will produce between 125 million and 400 million reads, depending on the application. This scale-up is achieved by increasing the bead density on a slide, which also lowers the cost per read. Read lengths are also increasing, making it possible to get between 30 and 40 gigabases of data from a run. And the amount of time required for each run is shrinking; not only can you get all of these data, you can do it again more quickly.
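To put those run yields in context, here is a back-of-the-envelope sketch in Python that converts gigabases per run into raw fold coverage of a human genome. The genome size of about 3.1 gigabases is my assumption, not a number from the meeting, and the function name is made up for illustration. The calculation also ignores reads that fail filtering or cannot be aligned, so usable coverage would be lower.

```python
# Assumed haploid human genome size in bases (not a figure from the meeting).
HUMAN_GENOME_BASES = 3.1e9


def fold_coverage(bases_sequenced, genome_bases=HUMAN_GENOME_BASES):
    """Raw fold coverage: total sequenced bases divided by genome size."""
    return bases_sequenced / genome_bases


# The 30 to 40 gigabase range quoted for a single run.
low_estimate = fold_coverage(30e9)
high_estimate = fold_coverage(40e9)
```

Under that assumption, one run works out to roughly 10-fold to 13-fold raw coverage, which is in the same range as the 12-fold genome mentioned above.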

Increasing Sample Scale - Many people like to say, yes, the data is a problem, but at least the sample numbers are low, so sample tracking is not that hard.

Maybe they spoke too soon.

AB and the other companies with Next Gen technologies are working to deliver "molecular barcodes" that allow researchers to combine multiple samples on a single slide. This is called "multiplexing." In multiplexing, the samples are distinguished by tagging each one with a unique sequence, the barcode. After the run, the software uses the sequence tags to sort the data into their respective data sets. The bottom line is that we will go from a system that generates a lot of data from a few samples, to a system that generates even more data from a lot of samples.
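The sorting step can be sketched in a few lines of Python. This is only an illustration, not AB's software: the barcode sequences, the sample names, and the idea that the barcode sits in the first four bases of each read are all assumptions I made up for the example.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample table; real kits define their own tags.
BARCODES = {"ACGT": "sample_1", "TGCA": "sample_2"}
BARCODE_LENGTH = 4


def demultiplex(reads, barcodes=BARCODES, barcode_length=BARCODE_LENGTH):
    """Group reads by sample, treating the leading bases as the barcode.

    Reads assigned to a sample have the barcode trimmed off.
    Reads whose tag matches no known barcode are kept whole
    under the key "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        tag = read[:barcode_length]
        if tag in barcodes:
            bins[barcodes[tag]].append(read[barcode_length:])
        else:
            bins["unassigned"].append(read)
    return dict(bins)


reads = ["ACGTGGGTTT", "TGCAAACCCG", "ACGTCCCAAA", "NNNNGGGGGG"]
by_sample = demultiplex(reads)
```

A production system would also have to tolerate sequencing errors inside the barcode itself; this sketch only handles exact matches.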


What you can do with hundreds of millions of reads: On the science side, there were many good presentations that focused on RNA-Seq and variant detection using the SOLiD system. Of particular interest was Dr. Gail Payne's presentation on the work, recently published in Genome Research, entitled "Whole Genome Mutational Profiling Using Next Generation Sequencing Technology." In the paper, the 454, Illumina, and SOLiD sequencing platforms were compared for their abilities to accurately detect mutations in a common system. This is one of the first head-to-head-to-head comparisons to date. Like the presidential debates, I'm sure each platform will be claimed to be the best by its vendor.

From the presentation and paper, the SOLiD platform does offer a clear advantage in its total throughput capacity. 454 showed the long-read advantage, in that approximately 1.5% more of the yeast genome studied was covered by 454 data than by the shorter-read technologies. And the SOLiD system, with its dibase (color space) encoding, seemed to provide higher sequence accuracy. When the reads were normalized to the same levels of coverage, a small advantage for SOLiD can be seen.
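For readers who have not worked with color space, here is a minimal Python sketch of why dibase encoding can help with accuracy. It uses the commonly described two-bit scheme, in which each color is the XOR of the codes of two adjacent bases. The function names and the example sequences are mine, and this is not AB's code. Because every base contributes to two adjacent colors, a real single-base substitution changes two neighboring colors.

```python
# Two-bit code for each base; the color for a pair of adjacent bases
# is the XOR of their codes, giving a value from 0 to 3.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(sequence):
    """Encode a base sequence as a list of colors, one per adjacent pair."""
    codes = [BASE_CODE[base] for base in sequence]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def changed_positions(colors_a, colors_b):
    """Return the indices at which two equal-length color lists differ."""
    return [i for i, (a, b) in enumerate(zip(colors_a, colors_b)) if a != b]


reference = "ATGGCA"
with_snp = "ATGTCA"  # G -> T substitution at the fourth base

ref_colors = to_colors(reference)   # [3, 1, 0, 3, 1]
snp_colors = to_colors(with_snp)    # [3, 1, 1, 2, 1]
differences = changed_positions(ref_colors, snp_colors)  # [2, 3]
```

In this scheme, a read that differs from the reference at just one isolated color is more likely to reflect a measurement error than a true substitution, which is the property an aligner can exploit.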

When false positive rates of mutation detection were compared, SOLiD had zero for all levels of coverage (6x, 8x, 10x, 20x, 30x, 175x [full run of two slides]), Illumina had two false positives at 6x and 13x, and zero false positives for 19x and 44x (full run of one slide) coverage, and 454 had 17, six, and one false positive for 6x, 8x, and 11x (full run) coverage, respectively.

In terms of false negative (missed) mutations, all platforms did a good job. At coverages above 10x, none of the platforms missed any mutations. The 454 platform missed a single mutation at 6x and 8x coverage and Illumina missed two mutations at 6x coverage. SOLiD, on the other hand, missed four and five at 8x and 6x coverage, respectively.

What was not clear from the paper and data was the reproducibility of these results. From what I can tell, single DNA libraries were prepared and sequenced, but replicates were lacking. Would the results change if each library preparation and sequencing process were repeated?

Finally, the work demonstrates that it is very challenging to perform a clean "apples to apples" comparison. The 454 and Illumina data were aligned with Mosaik and the SOLiD data were aligned with MapReads. Since each system produces a different error profile, and each software program makes different assumptions about how to use that profile to align data and assess variation, the results should not be overinterpreted. I do, however, agree with the authors that these systems are well-suited for rapidly detecting mutations in a high-throughput manner.

ChIP-Seq / RNA-Seq: On the second day, Dr. Jessie Gray presented work on combining ChIP-Seq and RNA-Seq to study gene expression. This is important work because it illustrates the power of Next Gen technology and creative ways in which experiments can be designed.

Dr. Gray's experiment was designed to look at this question: When we see that a transcription factor is bound to DNA, how do we know if that transcription factor is really involved in turning on gene expression?

ChIP-Seq allows us to determine where different transcription factors are bound to DNA at a given time, but it does not tell us whether that binding event turned on transcription. RNA-Seq tells us if transcription is turned on after a given treatment or point in time, but it doesn't tell us which transcription factors were involved. Thus, if we can combine ChIP-Seq and RNA-Seq measurements, we can begin to build a cause-and-effect model: find where a transcription factor is binding and which genes it potentially controls.
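At its simplest, combining the two measurements is a join between a list of bound genes and a table of expression changes. The Python sketch below shows that idea only. It is not Dr. Gray's analysis; the gene names, the fold-change values, and the two-fold threshold are invented, and it assumes the ChIP-Seq peaks have already been assigned to genes.

```python
def classify_genes(bound_genes, fold_changes, min_fold_change=2.0):
    """Label each gene by whether it is bound and whether it is induced.

    bound_genes: set of gene names with a ChIP-Seq peak assigned to them.
    fold_changes: dict of gene name -> RNA-Seq fold change after treatment.
    Genes with no expression measurement are treated as unchanged (1.0).
    """
    labels = {}
    for gene in sorted(set(bound_genes) | set(fold_changes)):
        bound = gene in bound_genes
        induced = fold_changes.get(gene, 1.0) >= min_fold_change
        if bound and induced:
            labels[gene] = "bound and induced"
        elif bound:
            labels[gene] = "bound only"
        elif induced:
            labels[gene] = "induced only"
        else:
            labels[gene] = "neither"
    return labels


# Hypothetical inputs for illustration.
bound_genes = {"geneA", "geneB"}
fold_changes = {"geneA": 5.2, "geneB": 1.1, "geneC": 3.4, "geneD": 0.9}
labels = classify_genes(bound_genes, fold_changes)
```

The "bound and induced" genes are the candidates for direct regulation; the other categories are where the interesting follow-up questions start.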

This might be harder than it sounds:

As I listened to this work, I was struck by two challenges. On the computational side, one has to not only think about how to organize and process the sequence data into alignments and reduce those aligned datasets into organized tables that can be compared, but also how to create the right kind of interfaces for combining and interactively exploring the data sets.

On the biochemistry side, the challenges presented with ChIP-Seq reminded me of the old adage of trying to purify disappearase - "the more you purify the less there is." ChIP-Seq and other assays that involve multiple steps of chemical treatment and purification produce vanishingly small amounts of material for sampling. The latter challenge complicates the first, because in systems where one works with "invisible" amounts of DNA, a lot of creative PCR, like "in-gel PCR," is required to generate sufficient quantities of sample for measurement.

PCR is good for many things, including generating artifacts. So, the computation problem expands. A software system that generates alignments, reduces them to data sets that can be combined in different ways, and provides interactive user interfaces for data exploration, must also be able to understand common artifacts so that results can be quality controlled. Data visualizations must also be provided so that researchers can distinguish biological observations from experimental error.
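One simple artifact check of this kind is to measure how many aligned reads share the same chromosome, start position, and strand, since heavy PCR amplification tends to produce many identical copies of one fragment. The Python sketch below is an assumption-laden illustration: the tuple layout and function name are hypothetical, and at very high coverage identical alignments also occur by chance, so the number is a warning flag, not proof of an artifact.

```python
from collections import Counter


def duplicate_fraction(alignments):
    """Fraction of alignments that repeat an earlier identical alignment.

    alignments: list of (chromosome, start, strand) tuples, one per read.
    """
    if not alignments:
        return 0.0
    counts = Counter(alignments)
    # Every copy beyond the first of each alignment counts as a duplicate.
    extra_copies = sum(n - 1 for n in counts.values())
    return extra_copies / len(alignments)


# Hypothetical alignments for illustration.
alignments = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
]
fraction = duplicate_fraction(alignments)
```

Here three reads pile up at one position, so two of the five reads count as duplicates.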

These are exactly the kinds of problems that Geospiza solves.
