FinchTalk: 2008

Wednesday, December 31, 2008

Closing 2008

As we bring 2008 to a close, it is a good time to reflect on our progress and think about the new year ahead. Despite the world economic news, both the genomics field in general and Geospiza specifically have many positive accomplishments to show for the year.

In February, we introduced FinchLab for Next Gen Sequencing at the AGBT and ABRF conferences. At these shows, it was clear that Next Gen Sequencing was going to change the ways we think about applying DNA sequencing to interrogate a multitude of genetic and functional genomics problems. Over the course of 2008, many papers were published demonstrating the value of the massively parallel sequencing technology. MassGenomics dubbed 2008: Year of the Cancer Genome. Other blogs are following suit with articles on personal genomics and other advancements provided largely through Next Gen Sequencing.

Throughout the year, we also learned that while you can do a lot with a huge amount of data, working with the data is extremely challenging. Conference presentations and editorials in journals frequently made this point. While many of these articles focused on the data management challenge, groups acquiring the technology were also learning that the challenges go beyond data management. Comprehensive software systems are needed to manage all facets of the process, from tracking how samples are prepared for specific experiments to how the data are stored and organized, to analyzing and presenting the data according to the experiment being performed. In short, we learned that Next Gen technologies produce sequence data in different ways and require that we think about DNA sequencing in new ways.

Geospiza’s Version 3 Software Platform and GeneSifter

To address these new challenges, and expand support for existing technologies, Geospiza accomplished two significant milestones in 2008. First, we released the third version of our software platform that supports both laboratory workflows and data analysis automation. Through this system, laboratories are able to set up different interfaces to collect experimental information, assign specific workflows to experiments, track the workflow steps in the laboratory, prepare samples for data collection runs, link data back to the original samples and process data according to the needs of the experiment - without any programming. More importantly, for those who want to develop data analysis pipelines, the system provides a deployable environment that lets you add new pipelines and make them easily accessible.

The second major milestone was our acquisition of GeneSifter. GeneSifter is an award-winning microarray data analysis product. With GeneSifter , Geospiza can deliver complete end to end systems for data intensive genetic analysis applications like microarrays and Next Gen sequencing based transcription. Also, GeneSifter, like Geospiza’s other products is web-based and can be delivered as a Software as a Service (SaaS) product.

SaaS was one of the important themes for 2008. Geospiza understands well that data intensive science requires a significant IT (Information Technology) investment. Throughout 2008, we saw first-hand that groups building their own IT infrastructures were not only challenged by investing heavily in quickly depreciating hardware assets, they experienced basic infrastructure challenges like having enough space, power, and cooling systems for the equipment. If those problems were solved, there were the other challenges with getting systems set up, running, installing software, and having experienced people - and time - to maintain the infrastructure. SaaS solves those problems and off loads the burden of maintaining expensive infrastructures. For a number of groups, locally run systems are the right choice. However, it is a choice that should be carefully thought out and well-planned. In our experience, customers choosing the SaaS option were up and running quicker at a lower cost than our customers who chose to build their systems.

As we close 2008 and look forward to 2009, we want to especially thank our customers for their support and the interesting problems they have invited us to help solve.

Friday, December 12, 2008

Papers, Papers, and more Papers

Next Gen Sequencing is hot, hot, hot! You can tell by the numbers and frequency in which papers are being published.

A few posts ago, I wrote about a couple of grant proposals that we were preparing on methods to detect rare variants in cancer and improve the tools and methods to validate datasets from quantitative assays that utilize Next Gen data, like RNA-Seq, ChIP-Seq, or Other-Seq experiments. Besides the normal challenges of getting two proposals written and uploaded to the NIH, there was an additional challenge. Nearly everyday, we opened the tables-of-contents in our e-mail and found a new papers highlighting Next Gen Sequencing techniques, applications, or biological discoveries made through Next Gen techniques. To date, over 200 Next Gen publications have been produced. During the last two months alone more than 30 papers have been published. Some of these (listed in the figure below) were relevant to the proposals we were drafting.

The papers highlighted many of the themes we've touched on here, including the advantages of Next Gen sequencing and challenges with dealing with the data. As we are learning, these technologies allow us to explore the genome and genomics of systems biology at significantly higher resolutions than previously imagined. In one of the higher profile efforts, teams at the Washington University School of Medical and Genome Center compared a leukemia genome to a normal genome using cells from the same patient. This first intra-person whole genome analysis identified acquired mutations in ten genes, eight of which were new. Interestingly, the eight genes have unknown functions and might be important some day for new therapies.

Next Gen technologies are also confirming that molecular biology is more complicated than we thought. For example, the four most recent papers in Science show us that not only is 90% of the genome actively transcribed, but many genes have both sense and anti-sense RNA expressed. It is speculated that the anti-sense transcripts have a role in regulating gene expression. Also, we are seeing that nearly every gene produces alternatively spiced transcripts. The most recent papers indicate that between 92% and 97% of transcripts are alternatively spliced. My guess is that the only genes, not alternatively spliced are those lacking introns, like olfactory receptors. Although, when alternative transcription starts and alternative polyadenylation sites are considered, we may see that all genes are processed in multiple ways. It will be interesting to see how the products of alternative splicing and anti-sense transcription might interact.

This work has a number of take home messages.

Like astronomy, when we can see deeper we see more. Next Gen technologies are giving us the means to interrogate large collections of individual RNA or DNA molecules and speculate more on functional consequences.
Our limits are our imaginations. The reported experiments have used a variety of creative approaches to study genomic variation, sample expressed molecules from different strands of DNA, and measure protein DNA/RNA interaction.
Good hands do good science. As pointed out in the paper from the Sanger Center on their implementation of Next Gen sequencing, the processes are complex and technically demanding. You need to have good laboratory practices with strong informatics support for all phases (laboratory, data management, and data analysis) of the Next Gen sequencing processes.

The final point is very important and Geospiza’s lab management and data analysis products will simplify your efforts in getting Next Gen systems running to make your major investment pay off and quickly publish results.

To see how, join us for a webinar next Wednesday, Dec. 17 at 10 am PDT, for RNA Expression Analysis with Geospiza.

Click on the figure to enlarge the text.

Wednesday, December 10, 2008

Sneak Peak: RNA Expression Analysis with Geospiza

Next Generation DNA sequencing is revolutionizing transcriptome analysis and giving us much deeper insights into the ways in which genes are expressed. Next Wednesday, December 17th, Geospiza will host a webinar on how FinchLab and GeneSifter simplify complex data analyses to turn millions of reads into informative datasets that can yield scientific insights.

Next Gen sequencing is quickly becoming an attractive option for gene expression analysis because the vast numbers of sequences that can be obtained provide a highly sensitive way to evaluate the RNA population inside of a cell. In addition to rRNA, tRNA, and mRNA, new assays are quickly emerging to measuring non-coding RNA and multiple classes of small RNAs as well. Moreover, as we obtain deeper information, largely through Next Gen, we learn that even mRNA is more complicated than previously thought. In yeast, 85% of the genome might be transcribed and new reports indicate that 92-97% of human genes undergo alternative splicing.

Next Gen sequencing applications such as RNA-Seq, Tag Profiling, and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies can generate 200 million data points in a single instrument and can completely characterize all known RNAs in a sample, and identify novel RNAs and novel splicing events for known RNAs.

Join us next Wed. Dec 17 at 10:00 am (PDT) as we provide an overview of two applications, RNA-Seq and miRNA-Seq, using examples from publicly available datasets. The presentation will include a discussion of the challenges and solutions for how sequence data from the transcriptome can be analyzed in routine ways with Geospiza’s products.

Register Now!

Further reading

Castle J.C., Zhang C., Shah J.K., Kulkarni A.V., Kalsotra A., Cooper T.A., Johnson J.M., 2008. Expression of 24,426 human alternative splicing events and predicted cis regulation in 48 tissues and cell lines. Nat Genet 40, 1416-1425.

David L., Huber W., Granovskaia M., Toedling J., Palm C.J., Bofkin L., Jones T., Davis R.W., Steinmetz L.M., 2006. A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci U S A 103, 5320-5325.

Mortazavi A., Williams B.A., McCue K., Schaeffer L., Wold B., 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621-628.

Seila A.C., Calabrese J.M., Levine S.S., Yeo G.W., Rahl P.B., Flynn R.A., Young R.A., Sharp P.A., 2008. Divergent Transcription from Active Promoters. Science.

Wang E.T., Sandberg R., Luo S., Khrebtukova I., Zhang L., Mayr C., Kingsmore S.F., Schroth G.P., Burge C.B., 2008. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470-476.

Wold B., Myers R.M., 2008. Sequence census methods for functional genomics. Nat Methods 5, 19-21.

Zamore P.D., Haley B., 2005. Ribo-gnome: the big world of small RNAs. Science 309, 1519-1524.

Tuesday, December 2, 2008

ABRF 2009 is just around the corner

Karen Jonscher said it best, “Reminder - register now for the [ABRF] Satellite Educational Workshops!”

In her email to the ABRF email forum, Karen reminded us that the ABRF Education Committee is excited to present five new Satellite Educational Workshops at ABRF 2009 in Memphis Tennessee on genomics and proteomics technologies. Of course, I think the most exciting topic is Next Generation DNA Sequencing, subtitled "Massively Parallel Sequencers in the Core Facility: Applications and Computation."

The workshop will have a full day of presentations and discussions. The first part, “Platforms and Applications” will focus on the laboratory perspective of running these systems. We will have three presentations on what is it like to prepare and run samples as well as troubleshoot equipment and how to review quality control data.

The second part, “Computation and Analysis” will tackle heady issues of what to do with the massive amounts of data being produced. Presenters will provide information ranging from an overview of data analysis, to data management infrastructures, to discovery based analysis for SNP and biomarker discovery.

During the day there will be time to meet and speak with the presenters as well as representatives from sponsoring companies. It will be good.

Both general information about all of the workshops and specific information about the next generation sequencing workshop are posted at the ABRF site. Don't wait, you might miss a great opportunity.

Hope to see you in Memphis.

Thursday, November 20, 2008

Introducing GeneSifter

Today, Geospiza announced the acquisition of the award-winning GeneSifter microarray data analysis product. This news has significant implications for Geospiza’s current and new customers. With GeneSifter and FinchLab, Geospiza will deliver complete end to end systems for data intensive genetic analysis applications like Next Gen sequencing and microarrays.

As an example, let's consider transcriptomics or gene expression. One goal of such experiments is to compare the relative gene expression between cells to see how different genes are up or down regulated as the cells change over time or respond to some sort of treatment.

The general process, whether it involves microarrays or Next Gen sequencing, is to measure the number of RNA molecules for a given gene, either over a period of time or after different treatments. Laboratory processes create the molecules to assay, the molecules are measured, data are collected, and we process the data to produce tables of information. These tables are then compared with one another to identify genes that are differentially expressed. With the gene expression results in hand, one can delve deeper by utilizing other databases like Entrez Gene or pathway sites to learn about gene function and gain insights.

From a systems perspective, you need a LIMS to define sample information and keep track of workflow steps and the data generated at the bench. You will also need to track which samples are on a slide, or lane, or well when the data are collected. You will need to store and organize the data by sample. Then, you will need to analyze the data through multiple programs in a pipelined process (filter, align ...) to produce information, like gene lists, that can be compared for each sample. You may want to review this information to see that your experiments are on track and then, if they are, you will want to compare the gene lists from different experiments to tell a story.

FinchLab, combined with Geospiza’s hosted Software as a Service (SaaS) delivery, solves challenges related to IT, LIMS, and the core data analysis. GeneSifter completes the process by delivering a software solution that lets you compare your gene lists. GeneSifter provides information about the relative gene expression between samples and links gene information to key public resources to uncover additional details.

It's an exciting time for those in the genetic analysis and genomics fields. New high throughput data collection technologies are giving scientists the ability to interrogate systems and understand biology in a whole new way. As we come to the end of 2008 and think about 2009, Geospiza is excited to think about how we will integrate and extend our products to further develop end to end systems for a wide variety of genomics applications that target basic and clinical research to help us improve human health and well being.

Sunday, November 9, 2008

Next Gen-Omics

Advances in Next Gen technologies have led to a number of significant papers in recent months, highlighting their potential to advance our understanding of cancer and human genetics (1-3). These and the other 100's of papers demonstrate the value of Next Gen sequencing. The work completed thus far has been significant, but much more needs to be done to make these new technologies useful for a broad range of applications. Experiments will get harder.

While much of the discussion in the press focuses on rapidly sequencing human genomes for low cost as part of the grail of personalized genomics (4), a vast amount of research must be performed at the systems level to fully understand the relationship between biochemical processes in a cell and how the instructions for the processes are encoded in the genome. Systems biology and a plethora of "omics" have emerged to measure multiple aspects of cell biology as DNA is transcribed into RNA and RNA translated into protein and proteins interact with molecules to carry out biochemistry.

As noted in the last post we are developing proposals to further advance the state-of-the-art in working with Next Gen data sets. In one of those proposals, Geospiza will develop novel approaches to work with data from applications of Next Gen sequencing technologies that are being developed study the omics of DNA transcription and gene expression.

Toward furthering our understanding of gene expression, Next Gen DNA sequencing is being used to perform quantitative assays where DNA sequences are used as highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data to obtain quantitative information by tabulating the frequency, positional information, and variation from the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or in populations, are compared in different ways to make discoveries and draw experimental conclusions. Recall the three phases of data analysis.

However, to be useful these data sets need to come from experiments that measure what we think they should measure. The data must be high quality and free of artifacts. In order to compare quantitative information between samples, the data sets must be refined and normalized so that biases introduced through sample processing are accounted for. Thus, a fundamental challenge to performing these kinds of experiments is working with the data sets that are produced. In this regard numerous challenges exist.

The obvious ones relating to data storage and bioinformatics are being identified in both the press and scientific literature (5,6). Other, less published, issues include a lack of:

standard methods and controls to verify datasets in the context of their experiments,
standardized ways to describe experimental information and
standardized quality metrics to compare measurements between experiments.

Moreover data visualization tools and other user interfaces, if available, are primitive and significantly slow that pace at which a researcher can work with the data. Finally, information technology (IT) infrastructures that can integrate the system parts dealing with sample tracking, experimental data entry, data management, data processing and result presentation are incomplete.

We will tackle the above challenges by working with the community to develop new data analysis methods that can run independently and within Geospiza's FinchLab. FinchLab handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible order interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. Geospiza's hosted (Software as a Service [SaaS]) delivery models remove additional IT barriers.

FinchLab's data management and analysis server make the system scalable through a distributed architecture. The current implementation of the analysis server creates a complete platform to rapidly prototype new data analysis workflows and will allow us to quickly devise and execute feasibility tests, experiment with new data representations, and iteratively develop the needed data models to integrate results with experimental details.

References

1. Ley, T. J., Mardis, E. R., Ding, L., Fulton, B., et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66-72 (2008).

2. Wang, J., Wang, W., Li, R., Li, Y., et al. The diploid genome sequence of an Asian individual. Nature 456, 60-65 (2008).

3. Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53-59 (2008).

4. My genome. So what? Nature 456, 1 (2008).

5. Prepare for the deluge. Nature Biotechnology 26, 1099 (2008).

6. Byte-ing off more than you can chew. Nature Methods 5, 577 (2008).

Tuesday, November 4, 2008

November is about elections and grants

Halloween is over, and today is election day, go vote! Then, come back and read about what we planning for our next steps in Next Gen sequencing.

This month we are preparing SBIR proposals to target some of the real challenges researchers face when working with next-generation (Next Gen) sequence data.

The first will deal with issues related to detecting rare variants in cancer. With our collaborators we plan to develop control samples to detect different kinds of mutations and use the samples and data produced to develop new software tools and interfaces for measuring results. While the work will focus on cancer research, detecting rare variants in large datasets is a common problem with many applications.

The second proposal will deal with improving tools and methods to validate datasets from quantitative assays that utilize Next Gen data. When you run an RNA-Seq, ChIP-Seq, or Other-Seq experiment, where you collect numerous molecular tags from RNA or DNA in your research samples, how do you know that your data represent interesting biology and are free of artifacts? Pulling the relevent features, that distinguish biological reality from experimental artifact, out of datasets comprised of millions and millions of reads can be a real problem.

The above projects will leverge Geospiza's, Geospiza customers, and community experience to develop novel features and integrated resources that will be added to FinchLab, new products, and open-source contributions to make your work easier. If you are interested in learning more about these projects and (or) participating in working with us to takle the next-generation of hard problems contact us.

Wednesday, October 22, 2008

Journal Club: Focus on Next Gen Sequencing

Yesterday I received my issue of Nature Biotechnology. This month it features Next-Generation (Next Gen) Sequencing. One editorial, one profile, three news features, a commentary, two perspectives, and two reviews discuss the origins, trials, tribulations and what’s coming next in Next Gen. For now, I'll focus on the editorial.

Bioinformatics is a big big issue

“If the data problem is not addressed, ABI’s SOLiD, 454’s GS FLX, Illumina’s GAII or any of the other deep sequencing platforms will be destined to sit in their air-conditioned rooms like a Stradivarius without a bow” was the closing statement in the lead editorial “Prepare for the deluge.”

Reminds me of something I said a few months back.

In the editorial, Nature Biotechnology (NBT) makes a number of important points starting with how the launch of the Roche/454 pyrosequencer in 2005 could generate as much data as more than 50 ABI capillary sequencers. Since that launch, we have seen new instruments emerge that are producing ever increasing amounts of data by orders of magnitude. Or as NBT put it “The overwhelming amounts of data being produced are the equivalent of taking a drink from a fire hose.”

It's like they read our web site (we ran the image below at the beginning of the year).

The volumes of data and new ways in which it must be worked with are creating many challenges. To begin, there is the conundrum of what to keep; do you keep raw images and processed reads? Or do you just keep the reads? If you keep raw images, the costs are significant. The cost of storing all that information must be considered in the context of the likelihood of whether you will ever need to go back to these data. We call this the data life cycle.

From raw images, the next challenge is the computational infrastructure needed to process reads and obtain meaningful information. This is a complex process that involves many steps and high performance computers. NBT made the accurate and important point that the instrument manufacturer only provide the software to analyze what comes off of the machine for common applications. A great deal of bioinformatics support is needed for downstream analysis once the initial data alignments or assemblies are completed. Also, standards for comparing data between instrument platforms are lacking. This makes it difficult to compare results from different instruments.

While more is needed in terms of bioinformatics support, being able to get tools for alignment and assembly is a good starting point and NBT lauded ABI’s SOLiD community program as a step in the right direction. This kind of approach is also needed by the other instrument vendors. Presently Illumina and Roche include their tools with an instrument purchase. This is fine for the laboratory, but it makes a hard problem harder for any researchers who might be getting data sets from different labs. This could lead to threads of frustration.

As the article continued, the "overwhelmed" scale increased to dire.

NBT stated:

“What all of this means is that for the foreseeable future, next-generation sequencing platforms may remain out of the hands of labs lacking the deep pockets needed for bioinformatics support.”

They also added,

“Thus, if the next-generation platforms are to truly democratize sequencing—bringing genomics out of the main sequencing centers and into the laboratories of single investigators or small academic consortia—much more effort needs to be expended in developing cost-effective software and data management solutions.”

NBT offered some solutions, including getting the instrument vendors to develop community based solutions, and encouraging the grant funding organizations to fund bioinformatics as much as they fund sequencing.

Is Next Gen for everyone?

The NBT editors made a lot of great points, but we do not see the world in as dire terms as they do. Yes, a great challenge to Next Gen and getting up and running with this equipment includes preparing for the informatics challenges that await. Next Gen is not Sanger. You cannot look at every read to figure out what your data mean and you will need a serious computational infrastructure to store, organize and work with the data. Also, not mentioned in the article, but incredibly important, you will need a laboratory information management system to organize your experimental information and track the many steps needed to prepare good DNA libraries for sequencing.

And, there are solutions.

Geospiza’s FinchLab combined with our Software as a Service (SaaS) delivery, provides immediate access to the necessary software and hardware infrastructure to run these new instruments.

FinchLab delivers the software infrastructure to support laboratory workflows for all the platforms, links the resulting data to samples, and - through a growing list of data analysis pipelines and visualization interfaces - provides the necessary bioinformatics for a wide range of sequencing applications. Further, our bioinformatics approach is community-based. We are working with the best tools as they emerge and are collaborating with multiple groups to advance additional research and development.

SaaS delivers the computing infrastructure on demand. With our SaaS model, the computer infrastructure is always available and grows with your needs. You do not have to set up a large computer system, or build a new building, or risk over or under investing to deal with the data.

With FinchLab, the vision of next-generation platforms truly democratizing sequencing can be realized.

Friday, October 17, 2008

Uploading your data to iFinch

iFinch is a scaled down version of our V2 Finch system for genetic analysis.

Unlike our larger, industrial strength systems, iFinch is designed for individual researchers, small labs, or teachers who want a trouble-free system for managing and working with genetic data. Currently, students and teachers are using iFinch as part of the Bio-Rad Explorer Cloning and Sequencing kit.

I call iFinch "bioinformatics in a box." I've used iFinch in two bioinformatics courses and it's been pretty helpful. iFinch and FinchTV play nicely together and the combination works well for students.

You don't even have to get a computer for storing data or learn how to manage a database. We do all that for you and you use the system through the web. It's nice and painless.

**********NOTE***********

If you received an iFinch account from Bio-Rad, you will need to turn on your data processor before you begin uploading data.

Checking and starting your Finch data processor

1. Log into your iFinch account.

2. Find and select the Data Processor link in the System menu.

3. Look at the Data processor status.

4. If the Data processor has stopped, you will need to Restart it by selecting the Restart button. If you are a student, you will need to have an instructor log in and do this.

Once your data processor has been started, you can go ahead and upload your data as shown in the movie below.

Uploading your data
The first thing we do with iFinch is to put our data into the iFinch database. In the movie, you can see how we upload chromatograms through the web interface.

iFinch can store any kind of file, but it really shines when it comes to working with chromatograms or genotyping data.

If you have lots of files (more than a few 96 well plates), we do have other systems for uploading data. But, that's another post and another movie.

Wednesday, October 8, 2008

Road Trip: AB SOLiD Users Meeting

Wow! That's the best way to summarize my impressions from the Applied Biosystems (AB) SOLiD users conference last week, when AB launched their V3 SOLiD platform. AB claims that this system will be capable of delivering a human genome's worth of data for about $10,000 US.

Last spring, the race to the $1000 genome leaped forward when AB announced that they sequenced a human genome at 12-fold coverage for $60,000. When the new system ships in early 2009, that same project can be completed for $10,000. Also, this week others have claimed progress towards a $5000 human genome.

That's all great, but what can you do with this technology besides human genomes?

That was the focus of the SOLiD users conference. For a day and a half, we were treated to presentations from scientists and product managers from AB as well as SOLiD customers who have been developing interesting applications. Highlights are described below.

Technology Improvements:

Increasing Data Throughput - Practically everyone is facing the challenge of dealing with large volumes of data, and now we've learned the new version of the SOLiD system will produce even more. A single instrument run will produce between 125 million to 400 million reads depending on the application. This scale up is achieved by increasing the bead density on a slide, dropping the overall cost per individual read. Read lengths are also increasing, making it possible to get between 30 and 40 gigabases of data from a run. And, the amount of time required for each run is shrinking; not only can you get all of these data, you can do it again more quickly.

Increasing Sample Scale - Many people like to say, yes, the data is a problem, but at least the sample numbers are low, so sample tracking is not that hard.

Maybe they spoke too soon.

AB and the other companies with Next Gen technologies are working to deliver "molecular barcodes" that allow researchers to combine multiple samples on a single slide. This is called "multiplexing." In multiplexing, the samples are distinguished by tagging each one with a unique sequence, the barcode. After the run, the software uses the sequence tags to sort the data into their respective data sets. The bottom line is that we will go from a system that generates a lot of data from a few samples, to a system that generates even more data from a lot of samples.

Science:

What you can do with 100's of millions of reads: On the science side, there were many good presentations that focused on RNA-Seq and variant detection using the SOLiD system. Of particular interest was Dr. Gail Payne's presentation on the work, recently published in Genome Research, entitled "Whole Genome Mutational Profiling Using Next Generation Sequencing Technology." In the paper, the 454, Illumina, and SOLiD sequencing platforms were compared for their abilities to accurately detect mutations in a common system. This is one of the first head to head to head comparisons to date. Like the presidential debates, I'm sure each platform will be claimed to be the best by its vendor.

From the presentation and paper, the SOLiD platform does offer a clear advantage in its total throughput capacity. 454 showed showed the long read advantage in that approximately 1.5% more of the yeast genome studied was covered by 454 data than with shorter read technology. And, the SOLiD system, with its dibase (color space) encoding, seemed to provide higher sequence accuracy. When the reads were normalized to the same levels of coverage, a small advantage for SOLiD, can be seen.

When false positive rates of mutation detection were compared, SOLiD had zero for all levels of coverage (6x, 8x, 10x, 20x, 30x, 175x [full run of two slides]), Illumina had two false positives at 6x and 13x, and zero false positives for 19x and 44x (full run of one slide) coverage, and 454 had 17, six, and one false positive for 6x, 8x, and 11x (full run) coverage, respectively.

In terms of false negative (missed) mutations, all platforms did a good job. At coverages above 10x, none of the platforms missed any mutations. The 454 platform missed a single mutation at 6x and 8x coverage and Illumina missed two mutations at 6x coverage. SOLiD, on the other hand, missed four and five at 8x and 6x coverage, respectively.

What was not clear from the paper and data, was the reproducibility of these results. From what I can tell, single DNA libraries were prepared and sequenced; but replicates were lacking. Would the results change if each library preparation and sequencing process was repeated?

Finally, the work demonstrates that it is very challenging to perform a clean "apples to apples" comparison. The 454 and Illumina data were aligned with Mosiak and the SOLiD data were aligned with MapReads. Since each system produces different error profiles and the different software programs each make different assumptions about how to use the error profiles to align data and assess variation, the results should not be over interpreted. I do, however, agree with the authors, that these systems are well-suited for rapidly detecting mutations in a high throughput manner.

ChIP-Seq / RNA-Seq: On the second day, Dr. Jessie Gray presented work on combining ChIP-Seq and RNA-Seq to study gene expression. This is important work because it illustrates the power of Next Gen technology and creative ways in which experiments can be designed.

Dr. Gray's experiment was designed to look at this question: When we see that a transcription factor is bound to DNA, how do we know if that transcription factor is really involved in turning on gene expression?

ChIP-Seq allows us to determine where different transcription factors are bound to DNA at a given time, but it does not tell us whether that binding event turned on transcription. RNA-Seq tells us if transcription is turned on, after a given treatment or point in time, but it doesn't tell us which transcription factors were involved. Thus, if we can combine ChiP-Seq and RNA-Seq measurements, we can elucidate a cause and effect model and find where a transcription factor is binding and which genes it potentially controls.

This might be harder than it sounds:

As I listened to this work, I was struck by two challenges. On the computational side, one has to not only think about how to organize and process the sequence data into alignments and reduce those aligned datasets into organized tables that can be compared, but also how to create the right kind of interfaces for combining and interactively exploring the data sets.

On the biochemistry side, the challenges presented with ChIP-Seq reminded me of the old adage of trying to purify disapearase - "the more you purify the less there is." ChIP-Seq and other assays that involve multiple steps of chemical treatments and purification, produce vanishingly small amounts of material for sampling. The later challenge complicates the first challenge, because in systems where one works with "invisible" amounts of DNA, a lot of creative PCR, like "in gel PCR" is required to generate sufficient quantities of sample for measurement.

PCR is good for many things, including generating artifacts. So, the computation problem expands. A software system that generates alignments, reduces them to data sets that can be combined in different ways, and provides interactive user interfaces for data exploration, must also be able to understand common artifacts so that results can be quality controlled. Data visualizations must also be provided so that researchers can distinguish biological observations from experimental error.

These are exactly the kinds of problems that Geospiza solves.

Monday, October 6, 2008

Sneak Peak: Genetic Analysis From Capillary Electrophoresis to SOLiD

On October 7, 2008 Geospiza hosted a webinar featuring the FinchLab, the only software product to track the entire genetic analysis process, from sample preparation, through processing to analyzed results.

If you are as disappointed about missing it as we are about you missing, no worries. You can get the presentation here.

If you are interested in:

Learning about Next Gen sequencing applications
Seeing what makes the Applied Biosystems SOLiD system powerful for transcriptome analysis, CHiP-Seq, resequenicng experiments, and other applications
Understanding the flow of data and information as samples are converted into results
Overcoming the significant data management challenges that accompany Next Gen technologies
Setting up Next Gen sequencing in your core lab
Creating a new lab with Next Gen technologies

This webinar is for you!

In the webinar, we talked about the general applications of Next Gen sequencing and focused on using SOLiD to perform Digital Gene Expression experiments by highlighting mRNA Tag Profiling and whole transcriptome analysis. Throughout the talk we gave specific examples about collecting and analyzing SOLiD data and showed how the Geospiza FinchLab solves challenges related to laboratory setup and managing Next Gen data and analysis workflows.

Thursday, September 18, 2008

Road Trip: 454 Users Conference

Quiz: What can sequence small genomes in a single run? What can more than double or triple the EST database for any organism?
Answer: The Roche (454) Genome Sequencer FLX™ System.

Last week I had the pleasure of attending the Roche 454 users conference where the new release (Titanium) of the 454 sequencer was highlighted . This upgrade produces more, longer reads so that more than 600 million bases can be generated in each run. When compared to previous versions, the FLX Titanium produces about five times more data. The conference was well attended and outstanding with informative presentations on science, technology, and practical experiences.

In the morning of the first full day, Bill Farmerie, from the University of Florida, presented on how he got into DNA sequencing as a service and how he sees Next Gen sequencing changing the core lab environment. Back in 1998 he set out to establish a genomics service and talked to many groups about what to do. They told him two important things:

"Don't sweat the sequencing part - this is what we are trained for."
"Worry about information management - this we are not trained for."

From here, he discussed how Next Gen got started in his lab and related his experiences over the past three years and made these points:

The first two messages are still true. Sequencing gets solved, the problem is informatics.
DNA sequencing is expanding, more data are being produced faster at lower costs.
This is democratizing genomics - many groups now have access to high throughput technology that provides "genome center" capabilities.
The next bioinformatics challenge is enabling the research community, the groups with the sequencing projects, to make use of their data and information. This is not like Sanger, core labs need to deliver results with data.
The way to approach new problems and increase scale is to relieve bioinformatics staff of the burden of doing routine things so they can focus on developing novel applications.
To accomplish the above point, buy what you can and build what you have to.

Other speakers made similar points. The informatics challenge begins in the lab, but quickly becomes a major problem for the end researcher.

Bill has been following his points successfully for many years now. We starting working with him on his first genomics service and continue to support his lab with Next Gen. Our relationship with Bill and his group has been a great experience.

Other highlights from the meeting included:

A talk on continuous process improvements in DNA sequencing at the Broad Institute. Danielle Perrin presented work on how the Broad tackles process optimization issues during production to increase throughput, decrease errors, or save costs. In my perspective, this presentation really stresses the importance of coupling laboratory management with data analysis.

Multiple talks on microbial genomics. A strength of the 454 platform is how it generates long reads making this a platform of choice for sequencing smaller genomes and performing metagenomic surveys. We were also introduced to the RAST (Rapid Annotation using Subsystem Technology) server, an ideal tool for working with your completed genome or metagenome data set.

Many examples of how having millions of reads makes new gene expression and variation analysis discoveries possible when compared to other platforms like microarrays. In these talks speakers were occasionally asked which is better, long 454 reads or short reads from Illumina or SOLiD? The speakers typically said you need both, they complement each other.

The Wolly Mammoth. Steven Schuster from Penn State presented his and colleagues' work on sequencing mammoth DNA and its relatedness over 1000's of years. Next Gen is giving us a new "omics," Museomics.

And, of course, our poster demonstrating how FinchLab provides an end to end workflow solution for 454 DNA sequencing. In the poster (you have to click the image to get the BIG picture), we highlighted some new features coming out at the end of the month. These include the ability to collect custom data during lab processing, coupling Excel to FinchLab forms, and work on 454 data analysis. Now you will be able to enter the bead counts, agarose images, or whatever else you need to track lab details to make those continuous process improvements. Excel coupling makes data entry though FinchLab forms even easier. The 454 data analysis complements our work with Sanger, SOLiD, and Illumina data to make the FinchLab platform complete for any genomics lab.

Thursday, September 4, 2008

The Ends Justify the DNA

In Next Gen experiments, libraries of DNA fragments are created in different ways, from different samples, and sequenced in a massively parallel format. The preparation of libraries is a key step in these experiments. Understanding and validating the results requires knowing how the libraries were created and where the samples came from.

Background

In the last post, I introduced the concept that nearly all Next Gen sequencing applications are fundamentally quantitative assays that utilize DNA sequences as data points.

In Sanger sequencing, the new DNA molecules are synthesized, beginning at a single starting point determined by the primer. If the sequencing primer binds to heterogeneous molecules that contain the same binding site, for example, two slightly different viruses in a mixed population, a single read from Sanger sequencing could represent a mixture of many different molecules in the population, with multiple bases at certain positions. Next Gen sequencing, on the other hand, produces single reads from single individual molecules. This difference between the two methods allows one to simultaneously collect millions of sequence reads in a massively parallel format from single samples.

An additional benefit of massively parallel sequencing is that it eliminates the need to clone DNA, or create numerous PCR products. Although this change reduces the complexity of tracking samples, it increases the need to track experiments with greater detail and think about how we work with the data, how we analyze the data, and how we validate our observations to generate hypotheses, make discoveries, and identify new kinds of systematic artifacts.

Making Libraries

To better understand the significance of what a Next Gen experiment measures, we need to understand what DNA libraries are and how they are prepared. For this discussion we'll define a DNA library as a random collection of DNA molecules (or fragments) that can be separated and identified.

Before we do any kind of Next Gen experiment, we want to know something about the kinds of results we’d expect to see from our library. To begin, let’s consider what we would see from a genomic library consisting of EcoRI restriction fragments. If the digestion is complete, EcoRI will cut DNA between an G and A every time it encounters the sequence: 5'-GAATTC-3'. Every fragment in this library would have the sequence 5'-AATT-3' at every 5’ end. The average length of the fragments will be 4096 bases (~5 kbp). However, the distribution of fragment lengths follows Poisson statistics [1], so the actual library will have a few very large fragments (>> 5 kbp) and numerous small fragments

You may ask “why is this useful?”

Our EcoRI library example helps us to think about our expectations for Next Gen experimental results. That is, if we collect 10 million reads from a sample, what should we expect to see when we compare our data to reference data? We need to know what kinds of results to expect in order to determine if our data represent discoveries, or artifacts. Artifacts can be introduced during sample preparation, sample tracking, library preparation, or from the data collection instruments. If we can’t distinguish between artifacts and discoveries, the artifacts will slow us down and lead to risky publications.

In the case of our EcoRI digest, we can use our predictions to validate our results. If we collected sequences from the estimated 732,000 fragments and aligned the resulting data back to a reference genome, we would expect to see blocks of aligned reads at every one of the 732,000 restriction sites. Further, for each site there should be two blocks, one showing matches to the "forward" strand and one showing matches to the "reverse" strand.

We could also validate our data set by identifying the positions of EcoRI restriction sites in our reference data. What we'd likely see is that most things work perfectly. In some cases, however, we would also see alignments, but no evidence of a restriction site. In other cases, we would see a restriction site in the reference genome, but no alignments. These deviations would identify differences between the reference sequence and the sequence of the genome we used for the experiment. Those differences could either result from errors in the sequence of the reference data or a true biological difference. In the latter case, we would examine the bases and confirm the presence of a restriction length fragment polymorphism (RFLPs). From this example, we can see how we can define the expected results, and use that prediction to validate our data and determine whether our results correspond to interesting biology or experimental error.

Digital Gene Expression

Of course what we expect to see in the data is a function of the kind of experiment we are trying to do. To illustrate this point I'll compare two different kinds of Next Gen experiments that are both used to measure gene expression: Tag Profiling and RNA-Seq.

In Tag Profiling, mRNA is attached to a bead, converted to cDNA, and digested with restriction enzymes. The single fragments that remain attached to the beads are isolated and ligated to adaptor molecules, each one containing a type II restriction site. The fragments are further digested with the type II restriction enzyme and ligated to a sequencing adaptor to create a library of cDNA ends with 17 unique bases, or tags. Sequencing such a library will, in theory, yield a collection of reads that represents the population of RNA molecules in the starting material. Highly expressed genes will be represented by a larger number of tagged sequences than genes expressed at lower levels.

Both Tag profiling and RNA-Seq begin with an mRNA purification step, but after that point the procedures differ. Rather than synthesize a single full-length cDNA for every transcript, RNA-Seq uses random six-base primers to initiate cDNA synthesis at many different positions in each RNA molecule. Because these primers represent every combination of six base sequences, priming with these sequences produces a collection of overlapping cDNA molecules. Starting points for DNA synthesis will be randomly distributed, giving high sequence coverage for each mRNA in the starting material. Like Tag Profiling, genes expressed at high levels will have more sequences present in the data than genes expressed at low levels. Unlike Tag Profiling, any single transcript will produce several cDNAs aligning at different locations.

When the sequence data sets for Tag Profiling and RNA-seq are compared, we can see how the different methods for preparing the DNA libraries contrast with one another. In this example, Tag Profiling [2] and RNA-seq [3] data sets were aligned to human mRNA reference sequences (RefSeq, NCBI). The data were processed with Maq [4] and results displayed in FinchLab. In both cases, relative gene expression can be estimated by the number of sequences that align. If we know the origins of the libraries, the kinds of genes and their expression can give us confidence that the results fit the expression profile we expect. For example the RNA-seq data set is from mouse brain and we see genes at the top of the list that we expect to be expressed in this kind of tissue (last figure below).

The Tag Profiling and RNA-seq data sets also show striking differences that reflect how the libraries are prepared. In each report, the second column gives information about the distribution of alignments in the reference sequence. For Tag Profiling this is reported as "Tags." The number of Tags corresponds to the number of positions along the reference sequence where the tagged sequences align. In an ideal system, we would expect one tag per molecule of RNA. Next Gen experiments however, are very sensitive, so we can also see tags for incomplete digests. Additionally, sequencing errors, and high mismatch tolerance in the alignments can sometimes place reads incorrectly and give unusually high numbers of tags. When the data are more closely examined, we do see that the distribution of alignments follows our expectations more closely. That is, we generally see a high number of reads at one site, with the other tag sites showing a low number of aligned reads.

For RNA-seq, on the other hand, we display the second column (Read Map) as an alignment graph. For RNA-seq data, we expect that the number of alignment start points will be very high and randomly distributed throughout the sequence. We can see that this expectation matches our results by examining the thumbnail plots. In the Read Map graphs, the x-axis represents the gene length and the y-axis is the base density. Presently, all graphs have their data plotted on a normalized x-axis, so the length of an mRNA sequence corresponds to the density of data points in the graph. Longer genes have points that are closer together. You can also see gaps in the plots; some are internal and many are at the 3'-end of the genes. When the alignments are examined more closely, and we incorporate our knowledge of the exon structure or polyA addition sites, we can see that many of these gaps either show potential sites for alternative splicing or data annotation issues.

In summary, Next Gen experiments use DNA sequencing to identify and count molecules, from libraries, in a massively parallel format. The preparation of the libraries allows us to define expected outcomes for the experiment and choose methods for validating the resulting data. FinchLab makes use of this information to display data in ways that make it easy to quickly observe results from millions of sequence data points. With these high-level views and links to drill down reports and external resources, FinchLab provides researchers with the tools needed to determine whether their experiments are on track to creating new insights, or if new approaches are needed to avoid artifacts.

References

[1] The distribution of restriction enzyme sites in Escherichia coli. G A Churchill, D L Daniels, and M S Waterman. Nucleic Acids Res. 1990 February 11; 18(3): 589–597.

[2] Tag Profile dataset was obtained from Illumina.

[3] Mapping and quantifying mammalian transcriptomes by RNA-Seq. A Mortazavi, BA Williams, K McCue K, L Schaeffer, B Wold. Nat Methods. 2008 Jul;5(7):621-8. Epub 2008 May 30.
Data available at: http://woldlab.caltech.edu/rnaseq/

[4] Mapping short DNA sequencing reads and calling variants using mapping quality scores. H Li, J Ruan, R Durbin. Genome Res. 2008 Aug 19. [Epub ahead of print]

Tuesday, August 26, 2008

Maq in the Literature

Kudos to Heng Li and team at the Sanger Center. Today Genome Research published their paper on Maq. Maq (Mapping and Assembly with Quality) is an algorithm, developed at the Sanger Center, for assembling Next Gen reads to a reference sequence. MassGenomics sums up why they like Maq and we could not agree more. I also agree that Maq is better name name than mapASS.

One of the things we like best is how versatile the program is for Next Gen applications. Whether you are performing Tag Profiling, ChIP-Seq, RNA-Seq (transcriptome analysis) resequencing, or other applications, its output contains a wide variety of useful information as we will show in coming posts. If you want to know right now, give us a call and we'll show you why Geospiza, Sanger, Washington University and many others think Maq is a great place to start working with Next Gen data.

Wednesday, August 20, 2008

Next Gen DNA Sequencing Is Not Sequencing DNA

In the old days, we used DNA sequencing primarily to learn about the sequence and structure of a cloned gene. As the technology and throughput improved, DNA sequencing became a tool for investigating entire genomes. Today, with the exception of de novo sequencing, Next Gen sequencing has changed the way we use DNA sequences. We're no longer looking for new DNA sequences. We're using Next Gen technologies to perform quantitative assays with DNA sequences as the data points. This is a different way of thinking about the data and it impacts how we think about our experiments, data analysis, and IT systems.

In de novo sequencing, the DNA sequence of a new genome, or genes from the environment is elucidated. De novo sequencing ventures into the unknown. Each new genome brings new challenges with respect to interspersed repeats, large segmented gene duplications, polyploidy and interchromosomal variation. The high redundancy samples obtained from Next Gen technology lower the cost and speed this process because less time is required for getting additional data to fill in gaps and finish the work.

The other ultra high throughput DNA sequencing applications, on the other hand, focus on collecting sequences from DNA or RNA molecules for which we already have genomic data. Generally called "resequencing," these applications involve collecting and aligning sequence reads to genomic reference data. Experimental information is obtained by tabulating the frequency, positional information, and variation of the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or in populations, are compared in different ways to make discoveries and draw conclusions.

DNA sequences are information rich data points

EST (expressed sequence tag) sequencing was one of the first applications to use sequence data in a quantitative way. In EST applications, mRNA from cells was isolated, converted to cDNA, cloned, and sequenced. The data from an EST library provided both new and quantitative information. Because each read came from a single molecule of mRNA, a set of ESTs could be assembled and counted to learn about gene expression. The composition and number of distinct mRNAs from different kinds of tissues could be compared and used to identify genes that were expressed at different time points during development, in different tissues, and in different disease states, such as cancer. The term "tag" was invented to indicate that ESTs could also be used to identify the genomic location of mRNA molecules. Although the information from EST libraries was been informative, lower cost methods such as microarray hybridization and real time-PCR assays replaced EST sequencing over time, as more genomic information became available.

Another quantitative use of sequencing has been to assess allele frequency and identify new variants. These assays are commonly known as "resequencing" since they involve sequencing a known region of genomic DNA in a large number of individuals. Since the regions of DNA under investigation are often related to health or disease, the NIH has proposed that these assays be called "Medical Sequencing." The suggested change also serves to avoid giving the public the impression that resequencing is being carried out to correct mistakes.

Unlike many assay systems (hybridization, enzyme activity, protein binding ...) where an event or complex interaction is measured and described by a single data value, a quantitative assay based on DNA sequences yields a greater variety of information. In a technique analogous to using an EST library, an RNA library can be sequenced, and the expression of many genes can be measured at once, by counting the number of samples that align to a given position or reference. If the library is prepared from DNA, a count of the aligned reads could measure the copy number of a gene. The composition of the read data itself can be informative. Mismatches in aligned reads can help discern alleles of a gene, or members of a gene family. In a variation assay, reads can both assess the frequency of a SNP and discover new variation. DNA sequences could be used in quantitative assays to some extent with Sanger sequencing, but the cost and labor requirements prevented wide spread adoption.

Next Gen adds a global perspective and new challenges

The power of Next Gen experiments comes from sequencing DNA libraries in a massively parallel fashion. Traditionally, a DNA library was used to clone genes. The library was prepared by isolating and fragmenting genomic DNA, ligating the pieces to a plasmid vector, transforming bacteria with the ligation products, and growing colonies of bacteria on plates with antibiotics. The plasmid vector would allow a transformed bacterial cell to grow in the presence of an antibiotic so that transformed cells could be separated from other cells. The transformed cells would then be screened for the presence of a DNA insert or gene of interest through additional selection, colorimetric assay (e.g. blue / white), or blotting. Over time, these basic procedures were refined and scaled up in factory style production to enable high throughput shotgun sequencing and EST sequencing. A significant effort and cost in Sanger sequencing came from the work needed to prepare and track large numbers of clones, or PCR-products, for data linking and later retrieval to close gaps or confirm results.

In Next Gen sequencing, DNA libraries are prepared, but the DNA is not cloned. Instead other techniques are used to "separate," amplify, and sequence individual molecules. The molecules are then sequenced all at once, in parallel, to yield large global data sets in which each read represents a sequence from an individual molecule. The frequency of occurrence of a read in the population of reads can now be used to measure the concentration of individual DNA molecules. Sequencing DNA libraries in this fashion significantly lowers costs, and makes previously cost prohibitive experiments possible. It also changes how we need to think about and perform our experiments.

The first change is that preparing the DNA library is the experiment. Tag profiling, RNA-seq, small RNA, ChIP-seq, DNAse hypersensitivity, methylation, and other assays all have specific ways in which DNA libraries are prepared. Starting materials and fragmentation methods define the experiment and how the resulting datasets will be analyzed and interpreted. The second change is that large numbers of clones no longer need to be prepared, tracked, and stored. This reduces the number of people needed to process samples, and reduces the need for robotics, large number of thermocyclers, and other laboratory equipment. Work that used to require a factory setting can now be done in a single laboratory, or mailroom if you believe the ads.

Attention to details counts

Even though Next Gen sequencing gives us the technical capabilities to ask detailed and quantitative questions about gene structure and expression, successful experiments demand that we pay close attention to the details. Obtaining data that are free of confounding artifacts and accurately represent the molecules in a sample, demands good technique and a focus on detail. DNA libraries no longer involve cloning, but their preparation does require multiple steps performed over multiple days. During this process, different kinds of data ranging from gel images to discrete data values, may be collected and used later for trouble shooting. Tracking the experimental details requires that a system be in place that can be configured to collect information from any number and kind of process. The system also needs to be able to link data to the samples, and convert the information from millions of sequence data points to tables, graphics and other representations that match the context of the experiment and give a global view of how things are working. FinchLab is that kind of system.

Friday, August 8, 2008

ChIP-ing Away at Analysis

ChiP-Seq is becoming a popular way to study the interactions between proteins and DNA. This new technology is made possible by the Next Gen sequencing techniques and sophisticated tools for data management and analysis. Next Gen DNA sequencing provides the power to collect the large amounts of data required. FinchLab is the software system that is needed to track the lab steps, initiate analysis, and see your results.

In recent posts, we stressed the point that unlike Sanger sequencing, Next Gen sequencing demands that data collection and analysis be tightly coupled, and presented our initial approach of analyzing Next Gen data with the Maq program. We also discussed how the different steps (basecalling, alignment, statistical analysis) provide a framework for analyzing Next Gen data and described how these steps belong to three phases: primary, secondary, and tertiary data analysis. Last, we gave an example of how FinchLab can be used to characterize data sets for Tag Profiling experiments. This post expands the discussion to include characterization of data sets for ChIP-Seq.

ChIP-Seq

ChiP (Chromosome Immunoprecipitation) is a technique where DNA binding proteins, like transcription factors, can be localized to regions of a DNA molecule. We can use this method to identify which DNA sequences control expression and regulation for diverse genes. In the ChIP procedure, cells are treated with a reversible cross-linking agent to "fix" proteins to other proteins that are nearby, as well as the chromosomal DNA where they're bound. The DNA is then purified and broken into smaller chunks by digestion or shearing and antibodies are used to precipitate any protein-DNA complexes that contain their target antigen. After the immunoprecipitation step, unbound DNA fragments are washed away, the bound DNA fragments are released, and their sequences are analyzed to determine the DNA sequences that the proteins were bound to. Only few years ago, this procedure was much more complicated than it is today, for example, the fragments had to be cloned before they could be sequenced. When microarrays became available, a microarray-based technique called ChIP-on-chip made this assay more efficient by allowing a large number of precipitated DNA fragments to be tested in fewer steps.

Now, Next Gen sequencing takes ChIP assays to a new level [1]. In ChIP-seq the same cross linking, isolation, immunoprecipitation, and DNA purification steps are carried out. However, instead of hybridizing the resulting DNA fragments to a DNA array, the last step involves adding adaptors and sequencing the individual DNA fragments in parallel. When compared to microarrays, ChiP-seq experiments are less expensive, require fewer hands-on steps and benefit from the lack of hybridization artifacts that plague microarrays. Further, because ChIP-seq experiments produce sequence data, they allow researchers to interrogate the entire chromosome. The experimental results are no longer to the probes on the micoarray. ChIP-Seq data are better at distinguishing similar sites and collecting information about point mutations that may give insights into gene expression. No wonder ChIP-Seq is growing in popularity.

FinchLab

To perform a ChIP-seq experiment, you need to have a Next Gen sequencing instrument. You will also need to have the ability to run an alignment program and work with the resulting data to get your results. This is easier said than done. Once the alignment program runs, you might have to also run additional programs and scripts to translate raw output files to meaningful information. The FinchLab ChIP-seq pipeline, for example, runs Maq to generate the initial output, then runs Maq pileup to convert the data to a pileup file. The pileup file is then read by a script to create the HTML report, thumbnail images to see what is happening and "wig" files that can be viewed in the UCSC Genome Browser. If you do this yourself, you have to learn the nuances of the alignment program, how to run it different ways to create the data sets, and write the scripts to create the HTML reports, graphs, and wig files.

With FinchLab, you can skip those steps. You get the same results by clicking a few links to sort the data, and a few more to select the files, run the pipeline, and view the summarized results. You can also click a single link to send the data to the UCSC genome browser for further exploration.

Reference

ChIP-seq: welcome to the new frontier Nature Methods - 4, 613 - 614 (2007)

Thursday, July 31, 2008

Questions from our mailbag: How do I cite FinchTV?

One of the questions that appears in our mailbox from time to time concerns citing FinchTV or other Geospiza products. A quick search with Google Scholar for "FinchTV" finds 42 examples where FinchTV was cited in research publications. Most of the citations seem to follow the same conventions.

We recommend citing FinchTV as you would any other experimental software tool, instrument, or reagent. The citation should include the version of the program, the company, the location, and the web site. Other Geospiza products (FinchLab, Finch Suite, and iFinch) may be cited in similar manner.

In our case, a citation would most likely read:

FinchTV 1.4.0 (Geospiza, Inc.; Seattle, WA, USA; http://www.geospiza.com)

If you're not sure which version of FinchTV you're using, open the About menu. The version number will appear on the page.

It would also be a good idea to check with the journal where you plan to submit the article. Most journals have a set of instructions for authors where they provide example citations.

Wednesday, July 30, 2008

BioHDF at BOSC

The scale of Next Gen sequencing is only going to increase, hence we need to fundamentally change the way we work with Next Gen data. New software systems with scalable data models, APIs, software tools, and viewers are needed to support the very large datasets used by the applications that analyze Next Gen DNA sequence data.

That was the theme of a talk I presented at the BOSC (Bioinformatics Open Source Conference) meeting that preceded ISMB (Intelligent Systems for Molecular Biology) in Toronto, Canada, July 19th. You can get the slides from the BOSC site. At the same time, we posted a blog on Genographia, a next-generation genomics community web site devoted to Next Gen sequencing discussions and idea sharing. The key points are summarized below.

Motivation

The BioHDF project is motivated by the fact that the next and future generations of data collection technologies, like DNA sequencing, are creating ever increasing amounts of data. Getting meaningful information from these data require that multiple programs be used in complex processes. Current practices for working with these data create many kinds of challenges, ranging from managing large numbers of files and formats to having the computation power and bandwidth to make calculations and move data around. These practices have a high cost in terms of storage, CPU, and bandwidth efficiency. In addition, they require significant human effort in understanding idiosyncratic program behavior and output formats

Is there a better way?

Many would agree that if we could reduce the number of file formats, avoid data duplication, and improve how we access and process data, we could develop better performing and more interoperable applications. Doing so requires improved ways of storing data and making it accessible to programs. For a number of years we have thought about these goals might be accomplished and looked to other data-intensive fields to see how others have solved these problems. Our search ended when we found HDF (hierarchical data format), a standard file format and library used in the physical and earth sciences.

BioHDF

HDF5 can be used in many kinds of bioinformatics applications. For specialized areas, like DNA sequencing, domain specific extensions will be needed. BioHDF is about developing those extensions, through community support, to create a file format and accompanying library of software functions that are needed to build the scalable software applications of the future. More will follow, if you are interested contact me: todd at geospiza.com.

Monday, July 14, 2008

Maq Attack

Maq (Mapping and Assembly with Quality) is an algorithm, developed at the Sanger center, for assembling Next Gen reads onto a reference sequence. Since Maq is widely used for working with Next Generation DNA sequence data, we chose to include support for Maq in our upcoming release of FinchLab. In this post, we will discuss integrating secondary analysis algorithms like Maq with the primary analysis and workflows in FinchLab.

Improving laboratory processes through immediate feedback

The cost to run Next Generation DNA sequencing instruments and the volume of data produced make it important for labs to be able to monitor their processes in real time. In the last post, I discussed how labs can get performance data and accomplish scientific goals during the three stages of data analysis. To quickly review: Primary data analysis involves converting image data to sequence data. Secondary data analysis involves aligning the sequences from the primary data analysis to reference data to create data sets that are used to develop scientific information. An example of a secondary analysis step would be assembling reads into contigs when new genomes are sequenced. Unlike the first two stages, where much of the data is used to detect errors and measure laboratory performance, the last stage is focused on the science. In the Tertiary data analyses genomes are annotated, and data sets are compared. Thus the tertiary analyses are often the most important in terms of gaining new insights. The data used in this phase must be vetted first. It must be high quality and free from systemic errors.

The companies producing Next Gen systems recognize the need to automate primary and secondary analysis. Consequently, they provide some basic algorithms along with the Next Gen instruments. Although these tools can help a lab get started, many labs have found that significant software development is needed on top of the starting tools if they are to fully automate their operation, translate output files into meaningful summaries, and give users easy access to the data. The starter kits from the instrument vendors can also be difficult to adapt when performing other kinds of experiments. Working with Next Gen systems typically means that you will have deal with a lot of disconnected software, a lack of user interfaces, and diverse new choices for algorithms when it comes to getting your work done.

FinchLab and Maq in an integrated system

The Geospiza FinchLab integrates analytical algorithms such as Maq into a complete system that encompasses all the steps in genetic analysis. Our Samples to Results platform provides flexible data entry interfaces to track sample meta data. The laboratory information management system is user configurable so that any kind of genetic analysis procedure can be run and tracked and most importantly provides tight linkage between samples, lab work, and their resulting data. This system makes it easy to transition high quality primary results to secondary data analysis.

One of the challenges with Next Gen sequencing has been choosing an algorithm for secondary analysis. Secondary data analysis needs to be adaptable to different technology platforms and algorithms for specialized sequencing applications. FinchLab meets this need because it can accommodate multiple algorithms when it comes to secondary and tertiary analysis. One of these algorithms is Maq. Maq attractive because it can be used in diverse applications where reads are aligned to a reference sequence. Among these are Transcriptomics (Tag Profiling, EST analysis, small RNA discovery), Promoter Mapping (CHiP-Seq, DNAase hypersensitivity), Methylation analysis, and Variation Analyses (SNP, CNV). Maq offers a rich set of output files so it can be used to quickly provide an overview of your data and help you verify that your experiment is on track before you invest serious time in tertiary work. Finally Maq is being actively developed and improved and is open-source so it is easy to access and use regardless of affiliation.

Maq and other algorithms are integrated into FinchLab through the FinchLab Remote Analysis Server (RAS). RAS is a lightweight job tracking system that can be configured to run any kind of program in different computing environments. RAS communicates with FinchLab to get the data and return the results. Data analyses are run in FinchLab by selecting the sequence file(s), clicking a link to go to a page and select the analysis method(s) and reference data sets, and then clicking a button to start the work. RAS tracks the details of data processing and sends information back to FinchLab so that you can always see what happening through the web interface.

A basic FinchLab system includes the RAS and pipelines for running Maq in two ways. The first is Tag Profiling and Expression Analysis. In this operation, Maq output files are converted to gene lists with links to drill down into the data and NCBI references. The second option it to use Maq in a general analysis procedure where all the output files are made available. In the next months, new tools will convert more of these files into output that can be added to genome browsers and other tertiary analysis systems.

A final strength of RAS is that it produces different kinds of log files to track potential errors. These kinds of files are extremely valuable in trouble-shooting and fixing problems. Since Next Gen technology is new and still in constant flux, you can be certain that unexpected issues will arise. Keeping the research on track is easier when informative RAS logging and reports help to diagnose and resolve issues quickly. Not only can FinchLab help with Next Gen assays, help solve those unexpected Next Gen problems, multiple Next Gen algorithms can be integrated into FinchLab to complete the story.

Wednesday, June 25, 2008

Finch 3: Getting Information Out of Your Data

Geospiza's tag line "From Sample to Results" represents the importance of capturing information from all steps in the laboratory process. Data volumes are important and lots of time is being spent discussing the overwhelming volumes of data produced by new data collection technologies like Next Gen sequencers. However, the real issue is not how you are going to store the data, rather it is what are you going to do with it? What do your data mean in the context of your experiment?

The Geospiza FinchLab software system supports the entire laboratory and data analysis workflow to convert sample information into results. What this means is that the system provides a complete set of web-based interfaces and an underlying database to enter information about samples and experiments, track sample preparation steps in the laboratory, link the resulting data back to samples, and process the data to get biological information. Previous posts have focused on information entry, laboratory workflows, and data linking. This post will focus on how data are processed to get biological information.

The ultra-high data output of Next Gen sequencers allows us to use DNA sequencing to ask many new kinds of questions about structural and nucleotide variation and measure several indicators of expression and transcription control on a genome-wide scale. The data produced consists of images, signal intensity data, quality information, and DNA sequences and quality values. For each data collection run, the total collection of data and files can be enormous and can require significant computing resources. While all of the data have to be dealt with in some fashion, some of the data have long-term value while other data are only needed in the short term. The final scientific results will often be produced by comparing data sets created from the DNA sequences and their comparison to reference data.

Next Gen data are processed in three phases.

Next Gen data workflows involve three distinct phases of work: 1. Data are collected from control and experimental samples. 2. Sequence data obtained from each sample are aligned to reference sequence data, or data sets to produce aligned data sets 3. Summaries of the alignment information from the aligned data sets are compared to produce scientific understanding. Each phase has a discrete analytical process and we, and others, call these phases primary data analysis, secondary data analysis and tertiary data analysis.

Primary data analysis involves converting image data to sequence data. The sequence data can be in familiar "ACTG" sequence space or less familiar color space (SOLiD) or flow space (454). Primary data analysis is commonly performed by software provided by the data collection instrument vendor and it is the first place where quality assessment about a sequencing run takes place.

Secondary data analysis creates the data sets that will be further used to develop scientific information. This step involves aligning the sequences from the primary data analyses to reference data. Reference data can be complete genomes, subsets of genomic data like expressed genes, or individual chromosomes. Reference data are chosen in an application specific manner and sometimes multiple reference data sets will be used in an iterative fashion.

Secondary data analysis has two objectives. The first is to determine the quality of the DNA library that was sequenced, from a biological and sample perspective. The primary data analysis supplies quality measurements that can used to determine if the instrument ran properly, or whether the density of beads or clusters were at their optimum to deliver the highest number of high quality reads. However, those data do not tell you about the quality of the samples. Answering questions about sample quality, such as did the DNA library contain systematic artifacts such as sequence bias? Were there high numbers of ligated adaptors or incomplete restriction enzyme digests, or any other factors that would interfere with interpreting the data? These kinds of questions are addressed in the secondary data analysis by aligning your reads to the reference data and seeing that your data make sense.

The second objective of secondary data analysis is to prepare the data sets for tertiary analysis where they will be compared in an experimental fashion. This step involves further manipulation of alignments, typically expressed in very large hard to read algorithm specific tables, to produce data tables that can be consumed by additional software. Speaking of algorithms, there is a large and growing list to choose from. Some are general purpose and others are specific to particular applications, we'll comment more on that later.

Tertiary data analysis represents the third phase of the Next Gen workflow. This phase may involve a simple activity like viewing a data set in a tool like a genome browser so that the frequency of tags can be used to identify promoter sites, patterns of variation, or structural differences. In other experiments, like digital gene expression, tertiary analysis can involve comparing different data sets in a similar fashion to microarray experiments. These kinds of analyses are the most complex; expression measurements need to be normalized between data sets and statistical comparisons need to be made to assess differences.

To summarize, the goal of primary and secondary analysis is to produce well-characterized data sets that can be further compared to obtain scientific results. Well-characterized means that the quality is good for both the run and the samples and that any biologically relevant artifacts are identified, limited, and understood. The workflows for these analyses involve many steps, multiple scientific algorithms, and numerous file formats. The choices of algorithms, data files, data file formats, and overall number of steps depend the kinds of experiments and assays being performed. Despite this complexity there are standard ways to work with Next Gen systems to understand what you have before progressing through each phase.

The Geospiza FinchLab system focuses on helping you with both primary and secondary data analysis.