Friday, April 25, 2008

Managing Digital Gene Expression Workflows with FinchLab

Last Wednesday (4/23) Illumina hosted a Geospiza presentation showing how FinchLab supports mRNA Tag Profiling experiments. We had a great turnout, and the presentation is posted on the Illumina web site.

In the webinar we talked about:
  • Next Gen sequencing applications
  • How the Illumina Genome Analyzer makes mRNA Tag Profiling more sensitive, illustrated with features of mRNA Tag Profiling data sets in FinchLab
  • Setting up and tracking laboratory workflows with FinchLab
  • Why it is important to link the laboratory work and data analysis work
  • Setting up data analysis and reviewing results with FinchLab
  • Using hosted solutions to overcome the significant data management challenges that accompany Next Gen technologies
Over the coming weeks and months we'll explore the above points through multiple posts. In the meantime, get the presentation and enjoy.

From Sample to Results: Managing Illumina Data Workflow with FinchLab

Monday, April 21, 2008

Sneak Peek: Managing Next Gen Digital Gene Expression Workflows

This Wednesday, April 23rd, Illumina will host a webinar featuring the Geospiza FinchLab.

If you are interested in:
  • Learning about Next Gen sequencing applications
  • Seeing how the Illumina Genome Analyzer makes mRNA Tag Profiling more sensitive
  • Understanding the flow of data and information as samples are converted into results
  • Overcoming the significant data management challenges that accompany Next Gen technologies
  • Setting up Next Gen sequencing in your core lab
  • Creating a new lab with Next Gen technologies
This webinar is for you!

In the webinar, we will talk about the general applications of Next Gen sequencing and focus on using the Illumina Genome Analyzer to perform Digital Gene Expression experiments by highlighting mRNA Tag Profiling. Throughout the talk we will give specific examples about collecting and analyzing tag profiling data and show how the Geospiza FinchLab solves challenges related to laboratory setup and managing Next Gen data and analysis workflows.

Wednesday, April 16, 2008

Expectations Set the Rules

Genetic analysis workflows are complex. Biology is non-deterministic, so we continually experience new problems. Lab processes and our data have natural uncertainty. These factors conspire to make our world rich in variability and our processes less than perfect.

That keeps things interesting.

In a previous post, I was able to show how sequence quality values could be used to summarize the data for a large resequencing assay. Presenting "per read" quality values in a grid format allowed us to visualize samples that had failed as well as observe that some amplicons contained repeats that led to sequencing artifacts. We also were able to identify potential sample tracking issues and left off with an assignment to think about how we might further test sample tracking in the assay.

When an assay is developed there are often certain results that can be expected. Some results are defined explicitly with positive and negative controls. We can also use assay results to test that the assay is producing the right kinds of information. Do the data make sense? Expectations can be derived from the literature, an understanding of statistical outcomes, or internal measures.

Genetic assays have common parts

A typical genetic resequencing assay is developed from known information. The goal is to collect sequences from a defined region of DNA for a population of individuals (samples) and use the resulting data to observe the frequency of known differences (variants) and identify new patterns of variation. Each assay has three common parts:

Gene Model - Resequencing and genotyping projects involve comparative analysis of new data (sequences, genotypes) to reference data. The Gene Model can be a chromosomal region or specific gene. A well-developed model will include all known genotypes, protein variations, and phenotypes. The Gene Model represents both community (global) and laboratory (local) knowledge.

Assay Design - The Assay Design defines the regions in the Gene Model that will be analyzed. These regions, typically prepared by PCR, are bounded by unique DNA primer sequences. The PCR primers have two parts: one part is complementary to the reference sequence (black in the figure), the other part is "universal" and is complementary to a sequencing primer (red in the figure). The study also includes information about patient samples, such as their ethnicity, collection origin, and phenotypes associated with the gene(s) under study.

Experiments / Data Collection / Analysis - Once the study is designed and materials arrive, samples are prepared for analysis. PCR is used to amplify specific regions for sequencing or genotyping. After a scientist is confident that materials will yield results, data collection begins. Data can be collected in the lab or the lab can outsource their sequencing to core labs or service companies. When data are obtained, they are processed, validated, and compared to reference data.

Setting expectations

A major challenge for scientists doing resequencing and genotyping projects arises when trying to evaluate data quality and determine the “next steps.” Rarely does everything work. We've already talked about read quality, but there are also the questions of whether the data are mapping to their expected locations, and whether the frequencies of observed variation are expected. The Assay Design can be used to verify experimental data.

The Assay Design tells us where the data should align and how much variation can be expected. For example, if the average SNP frequency is 1/1300 bases, and an average amplicon length is 580 bases, we should expect to observe one SNP for every two amplicons. Furthermore, in reads where a SNP may be observed, we will see the difference in a subset of the data because some, or most, of the reads will have the same allele as the reference sequence.
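This expectation is easy to check with a few lines of Python. A back-of-the-envelope sketch using the averages quoted above (not FinchLab code):

```python
# Expected variant yield, assuming the averages quoted in the text:
# one SNP per 1300 bases, and 580-base amplicons.
snp_rate = 1 / 1300          # average SNP frequency (per base)
amplicon_len = 580           # average amplicon length (bases)

snps_per_amplicon = snp_rate * amplicon_len   # ~0.45
print(f"Expected SNPs per amplicon: {snps_per_amplicon:.2f}")
print(f"Roughly one SNP per {1 / snps_per_amplicon:.1f} amplicons")
```

So about one SNP for every two amplicons, as stated.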

To test our expectations for the assay, the 7488 read data set is summarized in a way that counts the frequency of disagreements between read data and their reference sequence. The graph below shows a composite of read discrepancies (blue bar graph) and average Q20/rL, Q30/rL, Q40/rL values (colored line graphs). Reads are grouped according to the number of discrepancies observed (x-axis). For each group, the count of reads (bar height) and average Q20/rL (green triangles), Q30/rL (yellow squares), and Q40/rL (purple circles) are displayed.
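The grouping behind the graph can be sketched in Python. Here `reads` is a toy stand-in for real alignment records, so the names and values are illustrative only:

```python
from collections import defaultdict

def summarize_by_discrepancies(reads):
    """Group aligned reads by discrepancy count and report, for each
    group, the read count and the mean Q20/rL.  Each element of `reads`
    is a (discrepancy_count, q20_over_read_length) pair."""
    groups = defaultdict(list)
    for n_disc, q20_rl in reads:
        groups[n_disc].append(q20_rl)
    return {n: (len(vals), sum(vals) / len(vals))
            for n, vals in sorted(groups.items())}

# Toy data: most reads align perfectly with high quality.
reads = [(0, 0.95), (0, 0.90), (1, 0.80), (1, 0.75), (4, 0.40)]
for n_disc, (count, mean_q) in summarize_by_discrepancies(reads).items():
    print(n_disc, count, round(mean_q, 3))
```

The bar heights come from the counts, and the line graphs from the per-group quality means.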

In the 7488 read data set, 92% (6914) of the reads gave alignments. Of the aligned data, 82% of the reads had between 0 and 4 discrepancies. If we were to pick which traces to review and which samples to redo, we would likely focus our review on the data in this group and queue the rest (18%) for redos to see if we could improve the data quality.

Per our previous prediction, most of the data (5692 reads) do not have any discrepancies. We also observe that the number of discrepancies increases as the overall data quality decreases. This is expected because the quality values are reflecting the uncertainty (error) in the data.

Spotting tracking issues

We can also use our expectations to identify sample tracking issues. Once an assay is defined, the positions of all of the PCR primers are known, hence we should expect that our sequence data will align to the reference sequence in known positions. In our data set, this is mostly true. Similar to the previous quality plots of samples and amplicons, an alignment "quality" can be defined and displayed in a table where the rows are samples and columns are amplicons. Each sample has two rows (one forward and one reverse sequence). If the cells are colored according to alignment start positions (green for expected, red for unexpected, white for no alignment) we can easily spot which reads have an "unexpected" alignment. The question then becomes, where/when did the mixup occur?
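The cell-coloring rule is trivial to express in code. A minimal sketch with made-up names and an arbitrary tolerance, not the FinchLab implementation:

```python
def alignment_color(expected_start, observed_start, tolerance=5):
    """Color a heat-map cell by alignment start position: green when a read
    aligns where the assay design predicts, red when it aligns somewhere
    unexpected, white when it produced no alignment at all."""
    if observed_start is None:
        return "white"            # no alignment
    if abs(observed_start - expected_start) <= tolerance:
        return "green"            # expected position
    return "red"                  # unexpected position -- possible mixup

print(alignment_color(1450, 1452))   # green
print(alignment_color(1450, 8031))   # red
print(alignment_color(1450, None))   # white
```

Scanning a table of these colors makes an unexpected red cell jump out immediately.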

From these kinds of analyses we can get a feel for whether a project is on track and whether there are major issues that will make our lives harder. In future posts I will comment on other kinds of measures that can be made and show you how this work can be automated in FinchLab.

Monday, April 14, 2008

Digital Gene Expression with Next Gen Sequencing

Next Gen Sequencing is changing how we approach problems ranging from whole genome shotgun sequencing, to variation analysis, to gene expression, to structural genomics. Next week, April 23rd, Geospiza will present a webinar on managing Digital Gene Expression experiments and data with FinchLab. The webinar is hosted by Illumina as part of their ongoing webinar series on Next Gen sequencing.


Next Gen sequencers enable researchers to perform new and exciting experiments like digital gene expression. Next Gen sequencers, however, also expose researchers to unprecedented experimental data volume and the need for new tools to support these projects. A single run of the Illumina Genome Analyzer, for example, can generate terabytes of data and hundreds of thousands of files. To manage these projects effectively, researchers need new software systems that quickly track samples, provide access to and analysis of the key results files produced by these runs, and let them focus on the science rather than the IT.

In this webinar, Geospiza will demonstrate how the FinchLab Next Gen Edition workflow software can be used to track samples, quality review data, and characterize the biological significance of an Illumina dataset while streamlining the entire process from sample to result for a Digital Gene Expression experiment.

Hope to see you there.

Tuesday, April 8, 2008

Exceptions are the Rule

Genetic analysis workflows are complex. You can expect that things will go wrong in the laboratory. Biology also manages to interfere and make things harder than you think they should be. Your workflow management system needs to show the relevant data, allow you to observe trends, and have flexible points where procedures can be repeated.

In the last few posts, I introduced genetic analysis workflows, concepts about lab and data workflows, and discussed why it is important to link the lab and data workflows. In this post I expand on the theme and show how a workflow system like the Geospiza FinchLab can be used to troubleshoot laboratory processes.

First, I'll review our figure from last week. Recall that it summarized 4608 paired forward / reverse sequence reads. Samples are represented by rows and amplicons by columns, so that each cell represents a single read from a sample and one of its amplicons. Color is used to indicate quality, with different colors showing the number of Q20 bases divided by the read length (Q20/rL). Green is used for values between 0.60 and 1.00, blue for values between 0.30 and 0.59, and red for values less than 0.29. The summary showed patterns that indicated lab failures and biological issues. You were asked to figure them out. Eric from seqanswers (a cool site for Next Gen info) took a stab at this and got part of the puzzle solved.

Sample issues

Rows 1,2 and 7,8 show failed samples. We can spot this because of the red color across all the amplicons. Either the DNA preps failed to produce DNA, or something interfered with the PCR. Of course there are those pesky working reactions for both forward and reverse sequence in sample 1 column 8. My first impression is that there is a tracking issue. The sixth column also has one reaction that worked. Could this result indicate a more serious problem in sample tracking?

Amplicon issues

In addition to the red rows, some columns show lots of blue spots; these columns correspond to amplicons 7, 24 and 27. Remember that blue is an intermediate quality. An intermediate quality could be obtained if part of the sequence is good and part of the sequence is bad. Because the columns represent amplicons, a pattern in a column likely indicates a systematic issue for that amplicon. For example, in column 7, all of the data are intermediate quality. Columns 24 and 27 are more interesting because the striping pattern indicates that one sequencing reaction yields data with intermediate quality while the other looks good. Wouldn't it be great if we could drill down from this pattern and see a table of quality plots and also get to the sequence traces?

Getting to the bottom

In FinchLab we can drill down and view the underlying data. The figure below summarizes the data for amplicon 24. The panel on the left is the expanded heat map for the data set. The panel on the right is a folder report summarizing the data from 192 reads for amplicon 24. It contains three parts: an information table that provides an overview of the details for the reads; a histogram plot that counts how many reads have a certain range of Q20 values; and a data table that summarizes each read in a row containing its name, the number of edit revisions, its Q20 and Q20/rL values, and a thumbnail quality plot showing the quality values for each base in the read. In the histogram, you can see two distinct peaks. About half the data have low Q20 values and half have high Q20 values, producing the striping pattern in the heat map. The data table shows two reads; one is the forward sequence and the other is its "reverse" pair. These data were brought together using the table's search function in the "finder" bar. Note how the reads could fit together if one picture was reversed.

Could something in the sequence be interfering with the sequencing reaction?

To explore the data further, we need to look at the sequences themselves. We can do this by clicking the name and viewing the trace data online in our web browser, or we can click the FinchTV icon and view the sequence in FinchTV (bottom panel of the figure above). When we do this for the top read (left-most trace) we see that, sure enough, there is a polyT tract that we are not getting through. During PCR such regions can cause "drop outs" and result in mixtures of molecules that differ in size by one or two bases. A hallmark of such a problem is a sudden drop in data quality at the end of the polynucleotide tract, because the mixture of molecules creates a mess of mixed bases. This explanation is confirmed by the other read. When we view it in FinchTV (right-most trace) we see poor data at the end of the read. Remember these data are reversed relative to the first read, so when we reverse complement the trace (middle trace), we see that it "fits" together with the first read. A problem for such amplicons is that we now have only single-stranded coverage. Since this problem occurred at the end of the read, half of the data are good and the other half are poor quality. If the problem had occurred in the middle of the read, all of the data would show an intermediate quality, like amplicon 7.
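Spotting candidate trouble regions like this polyT tract can also be automated. A minimal sketch; the length threshold of 8 is an arbitrary choice for illustration:

```python
import re

def find_homopolymers(seq, min_len=8):
    """Return (start, base, length) for each homopolymer tract -- e.g. a
    polyT run -- long enough to risk PCR 'drop outs' and the sudden
    quality crash described above."""
    return [(m.start(), m.group(0)[0], len(m.group(0)))
            for m in re.finditer(r"A+|C+|G+|T+", seq)
            if len(m.group(0)) >= min_len]

print(find_homopolymers("ACGTTTTTTTTTTACG"))  # [(3, 'T', 10)]
```

Running a check like this against reference amplicon sequences at design time flags the amplicons likely to give single-stranded coverage.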

In genetic analysis data quality values are an important tool for assessing many lab and sample parameters. In this example, we were able to see systematic sample failures and sequence characteristics that can lead to intermediate quality data. We can use this information to learn about biological issues that interfere with analysis. But what about our potential tracking issue?

How might we determine if our samples are being properly tracked?

Friday, April 4, 2008

Lab work without data analysis and management is doo doo

As we begin to contemplate next generation sequence data management, we can use Sanger sequencing to teach us important lessons. One of these is the value of linking laboratory and data workflows so that we can view information in the context of our assays and experiments.

I have been fortunate to hear J. Michael Bishop speak on a couple of occasions. He ended these talks by quoting one of his biochemistry mentors: "genetics without biochemistry is doo doo." In a similar vein, lab work without data analysis and management is doo doo. That is, when you separate the lab from the data analysis, you have to work through a lot of doo to figure things out. Without a systematic way to view summaries of large data sets, the doo is overwhelming.

To illustrate, I am going to share some details about a resequencing project we collaborated on. We came to this project late, so much of the data had been collected, and there were problems, lots of doo. Using Finch however, we could quickly organize and analyze the data, and present information in summaries with drill downs to the details to help troubleshoot and explain observations that were seen in the lab.

10,686 sequence reads: forward / reverse sequences from 39 amplicons from 137 individuals

The question being asked in this project was: are there new variants in a gene that are related to phenotypes observed in a specialized population? This is the kind of question medical researchers ask frequently. Typically they have a unique collection of samples that come from a well-understood population of individuals. Resequencing is used to interrogate the samples for rare variants, or genotypes.

In this process, we purify DNA from sample material (blood), and use PCR with exon specific probes to amplify small regions of DNA within the gene. The PCR primers have regions called universal adaptors. Our sequencing primers will bind to those regions. Each PCR product, called an amplicon, is sequenced twice, once from each strand to give double coverage of the bases.

When we do the math, we will have to track the DNA for 137 samples and 5343 amplicons. Each amplicon is sequenced at least twice, giving us 10,686 reads. From a physical materials point of view, that means 137 tubes with sample; 56 96-well plates for PCR; and 112 96-well plates for sequencing. In a 384-well format we could have used 14 plates for PCR and 28 plates for sequencing. For a genome center this level of work is trivial, but for a small lab it is significant, and things can go wrong. Indeed, since not all the work is done in a single lab, the process can be even more complex. And you need to think about how you would lay this out - 96 does not divide by 39 very well.
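The plate arithmetic is worth writing down explicitly; a quick sketch:

```python
import math

samples, amplicons = 137, 39
templates = samples * amplicons   # 5343 PCR products to track
reads = templates * 2             # forward + reverse = 10,686 reads

def plates_needed(items, wells):
    """Number of plates required, rounding up partial plates."""
    return math.ceil(items / wells)

print(plates_needed(templates, 96), plates_needed(reads, 96))    # 56 112
print(plates_needed(templates, 384), plates_needed(reads, 384))  # 14 28
```

Note the rounding up: 96 does not divide 5343 evenly, which is exactly the layout headache mentioned above.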

From a data perspective, we can use sequence quality values to identify potential laboratory and biological issues. The figure below summarizes 4608 reads. Each pair of rows is one sample (forward / reverse sequence pairs, alternating gray and white - 48 total). Each column is an amplicon. Each cell in the table represents a single read from an amplicon and sample. Color is used to indicate quality. In this analysis, quality is defined as the ratio of Q20 to read length (Q20/rL), which works very well for PCR amplicons. The better the data, the closer this ratio is to one. In the table below, green indicates Q20/rL values between 0.60 and 1.00, blue indicates values between 0.30 and 0.59, and red indicates Q20/rL values less than 0.29. The summary shows patterns that, as we will learn next week, show lab failures and biological issues. See if you can figure them out.
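The color binning used in the figure reduces to a tiny function (thresholds copied from the text; the function name is mine):

```python
def quality_color(q20, read_length):
    """Bin a read's Q20/rL ratio into the heat-map colors: green for
    0.60-1.00, blue for 0.30-0.59, red below that."""
    ratio = q20 / read_length if read_length else 0.0
    if ratio >= 0.60:
        return "green"
    if ratio >= 0.30:
        return "blue"
    return "red"

print(quality_color(600, 800))   # green (0.75)
print(quality_color(300, 800))   # blue  (0.375)
print(quality_color(100, 800))   # red   (0.125)
```

Applying this to every read and laying the results out as samples-by-amplicons produces the grid described above.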

Wednesday, April 2, 2008

Working with Workflows

Genetic analysis workflows involve both complex laboratory and data analysis and manipulation procedures. A good workflow management system not only tracks processes, but simplifies the work.

In my last post, I introduced the concept of workflows in describing the issues one needs to think about as they prepare their lab for Next Gen sequencing. To better understand these challenges, we can learn from previous experience with Sanger sequencing in particular and genetic assays in general.

As we know, DNA sequencing serves many purposes. New genomes and genes in the environment are characterized and identified by De Novo sequencing. Gene expression can be assessed by measuring Expressed Sequence Tags (ESTs), and DNA variation and structure can be investigated by resequencing regions of known genomes. We also know that gene expression and genetic variation can be studied with multiple technologies such as hybridization, fragment analysis, and direct genotyping, and it is desirable to use multiple methods to confirm results. Within each of these general applications and technology platforms, specific laboratory and bioinformatics workflows are used to prepare samples, determine data quality, study biology, and predict biological outcomes.

The process begins in the laboratory.

Recently I came across a Wikipedia article on DNA sequencing that had a simple diagram showing the flow of materials from samples to data. I liked this diagram, so I reproduced it, with modifications. We begin with the sample. A sample is a general term that describes a biological material. Sometimes, like when you are at the doctor, these are called specimens. Since biology is all around and in us, samples come from anything that we can extract DNA or RNA from. Blood, organ tissue, hair, leaves, bananas, oysters, cultured cells, feces, you-can-imagine-what-else, can all be samples for genetic analysis. I know a guy who uses a .22 to collect the apical meristems from trees to study poplar genetics. Samples come from anywhere.

With our samples in hand, we can perform genetic analyses. What we do next depends on what we want to learn. If we want to sequence a genome we're going to prepare a DNA library by randomly shearing the genomic DNA and cloning the fragments into sequencing vectors. The purified cloned DNA templates are sequenced and the data we obtain are assembled into larger sequences (contigs) until, hopefully, we have a complete genome. In resequencing and other genetic assays, DNA templates are prepared from sample DNA by amplifying specific regions of a genome with PCR. The PCR products, amplicons, are sequenced and the resulting data are compared to a reference sequence to identify differences. Gene expression (EST and hybridization) analysis follows similar patterns except that RNA is purified from samples and then converted to cDNA using RT-PCR (Reverse Transcriptase PCR, not Real Time PCR - that's a genetic assay).

From a workflow point of view, we can see how the physical materials change throughout the process. Sample material is converted to DNA or RNA (nucleic acids), and the nucleic acids are further manipulated to create templates that are used for the analytical reaction (DNA sequencing, fragment analysis, RealTime-PCR, ...). As the materials flow through the lab, they're manipulated in a variety of containers. A process may begin with a sample in a tube, use a petri plate to isolate bacterial colonies, 96-well plates to purify DNA and perform reactions, and 384-well plates to collect sequence data. The movement of the materials must be tracked, along with their hierarchical relationships. A sample may have many templates that are analyzed, or a template may have multiple analyses. When we do this a lot we need a way to see where our samples are in their particular processes. We need a workflow management system, like FinchLab.