Genetic analysis workflows are complex. You can expect that things will go wrong in the laboratory. Biology also manages to interfere and make things harder than you think they should be. Your workflow management system needs to show the relevant data, allow you to observe trends, and have flexible points were procedures can be repeated.
In the last few posts, I introduced genetic analysis workflows, concepts about lab and data workflows, and discussed why it is important to link the lab and data workflows. In this post I expand on the theme and show how a workflow system like the Geospiza FinchLab can be used to troubleshoot laboratory processes.
First, I'll review our figure from last week. Recall that it summarized 4608 paired forward / reverse sequence reads. Samples are represented by rows, and amplicons by column, so that each cell represents a single read from a sample and one of its amplicons. Color is used to indicate quality, with different colors showing the the number of Q20 bases divided by the read length (Q20/rL). Green is used for values between 0.60 and 1.00, blue for values between 0.30 and 0.59, and red for values less than 0.29. The summary showed patterns that, indicated lab failures and biological issues. You were asked to figure them out. Eric from seqanswers (a cool site for Next Gen info) took a stab at this, and got part of the puzzle solved.
Rows 1,2 and 7,8 show failed samples. We can spot this because of the red color across all the amplicons. Either the DNA preps failed to produce DNA, or something interfered with the PCR. Of course there are those pesky working reactions for both forward and reverse sequence in sample 1 column 8. My first impression is that there is a tracking issue. The sixth column also has one reaction that worked. Could this result indicate a more serious problem in sample tracking?
In addition to the red rows, some columns show lots of blue spots; these columns correspond to amplicons 7, 24 and 27. Remember that blue is an intermediate quality. An intermediate quality could be obtained if part of the sequence is good and part of the sequence is bad. Because the columns represent amplicons, when we see a pattern in a column it likely indicates s systematic issue for that amplicon. For example, in column 7, all of the data are intermediate quality. Columns 24 and 27 are more interesting because the striping pattern indicates that one sequencing reaction results in data with intermediate quality while the other looks good. Wouldn't it be great if we could drill down from this pattern and see a table of quality plots and also get to the sequence traces?
Getting to the bottom
In FinchLab we can drill down and view the underlying data. The figure below summarizes the data for amplicon 24. The panel on the left is the expanded heat map for the data set. The panel on right is a folder report summarizing the data from 192 reads for amplicon 24. It contains three parts: An information table that provides an overview of the details for the reads. A histogram plot that counts how many reads have a certain range of Q20 values, and a data table that summarizes each read in a row containing its name, the number of edit revisions, its Q20, Q20/rLen values, and a thumbnail quality plot showing the quality values for each base in the read. In the histogram, you can see that two distinct peaks are observed. About half the data have low Q20 values and half have high Q20 values, producing the striping pattern in the heat map. The data table shows two reads; one is the forward sequence and the other is its "reverse" pair. These data were brought together using the table's search function, in the "finder" bar. Note how the reads could fit together if one picture was reversed.
Could something in the sequence be interfering with the sequencing reaction?
To explore the data further, we need to look at the sequences themselves. We can do this by clicking the name and viewing the trace data online in our web browser, or we could click the FinchTV icon and view the sequence in FinchTV (bottom panel of the figure above). When we do this for the top read (left most trace) we see that, sure enough, there is a polyT track that we are not getting through. During PCR such regions can cause "drop outs" and result in mixtures of molecules that differ in size by one or two bases. A hallmark of such a problem is a sudden drop in data quality at the end of the poly nucleotide track because the mixture of molecules creates a mess of mixed bases. This explanation confirmed by the other read. When we view it in FinchTV (right most trace) we see poor data at the end of the read. Remember these data are reversed relative to the first read so when we reverse complement the trace (middle trace), we see that it "fits" together with the first read. A problem for such amplicons is that we now have only single stranded coverage. Since this problem occurred at the end of the read, half of the data are good and the other half are poor quality. If the problem occurred in the middle of the read, all of the data would show an intermediate quality like amplicon 7.
In genetic analysis data quality values are an important tool for assessing many lab and sample parameters. In this example, we were able to see systematic sample failures and sequence characteristics that can lead to intermediate quality data. We can use this information to learn about biological issues that interfere with analysis. But what about our potential tracking issue?
How might we determine if our samples are being properly tracked?