
Sunday, November 8, 2009

Expeditiously Exponential: Data Sharing and Standardization

We can all agree that our ability to produce genomics and other kinds of data is increasing at exponential rates. Less clear are the consequences for how these data will be shared and ultimately used. These topics were explored in last month's (Oct. 9, 2009) policy forum feature in the journal Science.

The first article, listed under the category "megascience," dealt with issues around sharing 'omics data. The challenge is that systems biology research demands that data from many kinds of instrument platforms (DNA sequencing, mass spectrometry, flow cytometry, microscopy, and others) be combined in different ways to produce a complete picture of a biological system. Today, each platform generates its own kind of "big" data that, to be useful, must be computationally processed and transformed into standard outputs. Moreover, the data are often collected by different research groups focused on particular aspects of a common problem. Hence, the full utility of the data being produced can only be realized when the data are made open and shared throughout the scientific community. The article listed past efforts in developing sharing policies, and its central table included 12 data sharing policies that are already in effect.

Sharing data solves only half of the problem; the other half is being able to use the data once they are shared. This requires that data be structured and annotated in ways that make them understandable by a wide range of research groups. Such standards typically include minimum information checklists that define specific annotations and which data should be kept from different platforms. The data and metadata are stored in structured documents that reflect a community's view about what is important to know with respect to how data were collected and the samples the data were collected from. The problem is that annotation standards are developed by diverse groups and, like the data, are expanding. This expansion creates new challenges for making data interoperable; the very problem standards try to address.

The article closed with high-level recommendations for enforcing policy through funding and publication requirements and acknowledged that full compliance requires that general concerns with pre-publication data use and patient information be addressed. More importantly, the article acknowledged that meeting data sharing and formatting standards has economic implications. That is, researchers need time-efficient data management systems, the right kinds of tools, and informatics expertise to meet standards. We also need to develop the right kind of global infrastructure to support data sharing.

Fortunately, complying with data standards is an area where Geospiza can help. First, our software systems rely on open, scientifically valid tools and technologies. In DNA sequencing we support community-developed alignment algorithms. The statistical analysis tools in GeneSifter Analysis Edition utilize R and BioConductor to compare gene expression data from both microarrays and DNA sequencing. Further, we participate in the community by contributing additional open-source tools and standards through efforts like the BioHDF project. Second, the GeneSifter Analysis and Laboratory platforms provide the time-efficient data management solutions needed to move data through their complete life cycle from collection, to intermediate analysis, to publishing files in standard formats.

GeneSifter lowers researchers' economic barriers to meeting data sharing and annotation standards, keeping the focus on doing good science with the data.

Sunday, November 1, 2009

GeneSifter Laboratory Edition Update

GeneSifter Laboratory Edition has been updated to version 3.13. This release has many new features and improvements that further enhance its ability to support all forms of DNA sequencing and microarray sample processing and data collection.

Geospiza Products

Geospiza's two primary products, GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE), form a complete software system that supports many kinds of genomics and genetic analysis applications. GSLE is the LIMS (Laboratory Information Management System) used by core labs and service companies worldwide that offer DNA sequencing (Sanger and Next Generation), microarray analysis, fragment analysis, and other forms of genotyping. GSAE is the analysis system researchers use to analyze their data and make discoveries. Both products are actively updated to keep current with the latest science and technological advances.

The new release of GSLE helps labs share workflows, perform barcode-based searching, view new data reports, simplify invoicing, and automate data entry through a new API (application programming interface).

Sharing Workflows

GSLE laboratory workflows make it possible for labs to define and track their protocols and the data that are collected when samples are processed. Each step in a protocol can be configured to collect any kind of data, such as OD values, bead counts, gel images, and comments, that are used to record sample quality. In earlier versions, protocols could be downloaded as PDF files that list the steps and their data. With 3.13, a complete workflow (steps, rules, custom data) can be downloaded as an XML file that can be uploaded into another GSLE system to recreate the entire protocol with just a few clicks. This feature simplifies protocol sharing and makes it possible for labs to test procedures in one system and add them to another when they are ready for production.
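
To give a feel for what such an export might contain, here is a minimal sketch of inspecting a workflow XML file with Python before loading it into a second system. The element and attribute names (workflow, step, required) are illustrative assumptions, not GSLE's actual schema.

# Hypothetical sketch: list the steps in an exported workflow XML file.
# The tag and attribute names here are assumptions for illustration only.
import xml.etree.ElementTree as ET

def summarize_workflow(path):
    tree = ET.parse(path)
    root = tree.getroot()
    print("Workflow:", root.get("name"))
    for i, step in enumerate(root.findall(".//step"), start=1):
        required = step.get("required", "false")
        print(f"  {i}. {step.get('name')} (required={required})")

# summarize_workflow("rna_seq_library_prep.xml")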

Barcode Searching and Sample Organization

Sometimes a lab needs to organize separate tubes in 96-well racks for sample preparation. Assigning each tube's rack location can be an arduous process. However, if the tubes are labeled with barcode identifiers, a bed scanner can be used to make the assignments. GSLE 3.13 provides an interface to upload bed scanner data and assign tube locations in a single step. Also, new search capabilities have been added to find orders in the system using sample or primer identifiers. For example, orders can be retrieved by scanning a barcode from a tube in the search interface.


Reports and Data

Throughout GSLE, many details about data can be reviewed using predefined reports. In some cases, pages can be quite long, but only a portion of the report is interesting. GSLE now lets you collapse sections of report pages to focus on specific details. New download features have also been added to better support access to those very large NGS data files.

GSLE has always been good at identifying duplicate data in the system, but not always as good at letting you decide how duplicate data are managed. Managing duplicate data is now more flexible to better support situations where data need to be reanalyzed and reloaded.

The GSLE data model makes it possible to query the database using SQL. In 3.13, the view tables interface has been expanded so that the data stored in each table can be reviewed with a single click.
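
As a minimal illustration of the idea, a query of this kind could be run from any Python DB-API connection. The table and column names below are hypothetical placeholders; the real schema is what the view tables interface lets you browse.

# Hypothetical sketch of an SQL query run through a Python DB-API driver.
# "samples" and "status" are placeholder names, not the actual GSLE schema.
def count_samples_by_status(connection):
    cursor = connection.cursor()
    cursor.execute("SELECT status, COUNT(*) FROM samples GROUP BY status")
    return dict(cursor.fetchall())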

Invoices

Core labs that send invoices will benefit from changes that make it possible to download many PDF-formatted orders and invoices in a single zipped folder. Configurable automation capabilities have also been added to set invoice due dates and generate multiple invoices from a set of completed orders.

API Tools

As automation and system integration needs increase, external programs are used to enter data from other systems. GSLE 3.13 supports automated data entry through a novel self-documenting API. The API takes advantage of GSLE's built-in data validation features that are used by the system's web-based forms. At each site, the API can be turned on and off by on-site administrators, and its access can be limited to specific users. This way, all system transactions are easily tracked using existing GSLE logging capabilities. In addition to data validation and access control, the API is self-documenting. Each API-enabled form has a header that includes key codes, example documentation, and features to view and manually upload formatted data to test automation programs and help system integrators get their work done. GSLE 3.13 further supports enterprise environments with an improved API that is used to query external password authentication servers.
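
To make the idea concrete, here is a hypothetical sketch of what an automated order submission might look like from Python. The endpoint path, field names, and key handling are assumptions for illustration; a real integration would follow the key codes and examples documented in each API-enabled form.

# Hypothetical sketch of automated order entry through a web API.
# The URL path and field names are illustrative assumptions only.
import json
import urllib.request

def submit_order(base_url, api_key, sample_name, service):
    payload = json.dumps({
        "api_key": api_key,          # credential issued by the site administrator
        "sample_name": sample_name,  # matches a field defined on the order form
        "service": service,
    }).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/api/orders",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # server-side validation errors would come back here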

Thursday, October 8, 2009

Resequencing and Cancer

Yesterday we released news about new funding from NIH for a project to work on ways to improve how variations between DNA sequences are detected using Next Generation Sequencing (NGS) technology. The project emphasizes detecting rare variation events to improve cancer diagnostics, but the work will support a diverse range of resequencing applications.

Why is this important?

In October 2008, U.S. News & World Report published an article by Bernadine Healy, former head of the NIH. The tag line “understanding the genetic underpinnings of cancer is a giant step toward personalized medicine” (1) underscores how the popular press views the promise of recent advances in genomics technology in general, and the progress toward understanding the molecular basis of cancer. In the article, Healy presents a scenario where, in 2040, a 45-year-old woman, who has never smoked, develops lung cancer. She undergoes outpatient surgery, and her doctors quickly scrutinize the tumor’s genes, using a desktop computer to analyze the tumor genomes and her medical records to create a treatment plan. She is treated, the cancer recedes, and subsequent checkups are conducted to monitor tumor recurrence. Should a tumor be detected, her doctor would quickly analyze the DNA of a few of the shed tumor cells and prescribe a suitable next round of therapy. The patient lives a long happy life, and keeps her hair.

This vision of successive treatments based on genomic information is not unrealistic, claims Healy, because we have learned that while many cancers can look homogeneous in terms of morphology and malignancy, they are indeed highly complex and varied when examined at the genetic level. The disease of cancer is in reality a collection of heterogeneous diseases that, even for common tissues like the prostate, can vary significantly in terms of onset and severity. Thus, it is often the case that cancer treatments, based on tissue type, fail, leaving patients to undergo a long painful process of trial and error therapies with multiple severely toxic compounds.

Because cancer is a disease of genomic alterations, understanding the sources, causes, and kinds of mutations, their connection to specific types of cancer, and how they may predict tumor growth is worthwhile. The human cancer genome project (2) and initiatives like the international cancer genome consortium (3) have demonstrated this concept. The kinds of mutations found in tumor populations thus far by NGS include single nucleotide polymorphisms (SNPs), insertions and deletions, and small structural copy number variations (CNVs) (4, 5). From early studies it is clear that a greater amount of genomic information will be needed to make Healy's scenario a reality. Next generation sequencing (NGS) technologies will drive this next phase of research and enable our deeper understanding.

Project Synopsis

The great potential for the clinical applications of new DNA sequencing technologies comes from their highly sensitive ability to assess genetic variation. However, to make these technologies clinically feasible, we must assay patient samples at far higher rates than can be done with current NGS procedures. Today, the experiments applying NGS in cancer research have investigated small numbers of samples in great detail, in some cases comparing entire genomes from tumor and normal cells from a single patient (6-8). These experiments show that when a region is sequenced with sufficient coverage, numerous mutations can be identified.

To move NGS technologies into clinical use, many costs must decrease. Two ways costs can be lowered are to increase sample density and reduce the number of reads needed per sample. Because cost is a function of turnaround time and read coverage, and read coverage is a function of the signal to noise ratio, assays with higher background noise, due to errors in the data, will require higher sampling rates to detect true variation and will be more expensive. To put this in context, future cancer diagnostic assays will likely need to look at over 4000 exons per test. In cases like bladder cancer, or cancers where stool or blood are sampled, non-invasive tests will need to detect variations in one out of 1000 cells. Thus it is extremely important that we understand signal/noise ratios and be able to calculate read depth in a reliable fashion.
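
To make the sampling arithmetic concrete, here is a back-of-the-envelope sketch (not part of the funded project's methods) that asks: if a true variant is present in 1 of 1000 cells, how deep must we sequence a site before we expect to observe it at least a few times? It assumes ideal random sampling and ignores sequencing errors, which is exactly the simplification discussed next.

# Toy binomial calculation: probability of sampling a rare variant allele
# at least k times at one site, given read depth and allele frequency.
# Assumes perfectly random sampling and error-free reads.
from math import comb

def prob_at_least_k(depth, allele_freq, k):
    p_less = sum(
        comb(depth, i) * allele_freq**i * (1 - allele_freq)**(depth - i)
        for i in range(k)
    )
    return 1 - p_less

for depth in (1000, 5000, 10000, 20000):
    print(depth, round(prob_at_least_k(depth, 0.001, 3), 3))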

Currently we have a limited understanding of how many reads are needed to detect a given rare mutation. Detecting mutations depends on a combination of sequencing accuracy and depth of coverage. The signal (true mutations) to noise (false mutations, hidden mutations) depends on how many times we see a correct result. Sequencing accuracy is affected by multiple factors that include sample preparation, sequence context, sequencing chemistry, instrument accuracy, and basecalling software. The current depth-of-coverage calculations are based on an assumption that sampling is random, which is not valid in the real world. Corrections will have to be applied to adjust for real-world sampling biases that affect read recovery and sequencing error rates (9-11).
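
The standard calculation referred to above treats reads as landing uniformly at random, so the depth at any base follows a Poisson distribution with mean coverage c = (number of reads x read length) / target size. Here is a small sketch of that idealized model; the point of the project is that real data violate these assumptions and require corrections.

# Idealized Poisson depth-of-coverage model (the assumption the text says
# breaks down in practice). Numbers below are made-up example values.
from math import exp, factorial

def poisson_prob_at_least(mean_depth, k):
    # P(depth >= k) at a given base under the random-sampling model
    return 1 - sum(exp(-mean_depth) * mean_depth**i / factorial(i) for i in range(k))

mean_depth = (2_000_000 * 50) / 4_000_000   # e.g. 2M 50-base reads over a 4 Mb target
print(round(mean_depth, 1), round(poisson_prob_at_least(mean_depth, 10), 4))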

Developing clinical software systems that can work with NGS technologies to quickly and accurately detect rare mutations requires a deep understanding of the factors that affect NGS data collection and interpretation. This information needs to be integrated into decision control systems that can, through a combination of computation and graphical displays, automate and aid a clinician’s ability to verify and validate results. Developing such systems is a major undertaking involving a combination of research and development in the areas of laboratory experimentation, computational biology, and software development.

Positioned for Success

Detecting small genetic changes in clinical samples is ambitious. Fortunately, Geospiza has the right products to deliver on the goals of the research. GeneSifter Lab Edition handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. The data management and analysis server makes the system scalable through a distributed architecture. Combined with GeneSifter Analysis Edition, a complete platform is created to rapidly prototype new data analysis workflows needed to test new analysis methods, experiment with new data representations, and iteratively develop data models to integrate results with experimental details.

References:

Press Release: Geospiza Awarded SBIR Grant for Software Systems for Detecting Rare Mutations

1. Healy, 2008. "Breaking Cancer's Gene Code - US News and World Report" http://health.usnews.com/articles/health/cancer/2008/10/23/breaking-cancers-gene-code_print.htm

2. Working Group, 2005. "Recommendation for a Human Cancer Genome Project" http://www.genome.gov/Pages/About/NACHGR/May2005NACHGRAgenda/ReportoftheWorkingGrouponBiomedicalTechnology.pdf

3. ICGC, 2008. "International Cancer Genome Consortium - Goals, Structure, Policies & Guidelines - April 2008" http://www.icgc.org/icgc_document/

4. Jones S., et al., 2008. "Core Signaling Pathways in Human Pancreatic Cancers Revealed by Global Genomic Analyses." Science 321, 1801.

5. Parsons D.W., et al., 2008. "An Integrated Genomic Analysis of Human Glioblastoma Multiforme." Science 321, 1807.

6. Campbell P.J., et al., 2008. "Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing." Proc Natl Acad Sci U S A 105, 13081-13086.

7. Greenman C., et al., 2007. "Patterns of somatic mutation in human cancer genomes." Nature 446, 153-158.

8. Ley T.J., et al., 2008. "DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome." Nature 456, 66-72.

9. Craig D.W., et al., 2008. "Identification of genetic variants using bar-coded multiplexed sequencing." Nat Methods 5, 887-893.

10. Ennis P.D., et al., 1990. "Rapid cloning of HLA-A,B cDNA by using the polymerase chain reaction: frequency and nature of errors produced in amplification." Proc Natl Acad Sci U S A 87, 2833-2837.

11. Reiss J., et al., 1990. "The effect of replication errors on the mismatch analysis of PCR-amplified DNA." Nucleic Acids Res 18, 973-978.

Sunday, March 8, 2009

Bloginar: Next Gen Laboratory Systems for Core Facilities

Geospiza kicked off February by attending the AGBT and ABRF conferences. As part of our participation at ABRF, we presented a scenario in our poster where a core lab provides Next Generation Sequencing (NGS) transcriptome analysis services. This story shows how GeneSifter Lab and Analysis Edition’s capabilities overcome the challenges of implementing NGS in a core lab environment.

Like the last post, which covered our AGBT poster, the following poster map will guide the discussion.


As this poster overlaps the previous poster in terms of providing information about RNA assays and analyzing the data, our main points below will focus on how GeneSifter Lab Edition solves challenges related to laboratory and business processes associated with setting up a new lab for NGS or bringing NGS into an existing microarray or Sanger sequencing lab.

Section 1 contains the abstract, an introduction to the core laboratory, and background information on different kinds of transcription profiling experiments.

The general challenge for a core lab lies in the need to run a business that offers a wide variety of scientific services for which samples (physical materials) are converted to data and information that have biological meaning. Different services often require different lab processes to produce different kinds of data. To facilitate and direct lab work, each service requires specialized information and instructions for samples that will be processed. Before work is started, the lab must review the samples and verify that the information has been correctly delivered. Samples are then routed through different procedures to prepare them for data collection. In the last steps, data are collected, reviewed, and the results are delivered back to clients. At the end of the day (typically monthly), orders are reviewed and invoices are prepared either directly or by updating accounting systems.

In the case of NGS, we are learning that the entire data collection and delivery process gets more complicated. When compared to Sanger sequencing, genotyping, or other assays that are run in 96-well formats, sample preparation is more complex. NGS requires that DNA libraries be prepared, and different steps of the process need to be measured and tracked in detail. Also, complicated bioinformatics workflows are needed to understand the data from both a quality control and biological meaning context. Moreover, NGS requires a substantial investment in information technology.

Section 2 walks through the ways in which GeneSifter Lab Edition helps to simplify the NGS laboratory operation.

Order Forms

In the first step, an order is placed. Screenshots show how GeneSifter can be configured for different services. Labs can define specialized form fields using a variety of user interface elements like check boxes, radio buttons, pull down menus, and text entry fields. Fields can be required or optional, and special rules such as ranges for values can be applied to individual fields within specific forms. Orders can also be configured to take files as attachments to track data, like gel images, about samples. To handle that special “for lab use only” information, fields in forms can be specified as laboratory use only. Such fields are hidden from the customer's view and are filled in later by lab personnel when the orders are processed. The advantage of GeneSifter’s order system is that the pertinent information is captured electronically in the same system that will be used to track sample processing and organize data. Indecipherable paper forms are eliminated along with the problem of finding information scattered on multiple computers.

Web-forms do create a special kind of data entry challenge. Specifically, when there is a lot of information to enter for a lot of samples, filling in numerous form fields on a web page can be a serious pain. GeneSifter solves this problem in two ways:

First, all forms can have “Easy Fill” controls that provide column highlighting (for fast tab-and-type data entry), auto fill downs, and auto fill downs with number increments, so one can easily “copy” common items into all cells of a column or increment an ending number across all values in a column. When these controls are combined with the “Range Selector,” a powerful web-based user interface makes it easy to enter large numbers of values quickly in flexible ways.

Second, sometimes the data to be entered are already in an Excel spreadsheet. To solve this problem, each form contains a specialized Excel spreadsheet validator. The form can be downloaded as an Excel template, and the rules previously assigned to each field when the form was created are used to check data when they are uploaded. This process spots problems with data items and reports them at upload time, when they are easy to fix, rather than later when information is harder to find. This feature eliminates endless cycles of contacting clients to get the correct information.
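
Conceptually, the validator applies each field's rules to every row of the uploaded sheet and reports the problems together. The sketch below illustrates that kind of rule checking with made-up field names and rules; it is not GSLE's actual implementation.

# Illustrative rule checking for rows parsed from an uploaded spreadsheet.
# Field names and rules are invented examples, not GSLE's configuration.
RULES = {
    "sample_name": {"required": True},
    "concentration_ng_ul": {"required": True, "min": 10.0, "max": 500.0},
    "volume_ul": {"min": 5.0},
}

def validate_rows(rows):
    errors = []
    for line_no, row in enumerate(rows, start=2):   # row 1 is the header
        for field, rule in RULES.items():
            value = row.get(field)
            if value in (None, ""):
                if rule.get("required"):
                    errors.append(f"row {line_no}: {field} is required")
                continue
            if "min" in rule or "max" in rule:       # numeric fields only
                try:
                    number = float(value)
                except ValueError:
                    errors.append(f"row {line_no}: {field} must be a number")
                    continue
                if "min" in rule and number < rule["min"]:
                    errors.append(f"row {line_no}: {field} below minimum {rule['min']}")
                if "max" in rule and number > rule["max"]:
                    errors.append(f"row {line_no}: {field} above maximum {rule['max']}")
    return errors

print(validate_rows([{"sample_name": "", "concentration_ng_ul": "2"}]))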

Laboratory Processing

Once order data are entered, the next step is to process orders. The middle of section 2 describes this process using an RNA-Seq assay as an example. Like other NGS assays, the RNA-Seq protocol has many steps involving RNA purification, fragmentation, random primed conversion into cDNA, and DNA library preparation of the resulting cDNA for sequencing. During the process, the lab needs to collect data on RNA and DNA concentration as well as determine the integrity of the molecules throughout the process. If a lab runs different kinds of assays, it will have to manage multiple procedures that may have different requirements for the ordering of steps and the laboratory data that need to be collected.

By now it is probably not a surprise to learn that GeneSifter Lab Edition has a way to meet this challenge too. To start, workflows (lab procedures) can be created for any kind of process with any number of steps. The lab defines the number of steps and their order and which steps are required (like the order forms). Having the ability to mix required and optional steps in a workflow gives a lab the ultimate flexibility to support those “we always do it this way, except the times we don’t” situations. For each step the lab can also define whether or not any additional data needs to be collected along the way. Numbers, text, and attachments are all supported so you can have your Nanodrop and Bioanalyzer too.

Next, an important feature of GeneSifter workflows is that a sample can move from one workflow to another. This modular approach means that separate workflows can be created for RNA preparation, cDNA conversion, and sequencing library preparation. If a lab has multiple NGS platforms, or a combination of NGS and microarrays, they might find that a common RNA preparation procedure is used, but the processes diverge when the RNA is converted into forms for collecting data. For example, aliquots of the same RNA preparation may be assayed and compared on multiple platforms. In this case a common RNA preparation protocol is followed, but sub-samples are taken through different procedures, like a microarray and NGS assay, and their relationship to the “parent” sample must be tracked. This kind of scenario is easy to set up and execute in GeneSifter Lab Edition.

Finally, one of GeneSifter’s greatest advantages is that a customized system with all of the forms, fields, Excel import features, and modular workflows can be added by lab operators without any programming. Achieving similar levels of customization with traditional LIMS products takes months to years, with initial and recurring costs of six or more figures.

Collecting Data

The last step of the process is collecting the data, reviewing it, and making sequences and results available to clients. Multiple screenshots illustrate how this works in GeneSifter Lab Edition. For each kind of data collection platform, a “run” object is created. The run holds the information about reactions (the samples ready to run) and where they will be placed in the container that will be loaded into the data collection instrument. In this context, the container describes 96- or 384-well plates, glass slides with divided areas called lanes, regions, or chambers, or microarray chips. All of these formats are supported, and in some cases specialized files (sample sheets, plate records) are created and loaded into instrument collection software to inform the instrument about sample placement and run conditions for individual samples.
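
As a simple illustration of the sample sheet idea, the sketch below writes a generic CSV from a run's reaction placements. The column names and layout are placeholders; each instrument vendor defines its own sample sheet format, which the system generates from the run information.

# Generic, made-up sample sheet written from reaction placements.
# Real instrument sample sheets have vendor-specific columns and headers.
import csv

reactions = [
    {"well": "A01", "sample": "Patient_001", "primer": "M13F"},
    {"well": "A02", "sample": "Patient_002", "primer": "M13F"},
]

with open("sample_sheet.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["well", "sample", "primer"])
    writer.writeheader()
    writer.writerows(reactions)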

During the run, samples are converted to data. This process, different for each kind of data collection platform, produces variable numbers and kinds of files that are organized in completely different ways. Using tools that work with GeneSifter, raw data and tracking information are entered into the database to simplify access to the data at a later time. The database also associates sample names and other information with data files, eliminating the need to rename files with complex tracking schemes. The last steps of the process involve reviewing quality information and deciding whether to release data to clients or repeat certain steps of the process. When data are released, each client receives an email directing them to their data.

The lab updates the orders and optionally creates invoices for services. GeneSifter Lab Edition can be used to manage those business functions as well. We’ll cover GeneSifter’s pricing and invoicing tools at another time; be assured they are as complete as the other parts of the system.

NGS requires more than simple data delivery

Section 3 covers issues related to the computational infrastructure needed to support NGS and the data analysis aspects of the NGS workflow. In this scenario, our core lab also provides data analysis services to convert those multi-million read files into something that can be used to study biology. Much of this was covered in the previous post, so it will not be repeated here.

I will summarize by making the final point that Geospiza’s GeneSifter products cover all aspects of setting up a lab for NGS. From sample preparation, to collecting data, to storing and distributing results, to running complex bioinformatics workflows and presenting information in ways that yield scientifically meaningful results, a comprehensive solution is offered. GeneSifter products can be delivered as hosted solutions to lower costs. Our hosted, Software as a Service, solutions allow groups to start inexpensively and manage costs as their needs scale. More importantly, unlike in-house IT systems, which require significant planning and implementation time to remodel (or build) server rooms and install computers, GeneSifter products get you started as soon as you decide to sign up.

Wednesday, October 8, 2008

Road Trip: AB SOLiD Users Meeting

Wow! That's the best way to summarize my impressions from the Applied Biosystems (AB) SOLiD users conference last week, when AB launched their V3 SOLiD platform. AB claims that this system will be capable of delivering a human genome's worth of data for about $10,000 US.

Last spring, the race to the $1000 genome leaped forward when AB announced that they sequenced a human genome at 12-fold coverage for $60,000. When the new system ships in early 2009, that same project can be completed for $10,000. Also, this week others have claimed progress towards a $5000 human genome.

That's all great, but what can you do with this technology besides human genomes?

That was the focus of the SOLiD users conference. For a day and a half, we were treated to presentations from scientists and product managers from AB as well as SOLiD customers who have been developing interesting applications. Highlights are described below.

Technology Improvements:

Increasing Data Throughput - Practically everyone is facing the challenge of dealing with large volumes of data, and now we've learned the new version of the SOLiD system will produce even more. A single instrument run will produce between 125 million and 400 million reads, depending on the application. This scale-up is achieved by increasing the bead density on a slide, dropping the overall cost per individual read. Read lengths are also increasing, making it possible to get between 30 and 40 gigabases of data from a run. And the amount of time required for each run is shrinking; not only can you get all of these data, you can do it again more quickly.

Increasing Sample Scale - Many people like to say, yes, the data is a problem, but at least the sample numbers are low, so sample tracking is not that hard.

Maybe they spoke too soon.

AB and the other companies with Next Gen technologies are working to deliver "molecular barcodes" that allow researchers to combine multiple samples on a single slide. This is called "multiplexing." In multiplexing, the samples are distinguished by tagging each one with a unique sequence, the barcode. After the run, the software uses the sequence tags to sort the data into their respective data sets. The bottom line is that we will go from a system that generates a lot of data from a few samples, to a system that generates even more data from a lot of samples.
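
A toy sketch of the demultiplexing step looks something like the following: read the first few bases of each read, look the tag up in a barcode table, and drop the rest of the read into the matching sample's bin. The barcodes and reads here are invented, and real instrument software also tolerates sequencing errors in the tag.

# Toy barcode demultiplexer with exact-match tags (invented example data).
BARCODES = {"ACGT": "sample_A", "TGCA": "sample_B", "GATC": "sample_C"}

def demultiplex(reads, barcode_length=4):
    bins = {name: [] for name in BARCODES.values()}
    bins["unassigned"] = []
    for read in reads:
        tag, insert = read[:barcode_length], read[barcode_length:]
        bins[BARCODES.get(tag, "unassigned")].append(insert)
    return bins

print({k: len(v) for k, v in demultiplex(["ACGTTTTT", "TGCAGGGG", "NNNNAAAA"]).items()})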

Science:

What you can do with 100's of millions of reads: On the science side, there were many good presentations that focused on RNA-Seq and variant detection using the SOLiD system. Of particular interest was Dr. Gail Payne's presentation on the work, recently published in Genome Research, entitled "Whole Genome Mutational Profiling Using Next Generation Sequencing Technology." In the paper, the 454, Illumina, and SOLiD sequencing platforms were compared for their abilities to accurately detect mutations in a common system. This is one of the first head to head to head comparisons to date. Like the presidential debates, I'm sure each platform will be claimed to be the best by its vendor.

From the presentation and paper, the SOLiD platform does offer a clear advantage in its total throughput capacity. 454 showed the long-read advantage in that approximately 1.5% more of the yeast genome studied was covered by 454 data than with shorter read technology. And the SOLiD system, with its dibase (color space) encoding, seemed to provide higher sequence accuracy. When the reads were normalized to the same levels of coverage, a small advantage for SOLiD can be seen.

When false positive rates of mutation detection were compared, SOLiD had zero false positives at all levels of coverage (6x, 8x, 10x, 20x, 30x, 175x [full run of two slides]), Illumina had two false positives at 6x and 13x and zero false positives at 19x and 44x (full run of one slide) coverage, and 454 had 17, 6, and 1 false positives at 6x, 8x, and 11x (full run) coverage, respectively.

In terms of false negative (missed) mutations, all platforms did a good job. At coverages above 10x, none of the platforms missed any mutations. The 454 platform missed a single mutation at 6x and 8x coverage and Illumina missed two mutations at 6x coverage. SOLiD, on the other hand, missed four and five at 8x and 6x coverage, respectively.

What was not clear from the paper and data was the reproducibility of these results. From what I can tell, single DNA libraries were prepared and sequenced; replicates were lacking. Would the results change if each library preparation and sequencing process was repeated?

Finally, the work demonstrates that it is very challenging to perform a clean "apples to apples" comparison. The 454 and Illumina data were aligned with Mosaik and the SOLiD data were aligned with MapReads. Since each system produces different error profiles, and the different software programs each make different assumptions about how to use the error profiles to align data and assess variation, the results should not be over-interpreted. I do, however, agree with the authors that these systems are well-suited for rapidly detecting mutations in a high throughput manner.

ChIP-Seq / RNA-Seq: On the second day, Dr. Jessie Gray presented work on combining ChIP-Seq and RNA-Seq to study gene expression. This is important work because it illustrates the power of Next Gen technology and creative ways in which experiments can be designed.

Dr. Gray's experiment was designed to look at this question: When we see that a transcription factor is bound to DNA, how do we know if that transcription factor is really involved in turning on gene expression?

ChIP-Seq allows us to determine where different transcription factors are bound to DNA at a given time, but it does not tell us whether that binding event turned on transcription. RNA-Seq tells us if transcription is turned on after a given treatment or point in time, but it doesn't tell us which transcription factors were involved. Thus, if we can combine ChIP-Seq and RNA-Seq measurements, we can elucidate a cause and effect model and find where a transcription factor is binding and which genes it potentially controls.
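
A toy sketch of that combination: keep the genes that are both up-regulated in the RNA-Seq data and have a ChIP-Seq peak near their transcription start site. The gene names, coordinates, and 2 kb window below are invented purely for illustration.

# Toy intersection of ChIP-Seq peaks with up-regulated genes (made-up data).
chip_peaks = {"chr1": [(10_000, 10_400), (55_200, 55_600)]}          # peak intervals
upregulated = {"geneA": ("chr1", 11_500), "geneB": ("chr2", 80_000)}  # gene: (chrom, TSS)

def bound_and_induced(peaks, genes, window=2_000):
    hits = []
    for gene, (chrom, tss) in genes.items():
        for start, end in peaks.get(chrom, []):
            if start - window <= tss <= end + window:
                hits.append(gene)
                break
    return hits

print(bound_and_induced(chip_peaks, upregulated))  # ['geneA']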

This might be harder than it sounds:

As I listened to this work, I was struck by two challenges. On the computational side, one has to not only think about how to organize and process the sequence data into alignments and reduce those aligned datasets into organized tables that can be compared, but also how to create the right kind of interfaces for combining and interactively exploring the data sets.

On the biochemistry side, the challenges presented with ChIP-Seq reminded me of the old adage about trying to purify disappearase - "the more you purify, the less there is." ChIP-Seq and other assays that involve multiple steps of chemical treatments and purification produce vanishingly small amounts of material for sampling. The latter challenge complicates the first, because in systems where one works with "invisible" amounts of DNA, a lot of creative PCR, like "in gel PCR," is required to generate sufficient quantities of sample for measurement.

PCR is good for many things, including generating artifacts. So, the computation problem expands. A software system that generates alignments, reduces them to data sets that can be combined in different ways, and provides interactive user interfaces for data exploration, must also be able to understand common artifacts so that results can be quality controlled. Data visualizations must also be provided so that researchers can distinguish biological observations from experimental error.

These are exactly the kinds of problems that Geospiza solves.

Wednesday, August 20, 2008

Next Gen DNA Sequencing Is Not Sequencing DNA

In the old days, we used DNA sequencing primarily to learn about the sequence and structure of a cloned gene. As the technology and throughput improved, DNA sequencing became a tool for investigating entire genomes. Today, with the exception of de novo sequencing, Next Gen sequencing has changed the way we use DNA sequences. We're no longer looking for new DNA sequences. We're using Next Gen technologies to perform quantitative assays with DNA sequences as the data points. This is a different way of thinking about the data and it impacts how we think about our experiments, data analysis, and IT systems.

In de novo sequencing, the DNA sequence of a new genome, or of genes from the environment, is elucidated. De novo sequencing ventures into the unknown. Each new genome brings new challenges with respect to interspersed repeats, large segmental gene duplications, polyploidy, and interchromosomal variation. The high-redundancy samples obtained from Next Gen technology lower the cost and speed up this process because less time is required to get additional data to fill in gaps and finish the work.

The other ultra high throughput DNA sequencing applications, on the other hand, focus on collecting sequences from DNA or RNA molecules for which we already have genomic data. Generally called "resequencing," these applications involve collecting and aligning sequence reads to genomic reference data. Experimental information is obtained by tabulating the frequency, positional information, and variation of the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or in populations, are compared in different ways to make discoveries and draw conclusions.

DNA sequences are information rich data points

EST (expressed sequence tag) sequencing was one of the first applications to use sequence data in a quantitative way. In EST applications, mRNA from cells was isolated, converted to cDNA, cloned, and sequenced. The data from an EST library provided both new and quantitative information. Because each read came from a single molecule of mRNA, a set of ESTs could be assembled and counted to learn about gene expression. The composition and number of distinct mRNAs from different kinds of tissues could be compared and used to identify genes that were expressed at different time points during development, in different tissues, and in different disease states, such as cancer. The term "tag" was invented to indicate that ESTs could also be used to identify the genomic location of mRNA molecules. Although the information from EST libraries has been informative, lower cost methods such as microarray hybridization and real-time PCR assays replaced EST sequencing over time, as more genomic information became available.

Another quantitative use of sequencing has been to assess allele frequency and identify new variants. These assays are commonly known as "resequencing" since they involve sequencing a known region of genomic DNA in a large number of individuals. Since the regions of DNA under investigation are often related to health or disease, the NIH has proposed that these assays be called "Medical Sequencing." The suggested change also serves to avoid giving the public the impression that resequencing is being carried out to correct mistakes.

Unlike many assay systems (hybridization, enzyme activity, protein binding ...) where an event or complex interaction is measured and described by a single data value, a quantitative assay based on DNA sequences yields a greater variety of information. In a technique analogous to using an EST library, an RNA library can be sequenced, and the expression of many genes can be measured at once, by counting the number of reads that align to a given position or reference. If the library is prepared from DNA, a count of the aligned reads could measure the copy number of a gene. The composition of the read data itself can be informative. Mismatches in aligned reads can help discern alleles of a gene, or members of a gene family. In a variation assay, reads can both assess the frequency of a SNP and discover new variation. DNA sequences could be used in quantitative assays to some extent with Sanger sequencing, but the cost and labor requirements prevented widespread adoption.
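
The bookkeeping behind these quantitative uses is simple counting, as the toy sketch below shows with invented alignment records: reads per gene approximate expression or copy number, and base counts at a position approximate allele frequency.

# Counting sketch with invented alignment records (gene, position, base).
from collections import Counter

alignments = [
    ("TP53", 7_577_120, "C"),
    ("TP53", 7_577_120, "T"),   # a mismatched read suggesting a variant
    ("TP53", 7_577_121, "A"),
    ("GAPDH", 6_534_300, "G"),
]

reads_per_gene = Counter(gene for gene, _, _ in alignments)
bases_at_site = Counter(base for gene, pos, base in alignments if pos == 7_577_120)
print(reads_per_gene)   # expression / copy-number style counts
print(bases_at_site)    # allele counts at one position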

Next Gen adds a global perspective and new challenges

The power of Next Gen experiments comes from sequencing DNA libraries in a massively parallel fashion. Traditionally, a DNA library was used to clone genes. The library was prepared by isolating and fragmenting genomic DNA, ligating the pieces to a plasmid vector, transforming bacteria with the ligation products, and growing colonies of bacteria on plates with antibiotics. The plasmid vector would allow a transformed bacterial cell to grow in the presence of an antibiotic so that transformed cells could be separated from other cells. The transformed cells would then be screened for the presence of a DNA insert or gene of interest through additional selection, colorimetric assay (e.g. blue / white), or blotting. Over time, these basic procedures were refined and scaled up in factory style production to enable high throughput shotgun sequencing and EST sequencing. A significant effort and cost in Sanger sequencing came from the work needed to prepare and track large numbers of clones, or PCR-products, for data linking and later retrieval to close gaps or confirm results.

In Next Gen sequencing, DNA libraries are prepared, but the DNA is not cloned. Instead other techniques are used to "separate," amplify, and sequence individual molecules. The molecules are then sequenced all at once, in parallel, to yield large global data sets in which each read represents a sequence from an individual molecule. The frequency of occurrence of a read in the population of reads can now be used to measure the concentration of individual DNA molecules. Sequencing DNA libraries in this fashion significantly lowers costs, and makes previously cost prohibitive experiments possible. It also changes how we need to think about and perform our experiments.

The first change is that preparing the DNA library is the experiment. Tag profiling, RNA-seq, small RNA, ChIP-seq, DNAse hypersensitivity, methylation, and other assays all have specific ways in which DNA libraries are prepared. Starting materials and fragmentation methods define the experiment and how the resulting datasets will be analyzed and interpreted. The second change is that large numbers of clones no longer need to be prepared, tracked, and stored. This reduces the number of people needed to process samples, and reduces the need for robotics, large number of thermocyclers, and other laboratory equipment. Work that used to require a factory setting can now be done in a single laboratory, or mailroom if you believe the ads.

Attention to details counts

Even though Next Gen sequencing gives us the technical capabilities to ask detailed and quantitative questions about gene structure and expression, successful experiments demand that we pay close attention to the details. Obtaining data that are free of confounding artifacts and accurately represent the molecules in a sample demands good technique and a focus on detail. DNA libraries no longer involve cloning, but their preparation does require multiple steps performed over multiple days. During this process, different kinds of data, ranging from gel images to discrete data values, may be collected and used later for troubleshooting. Tracking the experimental details requires that a system be in place that can be configured to collect information from any number and kind of process. The system also needs to be able to link data to the samples, and convert the information from millions of sequence data points to tables, graphics, and other representations that match the context of the experiment and give a global view of how things are working. FinchLab is that kind of system.

Friday, August 8, 2008

ChIP-ing Away at Analysis

ChIP-Seq is becoming a popular way to study the interactions between proteins and DNA. This new technology is made possible by Next Gen sequencing techniques and sophisticated tools for data management and analysis. Next Gen DNA sequencing provides the power to collect the large amounts of data required. FinchLab is the software system needed to track the lab steps, initiate analysis, and see your results.

In recent posts, we stressed the point that unlike Sanger sequencing, Next Gen sequencing demands that data collection and analysis be tightly coupled, and presented our initial approach of analyzing Next Gen data with the Maq program. We also discussed how the different steps (basecalling, alignment, statistical analysis) provide a framework for analyzing Next Gen data and described how these steps belong to three phases: primary, secondary, and tertiary data analysis. Last, we gave an example of how FinchLab can be used to characterize data sets for Tag Profiling experiments. This post expands the discussion to include characterization of data sets for ChIP-Seq.

ChIP-Seq

ChIP (Chromatin Immunoprecipitation) is a technique where DNA-binding proteins, like transcription factors, can be localized to regions of a DNA molecule. We can use this method to identify which DNA sequences control expression and regulation for diverse genes. In the ChIP procedure, cells are treated with a reversible cross-linking agent to "fix" proteins to other proteins that are nearby, as well as the chromosomal DNA where they're bound. The DNA is then purified and broken into smaller chunks by digestion or shearing, and antibodies are used to precipitate any protein-DNA complexes that contain their target antigen. After the immunoprecipitation step, unbound DNA fragments are washed away, the bound DNA fragments are released, and their sequences are analyzed to determine the DNA sequences that the proteins were bound to. Only a few years ago, this procedure was much more complicated than it is today; for example, the fragments had to be cloned before they could be sequenced. When microarrays became available, a microarray-based technique called ChIP-on-chip made this assay more efficient by allowing a large number of precipitated DNA fragments to be tested in fewer steps.

Now, Next Gen sequencing takes ChIP assays to a new level [1]. In ChIP-Seq the same cross-linking, isolation, immunoprecipitation, and DNA purification steps are carried out. However, instead of hybridizing the resulting DNA fragments to a DNA array, the last step involves adding adaptors and sequencing the individual DNA fragments in parallel. When compared to microarrays, ChIP-Seq experiments are less expensive, require fewer hands-on steps, and benefit from the lack of hybridization artifacts that plague microarrays. Further, because ChIP-Seq experiments produce sequence data, they allow researchers to interrogate the entire chromosome. The experimental results are no longer limited to the probes on the microarray. ChIP-Seq data are better at distinguishing similar sites and collecting information about point mutations that may give insights into gene expression. No wonder ChIP-Seq is growing in popularity.

FinchLab

To perform a ChIP-Seq experiment, you need to have a Next Gen sequencing instrument. You will also need the ability to run an alignment program and work with the resulting data to get your results. This is easier said than done. Once the alignment program runs, you might also have to run additional programs and scripts to translate raw output files into meaningful information. The FinchLab ChIP-Seq pipeline, for example, runs Maq to generate the initial output, then runs Maq pileup to convert the data to a pileup file. The pileup file is then read by a script to create the HTML report, thumbnail images to see what is happening, and "wig" files that can be viewed in the UCSC Genome Browser. If you do this yourself, you have to learn the nuances of the alignment program, how to run it different ways to create the data sets, and write the scripts to create the HTML reports, graphs, and wig files.
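
For a rough idea of what the do-it-yourself route involves, the sketch below chains the kind of commands the pipeline automates. The exact Maq subcommands and options are written from memory and should be checked against the Maq documentation; the point is simply how many manual steps stand between raw reads and a browsable result.

# Rough sketch of a manual align-and-pileup chain (check Maq docs for exact usage).
import subprocess

def run(cmd, stdout=None):
    print("running:", " ".join(cmd))
    subprocess.run(cmd, stdout=stdout, check=True)

run(["maq", "fasta2bfa", "reference.fasta", "reference.bfa"])
run(["maq", "fastq2bfq", "reads.fastq", "reads.bfq"])
run(["maq", "map", "aln.map", "reference.bfa", "reads.bfq"])
with open("aln.pileup", "w") as out:
    run(["maq", "pileup", "reference.bfa", "aln.map"], stdout=out)
# downstream scripts would then turn aln.pileup into HTML summaries and
# "wig" tracks for the UCSC Genome Browser, as described above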

With FinchLab, you can skip those steps. You get the same results by clicking a few links to sort the data, and a few more to select the files, run the pipeline, and view the summarized results. You can also click a single link to send the data to the UCSC genome browser for further exploration.


Reference

1. ChIP-seq: welcome to the new frontier. Nature Methods 4, 613-614 (2007).

Wednesday, June 25, 2008

Finch 3: Getting Information Out of Your Data

Geospiza's tag line "From Sample to Results" represents the importance of capturing information from all steps in the laboratory process. Data volumes are important and lots of time is being spent discussing the overwhelming volumes of data produced by new data collection technologies like Next Gen sequencers. However, the real issue is not how you are going to store the data, rather it is what are you going to do with it? What do your data mean in the context of your experiment?

The Geospiza FinchLab software system supports the entire laboratory and data analysis workflow to convert sample information into results. What this means is that the system provides a complete set of web-based interfaces and an underlying database to enter information about samples and experiments, track sample preparation steps in the laboratory, link the resulting data back to samples, and process the data to get biological information. Previous posts have focused on information entry, laboratory workflows, and data linking. This post will focus on how data are processed to get biological information.

The ultra-high data output of Next Gen sequencers allows us to use DNA sequencing to ask many new kinds of questions about structural and nucleotide variation and to measure several indicators of expression and transcription control on a genome-wide scale. The data produced consist of images, signal intensity data, quality information, and DNA sequences with quality values. For each data collection run, the total collection of data and files can be enormous and can require significant computing resources. While all of the data have to be dealt with in some fashion, some of the data have long-term value while other data are only needed in the short term. The final scientific results will often be produced by comparing data sets created from the DNA sequences and their alignments to reference data.

Next Gen data are processed in three phases.

Next Gen data workflows involve three distinct phases of work: 1) data are collected from control and experimental samples; 2) sequence data obtained from each sample are aligned to reference sequence data, or reference data sets, to produce aligned data sets; and 3) summaries of the alignment information from the aligned data sets are compared to produce scientific understanding. Each phase has a discrete analytical process, and we, and others, call these phases primary data analysis, secondary data analysis, and tertiary data analysis.

Primary data analysis involves converting image data to sequence data. The sequence data can be in familiar "ACTG" sequence space or less familiar color space (SOLiD) or flow space (454). Primary data analysis is commonly performed by software provided by the data collection instrument vendor and it is the first place where quality assessment about a sequencing run takes place.

Secondary data analysis creates the data sets that will be further used to develop scientific information. This step involves aligning the sequences from the primary data analyses to reference data. Reference data can be complete genomes, subsets of genomic data like expressed genes, or individual chromosomes. Reference data are chosen in an application specific manner and sometimes multiple reference data sets will be used in an iterative fashion.

Secondary data analysis has two objectives. The first is to determine the quality of the DNA library that was sequenced, from a biological and sample perspective. The primary data analysis supplies quality measurements that can be used to determine if the instrument ran properly, or whether the density of beads or clusters was at its optimum to deliver the highest number of high quality reads. However, those data do not tell you about the quality of the samples. Answering questions about sample quality means asking: did the DNA library contain systematic artifacts such as sequence bias? Were there high numbers of ligated adaptors, incomplete restriction enzyme digests, or any other factors that would interfere with interpreting the data? These kinds of questions are addressed in the secondary data analysis by aligning your reads to the reference data and seeing that your data make sense.

The second objective of secondary data analysis is to prepare the data sets for tertiary analysis, where they will be compared in an experimental fashion. This step involves further manipulation of alignments, typically expressed in very large, hard-to-read, algorithm-specific tables, to produce data tables that can be consumed by additional software. Speaking of algorithms, there is a large and growing list to choose from. Some are general purpose and others are specific to particular applications; we'll comment more on that later.

Tertiary data analysis represents the third phase of the Next Gen workflow. This phase may involve a simple activity like viewing a data set in a tool like a genome browser so that the frequency of tags can be used to identify promoter sites, patterns of variation, or structural differences. In other experiments, like digital gene expression, tertiary analysis can involve comparing different data sets in a similar fashion to microarray experiments. These kinds of analyses are the most complex; expression measurements need to be normalized between data sets and statistical comparisons need to be made to assess differences.
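
As a toy example of the normalization step, the counts from two samples can be scaled to a common sequencing depth (counts per million) before expression ratios are compared. Real tertiary analyses add proper statistical tests and replicate handling on top of this; the gene names and counts below are invented.

# Toy counts-per-million normalization and expression ratio (invented data).
control = {"geneA": 150, "geneB": 30, "geneC": 900}
treated = {"geneA": 300, "geneB": 10, "geneC": 890}

def counts_per_million(counts):
    total = sum(counts.values())
    return {gene: 1_000_000 * n / total for gene, n in counts.items()}

cpm_control, cpm_treated = counts_per_million(control), counts_per_million(treated)
for gene in control:
    ratio = (cpm_treated[gene] + 1) / (cpm_control[gene] + 1)  # +1 avoids divide-by-zero
    print(gene, round(ratio, 2))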

To summarize, the goal of primary and secondary analysis is to produce well-characterized data sets that can be further compared to obtain scientific results. Well-characterized means that the quality is good for both the run and the samples and that any biologically relevant artifacts are identified, limited, and understood. The workflows for these analyses involve many steps, multiple scientific algorithms, and numerous file formats. The choices of algorithms, data files, data file formats, and overall number of steps depend on the kinds of experiments and assays being performed. Despite this complexity, there are standard ways to work with Next Gen systems to understand what you have before progressing through each phase.

The Geospiza FinchLab system focuses on helping you with both primary and secondary data analysis.

Monday, May 26, 2008

Finch 3: Managing Workflows

Genetic analysis workflows begin with RNA or DNA samples and end with results. In between, multiple lab procedures and steps are used to transform materials, move samples between containers, and collect the data. Each kind of data collected and each data collection platform requires that different laboratory procedures are followed. When we analyze the procedures, we can identify common elements. A large number of unique workflows can be created by assembling these elements in different ways.

In the last post, we learned about the FinchLab order form builder and some of its features for developing different kinds of interfaces for entering sample information. Three factors contribute to the power of Finch orders. First, labs can create unique entry forms by selecting items like pull down menus, check boxes, radio buttons, and text entry fields for numbers or text, from a web page. No programming is needed. Second, for core labs with business needs, the form fields can be linked to diverse price lists. Third, the subject of this post, is that the forms are also linked to different kinds of workflows.

What are Workflows?

A workflow is a series of steps that must be performed to complete a task. In genetic analysis, there are two kinds of workflows: those that involve laboratory work, and those that involve data processing and analysis. The laboratory workflows prepare sample materials so that data can be collected. For example, in gene expression studies, RNA is extracted from a source material (cells, tissue, bacteria) and converted to cDNA for sequencing. The workflow steps may involve purification, quality analysis on agarose gels, concentration measurements, and reactions where materials are further prepared for additional steps.

The data workflows encompass all the steps involved in tracking, processing, managing, and analyzing data. Sequence data are processed by programs to create assemblies and alignments that are edited or interrogated to create genomic sequences, discover variation, understand gene expression, or perform other activities. Other kinds of data workflows, such as microarray analysis or genotyping, involve developing and comparing data sets to gain insights. Data workflows involve file manipulations, program control, and databases. The challenge for the scientist today, and the focus of Geospiza's software development, is to bring the laboratory and data workflows together.

Workflow Systems

Workflows can be managed or unmanaged. Whether you work at the bench or work with files and software, you use a workflow any time you carry out a procedure with more than one step. Perhaps you write the steps in your notebook, check them off as you go, and tape in additional data like spectrophotometer readings or photos. Perhaps you write papers in Word and format the bibliography with Endnote, or resize photos with Photoshop before adding them to a blog post. In all these cases you performed unmanaged workflows.

Managing and tracking workflows becomes important as the number of activities and number of individuals performing them increase in scale. Imagine your lab bench procedures performed multiple times a day with different individuals operating particular steps. This scenario occurs in core labs that perform the same set of processes over and over again. You can still track steps on paper, but it's not long before the system becomes difficult to manage. It takes too much time to write and compile all of the notes, and it's hard to know which materials have reached which step. Once a system goes beyond the work of a single person, paper notes quit providing the right kinds of overviews. You now need to manage your workflows and track them with a software system.

A good workflow system allows you to define the steps in your protocols. It will provide interfaces to move samples through the steps and also provide ways to add information to the system as steps are completed. If the system is well-designed, it will not allow you to do things at inappropriate times or require too much "thinking" as the system is operated. A well-designed system will also reduce complexity and allow you to build workflows through software interfaces. Good systems give scientists the ability to manage their work; they do not require their users to learn arcane programming tools or resort to custom programming. Finally, the system will be flexible enough to let you create as many workflows as you need for different kinds of experiments and link those workflows to data entry forms so that the right kind of information is available to the right process.
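
As a rough illustration of the "define the steps, then move samples through them" idea, here is a minimal Python sketch. The class and method names are hypothetical and are not FinchLab's API; the point is simply that a managed workflow refuses moves the protocol does not define.

    # A sample can only advance to the next step defined for its workflow;
    # anything else is rejected, which is what keeps the process trackable.
    class Workflow:
        def __init__(self, name, steps):
            self.name = name
            self.steps = list(steps)   # ordered lab steps

        def advance(self, sample, current_step):
            i = self.steps.index(current_step)   # raises ValueError for unknown steps
            if i + 1 >= len(self.steps):
                raise ValueError(f"{sample} has already completed '{self.name}'")
            return self.steps[i + 1]

    rna_prep = Workflow("RNA to cDNA", ["Received", "RNA extracted", "Quality checked", "cDNA made"])
    print(rna_prep.advance("Sample-001", "RNA extracted"))   # Quality checked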

FinchLab Workflows

The Geospiza FinchLab workflow system meets the above requirements. The system has a high-level workflow that understands that some processes require little tracking (a quick test) and others require more significant tracking ("I want to store and reuse DNA samples"). More detailed processes are assigned workflows that consist of three parts: a name, a "State," and a set of "Statuses." The "State" controls the software interfaces and determines which information is presented and accessed at different parts of a process. A sequencing or genotyping reaction, for example, cannot be added to a data collection instrument until it is "ready." The Statuses specify the steps of the process; they are defined by the lab and added to a workflow using the web interfaces. When a workflow is created, it is given a name and as many steps as needed, and it is assigned a State. The workflows are then assigned to different kinds of items so that the system always knows what to do next with the samples that enter.
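
The name / State / Status split can be pictured with a small, purely illustrative Python sketch; the field names below are invented and do not reflect FinchLab's internal data model.

    # The State gates what the software lets you do; the Statuses are the
    # lab-defined steps a reaction moves through.
    workflow = {
        "name": "Sanger cycle sequencing",
        "state": "In preparation",   # controls which interfaces are available
        "statuses": ["DNA received", "Reaction set up", "Cleaned up", "Ready"],
    }

    def can_add_to_instrument(item_status):
        # A reaction cannot join a data collection run until it reaches the
        # final "Ready" status defined for this workflow.
        return item_status == workflow["statuses"][-1]

    print(can_add_to_instrument("Reaction set up"))   # False
    print(can_add_to_instrument("Ready"))             # True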

A workflow management system like FinchLab makes it just as easy to track the steps of Sanger DNA sequencing as it is to track the steps of Solexa, SOLiD, or 454 sequencing processes. You can also, in the same system, run genotyping assays and other kinds of genetic analysis like microarrays and bead assays.


Next time, we'll talk about what happens in the lab.

Tuesday, May 20, 2008

Finch 3: Defining the Experimental Information

In today's genetic analysis laboratory, multiple instruments are used to collect a variety of data ranging from DNA sequences to individual values that measure DNA (or RNA) hybridization, nucleotide incorporations, or other binding events. Next Gen sequencing adds to this complexity and offers additional challenges with the amount of data that can be produced for a given experiment.

In the last post, I defined basic requirements for a complete laboratory and data management system in the context of setting up a Next Gen sequencing lab. To review, I stated that laboratory workflow systems need to perform the following basic functions:
  1. Allow you to set up different interfaces to collect experimental information
  2. Assign specific workflows to experiments
  3. Track the workflow steps in the laboratory
  4. Prepare samples for data collection runs
  5. Link data from the runs back to the original samples
  6. Process data according to the needs of the experiment
I also added that if you operate a core lab, you'll want to bill for your services and get paid.

In this post I'm going to focus on the first step, collecting experimental information. For this exercise let's say we work in a lab that has:
  • One Illumina Solexa Genome Analyzer
  • One Applied Biosystems SOLiD System
  • One Illumina Bead Array station
  • Two Applied Biosystems 3730 Genetic Analyzers, used for both sequencing and fragment analysis

This image shows our laboratory home page. We run our lab as a service lab. For each data collection platform we need to collect different kinds of sample information. One kind of information is the sample container. Our customers' samples will be sent to the lab in many different kinds of containers depending on the kind of experiment. Next Gen sequencing platforms like SOLiD, Solexa, and 454 are low throughput with respect to sample preparation, so samples will be sent to us in tubes. Instruments like the Bead Array and the 3730 DNA sequencer usually involve sets of samples in 96- or 384-well plates. In some cases, samples start in tubes and end up in plates, so you'll need to determine which procedures use tubes, which use plates, and how the samples will enter the lab.

Once the samples have reached the lab and been checked, you are also going to do different things to the samples in order to prepare them for the different data collection platforms. You'll want to know which samples should go to which platforms and have the workflows for different processes defined so that they are easy to follow and track. You might even want to track and reuse certain custom reagents like DNA primers, probes, and reagent kits. In some cases you'll want to know physical information, like sample type (DNA or RNA) or concentration, up front. In other cases you'll determine this information later.

Finally, let's say you work at an institution that focuses on a specific area of research, like cancer, mouse genetics, or plant research. In these settings you might also want to track information about the sample source. Such information could include species, strain, tissue, treatment, or many other kinds of things. If you want to explore this information later, you'll probably want to define a vocabulary that can be "read" by computer programs. To ensure that the vocabulary can be followed, interfaces will be needed to enter this information without typing, or else you'll have a problem like pseudomonas, psuedomonas, or psudomonas.
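
A tiny, hypothetical Python sketch shows why the controlled vocabulary matters: when entries are validated against a fixed list, the misspelled variants simply cannot get into the database. The vocabulary below is a made-up example, not a recommended list.

    SPECIES_VOCAB = {"Pseudomonas aeruginosa", "Mus musculus", "Homo sapiens"}

    def validate_species(entry):
        # Reject anything that is not in the controlled vocabulary.
        if entry not in SPECIES_VOCAB:
            raise ValueError(f"'{entry}' is not in the controlled vocabulary")
        return entry

    validate_species("Mus musculus")    # accepted
    # validate_species("psuedomonas")   # raises ValueError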

Information systems that support the above scenarios have to deal with a lot of "sometimes this" and "sometimes that" kinds of information. If one path is taken, say Sanger sequencing on a 3730, different sample information and physical configurations are needed than for Next Gen sequencing. Next Gen platforms have different sample requirements too. SOLiD and 454 require emulsion PCR to prepare sequencing samples, whereas Solexa amplifies DNA molecules in clusters on slides. Additionally, the information entry system also has to deal with "I care" and "I don't care" kinds of data, like information about sample sources or experimental conditions. These kinds of information are needed later to understand the data in the context of the experiment, but do not have much impact on the data collection processes.

How would you create a system to support these diverse and changing requirements?

One way to do this would be to build a form with many fields and rules for filling it out. You know those kinds of forms. They say things like "ignore this section if you've filled out this other section." That would be a bad way to do this, because no one would really get things right, and the people tasked with doing the work would spend a lot of time either asking questions about what they are supposed to be doing with the samples or answering questions about how to fill out the form.

Another way would be to tell people that their work is too complex and they need custom solutions for everything they do. That's expensive.

A better way to do this would be to build a system for creating forms. In this system, different forms are created by the people who develop the different services. The forms are linked to workflows (lab procedures) that can understand sample configurations (plates, tubes, premixed ingredients, and required information). If the system is really good, you can easily create new forms and add fields to them to collect physical information (sample type, concentration) or experimental information (tissue, species, strain, treatment, your mother's maiden name, favorite vacation spot, ...) without having to develop requirements with programmers and have them build forms. If your system is exceptionally good, smart, and clever, it will let you create different kinds of forms and fields and prevent you from doing things that are in direct conflict with one another. If your system is modern, it will be 100% web-based and have cool web 2.0 features like automated fill downs, column highlighting, and multi-selection devices so that entering data is easy, intuitive, and even a bit fun.

FinchLab, built on the Finch 3 platform, is such a system.
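
One way to picture the form-builder idea is that a form becomes configuration data rather than code. The sketch below is hypothetical Python, not FinchLab's implementation; the field names, types, and workflow name are invented.

    # A form defined as data: lab staff assemble fields through the web
    # interface, and the definition links the form to a lab workflow.
    solid_library_form = {
        "name": "SOLiD library submission",
        "workflow": "SOLiD fragment library prep",
        "fields": [
            {"label": "Sample name", "type": "text", "required": True},
            {"label": "Sample type", "type": "pulldown", "choices": ["DNA", "RNA"]},
            {"label": "Concentration (ng/ul)", "type": "number", "required": True},
            {"label": "Tissue", "type": "pulldown", "choices": ["liver", "brain", "tumor"]},
        ],
    }

    for field in solid_library_form["fields"]:
        print(field["label"], "-", field["type"])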

Wednesday, April 16, 2008

Expectations Set the Rules

Genetic analysis workflows are complex. Biology is non-deterministic, so we continually experience new problems. Lab processes and our data have natural uncertainty. These factors conspire against us to make our world rich in variability and processes less than perfect.

That keeps things interesting.

In a previous post, I was able to show how sequence quality values could be used to summarize the data for a large resequencing assay. Presenting "per read" quality values in a grid format allowed us to visualize samples that had failed as well as observe that some amplicons contained repeats that led to sequencing artifacts. We also were able to identify potential sample tracking issues and left off with an assignment to think about how we might further test sample tracking in the assay.

When an assay is developed there are often certain results that can be expected. Some results are defined explicitly with positive and negative controls. We can also use assay results to test that the assay is producing the right kinds of information. Do the data make sense? Expectations can be derived from the literature, an understanding of statistical outcomes, or internal measures.

Genetic assays have common parts

A typical genetic resequencing assay is developed from known information. The goal is to collect sequences from a defined region of DNA for a population of individuals (samples) and use the resulting data to observe the frequency of known differences (variants) and identify new patterns of variation. Each assay has three common parts:

Gene Model - Resequencing and genotyping projects involve comparative analysis of new data (sequences, genotypes) to reference data. The Gene Model can be a chromosomal region or specific gene. A well-developed model will include all known genotypes, protein variations, and phenotypes. The Gene Model represents both community (global) and laboratory (local) knowledge.

Assay Design - The Assay Design defines the regions in the Gene Model that will be analyzed. These regions, typically prepared by PCR, are bounded by unique DNA primer sequences. The PCR primers have two parts: one part is complementary to the reference sequence (black in the figure), the other part is "universal" and is complementary to a sequencing primer (red in the figure). The study includes information about patient samples such as their ethnicity, collection origin, and phenotypes associated with the gene(s) under study.

Experiments / Data Collection / Analysis - Once the study is designed and materials arrive, samples are prepared for analysis. PCR is used to amplify specific regions for sequencing or genotyping. After a scientist is confident that materials will yield results, data collection begins. Data can be collected in the lab or the lab can outsource their sequencing to core labs or service companies. When data are obtained, they are processed, validated, and compared to reference data.

Setting expectations

A major challenge for scientists doing resequencing and genotyping projects arises when trying to evaluate data quality and determine the “next steps.” Rarely does everything work. We've already talked about read quality, but there are also the questions of whether the data are mapping to their expected locations, and whether the frequencies of observed variation are expected. The Assay Design can be used to verify experimental data.

The Assay Design tells us where the data should align and how much variation can be expected. For example, if the average SNP frequency is 1/1300 bases, and an average amplicon length is 580 bases, we should expect to observe one SNP for every two amplicons. Furthermore, in reads where a SNP may be observed, we will see the difference in a subset of the data because some, or most, of the reads will have the same allele as the reference sequence.
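
The arithmetic behind that expectation is easy to check; the short Python sketch below just restates the numbers given above.

    # With one SNP per ~1300 bases and 580-base amplicons, a SNP should turn
    # up a little less than half the time per amplicon.
    snp_rate = 1 / 1300          # SNPs per base
    amplicon_length = 580        # bases

    snps_per_amplicon = snp_rate * amplicon_length
    print(round(snps_per_amplicon, 2))        # ~0.45
    print(round(2 * snps_per_amplicon, 2))    # ~0.89, i.e., about one SNP per two amplicons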

To test our expectations for the assay, the 7488 read data set is summarized in a way that counts the frequency of disagreements between read data and their reference sequence. The graph below shows a composite of read discrepancies (blue bar graph) and average Q20/rL, Q30/rL, Q40/rL values (colored line graphs). Reads are grouped according to the number of discrepancies observed (x-axis). For each group, the count of reads (bar height) and average Q20/rL (green triangles), Q30/rL (yellow squares), and Q40/rL (purple circles) are displayed.

In the 7488 read data set, 95% (6914) of the reads gave alignments. Of the aligned data, 82% of the reads had between 0 and 4 discrepancies. If we were to pick which traces to review and which samples to redo, we would likely focus our review on the data in this group and queue the rest (18%) for redos to see if we could improve the data quality.

Per our previous prediction, most of the data (5692 reads) do not have any discrepancies. We also observe that the number of discrepancies increases as the overall data quality decreases. This is expected because the quality values reflect the uncertainty (error) in the data.
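
A summary like the one plotted above can be built from simple per-read records. The sketch below is illustrative Python; the input structure (a discrepancy count and a Q20/read-length ratio per read) is an assumption for the example, not a FinchLab file format.

    from collections import defaultdict

    def summarize(reads):
        # Group reads by their discrepancy count, then report the group size
        # and its average Q20/rL, mirroring the bar and line plots described above.
        groups = defaultdict(list)
        for r in reads:
            groups[r["discrepancies"]].append(r["q20_over_rl"])
        for n in sorted(groups):
            values = groups[n]
            print(f"{n} discrepancies: {len(values)} reads, "
                  f"mean Q20/rL = {sum(values) / len(values):.2f}")

    summarize([
        {"discrepancies": 0, "q20_over_rl": 0.92},
        {"discrepancies": 0, "q20_over_rl": 0.88},
        {"discrepancies": 3, "q20_over_rl": 0.55},
    ])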

Spotting tracking issues

We can also use our expectations to identify sample tracking issues. Once an assay is defined, the positions of all of the PCR primers are known, hence we should expect that our sequence data will align to the reference sequence in known positions. In our data set, this is mostly true. Similar to the previous quality plots of samples and amplicons, an alignment "quality" can be defined and displayed in a table where the rows are samples and columns are amplicons. Each sample has two rows (one forward and one reverse sequence). If the cells are colored according to alignment start positions (green for expected, red for unexpected, white for no alignment) we can easily spot which reads have an "unexpected" alignment. The question then becomes, where/when did the mixup occur?
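
The coloring rule itself is simple to express; here is a hedged Python sketch with invented coordinates and a tolerance chosen only for illustration.

    def alignment_color(observed_start, expected_start, tolerance=5):
        # Compare where a read actually aligned to where the assay design
        # says it should align.
        if observed_start is None:
            return "white"    # no alignment
        if abs(observed_start - expected_start) <= tolerance:
            return "green"    # aligned at the expected position
        return "red"          # unexpected position: a possible tracking mixup

    print(alignment_color(1021, 1020))   # green
    print(alignment_color(4877, 1020))   # red
    print(alignment_color(None, 1020))   # white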

From these kinds of analyses we can get a feel for whether a project is on track and whether there are major issues that will make our lives harder. In future posts I will comment on other kinds of measures that can be made and show you how this work can be automated in FinchLab.

Tuesday, April 8, 2008

Exceptions are the Rule

Genetic analysis workflows are complex. You can expect that things will go wrong in the laboratory. Biology also manages to interfere and make things harder than you think they should be. Your workflow management system needs to show the relevant data, allow you to observe trends, and have flexible points where procedures can be repeated.

In the last few posts, I introduced genetic analysis workflows, concepts about lab and data workflows, and discussed why it is important to link the lab and data workflows. In this post I expand on the theme and show how a workflow system like the Geospiza FinchLab can be used to troubleshoot laboratory processes.

First, I'll review our figure from last week. Recall that it summarized 4608 paired forward / reverse sequence reads. Samples are represented by rows, and amplicons by columns, so that each cell represents a single read from a sample and one of its amplicons. Color is used to indicate quality, with different colors showing the number of Q20 bases divided by the read length (Q20/rL). Green is used for values between 0.60 and 1.00, blue for values between 0.30 and 0.59, and red for values less than 0.29. The summary showed patterns that indicated lab failures and biological issues. You were asked to figure them out. Eric from seqanswers (a cool site for Next Gen info) took a stab at this, and got part of the puzzle solved.

Sample issues

Rows 1,2 and 7,8 show failed samples. We can spot this because of the red color across all the amplicons. Either the DNA preps failed to produce DNA, or something interfered with the PCR. Of course there are those pesky working reactions for both forward and reverse sequence in sample 1 column 8. My first impression is that there is a tracking issue. The sixth column also has one reaction that worked. Could this result indicate a more serious problem in sample tracking?


Amplicon issues

In addition to the red rows, some columns show lots of blue spots; these columns correspond to amplicons 7, 24, and 27. Remember that blue is an intermediate quality. An intermediate quality could be obtained if part of the sequence is good and part of the sequence is bad. Because the columns represent amplicons, when we see a pattern in a column it likely indicates a systematic issue for that amplicon. For example, in column 7, all of the data are intermediate quality. Columns 24 and 27 are more interesting because the striping pattern indicates that one sequencing reaction results in data with intermediate quality while the other looks good. Wouldn't it be great if we could drill down from this pattern and see a table of quality plots and also get to the sequence traces?


Getting to the bottom

In FinchLab we can drill down and view the underlying data. The figure below summarizes the data for amplicon 24. The panel on the left is the expanded heat map for the data set. The panel on the right is a folder report summarizing the data from 192 reads for amplicon 24. It contains three parts: an information table that provides an overview of the details for the reads; a histogram plot that counts how many reads have a certain range of Q20 values; and a data table that summarizes each read in a row containing its name, the number of edit revisions, its Q20 and Q20/rLen values, and a thumbnail quality plot showing the quality values for each base in the read. In the histogram, you can see that two distinct peaks are observed. About half the data have low Q20 values and half have high Q20 values, producing the striping pattern in the heat map. The data table shows two reads; one is the forward sequence and the other is its "reverse" pair. These data were brought together using the table's search function in the "finder" bar. Note how the reads could fit together if one picture was reversed.

Could something in the sequence be interfering with the sequencing reaction?

To explore the data further, we need to look at the sequences themselves. We can do this by clicking the name and viewing the trace data online in our web browser, or we could click the FinchTV icon and view the sequence in FinchTV (bottom panel of the figure above). When we do this for the top read (leftmost trace) we see that, sure enough, there is a polyT tract that we are not getting through. During PCR such regions can cause "drop outs" and result in mixtures of molecules that differ in size by one or two bases. A hallmark of such a problem is a sudden drop in data quality at the end of the polynucleotide tract because the mixture of molecules creates a mess of mixed bases. This explanation is confirmed by the other read. When we view it in FinchTV (rightmost trace) we see poor data at the end of the read. Remember, these data are reversed relative to the first read, so when we reverse complement the trace (middle trace), we see that it "fits" together with the first read. A problem for such amplicons is that we now have only single-stranded coverage. Since this problem occurred at the end of the read, half of the data are good and the other half are poor quality. If the problem had occurred in the middle of the read, all of the data would show an intermediate quality like amplicon 7.
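
Both checks described here, spotting the homopolymer run and flipping one read so the pair can be compared in the same orientation, can be sketched in a few lines of Python. The sequences are short invented examples, not data from this project.

    import re

    def longest_homopolymer(seq):
        # Find the longest single-base run in a read.
        runs = re.findall(r"A+|C+|G+|T+", seq)
        return max(runs, key=len) if runs else ""

    def reverse_complement(seq):
        return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

    forward = "ACGTTTTTTTTGGC"
    reverse_read = reverse_complement(forward)          # what the opposite strand's read looks like

    print(longest_homopolymer(forward))                 # TTTTTTTT
    print(reverse_complement(reverse_read) == forward)  # True: the two reads "fit" together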

In genetic analysis data quality values are an important tool for assessing many lab and sample parameters. In this example, we were able to see systematic sample failures and sequence characteristics that can lead to intermediate quality data. We can use this information to learn about biological issues that interfere with analysis. But what about our potential tracking issue?

How might we determine if our samples are being properly tracked?

Friday, April 4, 2008

Lab work without data analysis and management is doo doo

As we begin to contemplate next generation sequence data management, we can use Sanger sequencing to teach us important lessons. One of these lessons is the value of linking laboratory and data workflows so that information can be viewed in the context of our assays and experiments.

I have been fortunate to hear J. Michael Bishop speak on a couple of occasions. He ended these talks by quoting one of his biochemistry mentors, "genetics without biochemistry is doo doo." In a similar vein, lab work without data analysis and management is doo doo. That is, when you separate the lab from the data analysis, you have to work through a lot of doo to figure things out. Without a systematic way to view summaries of large data sets, the doo is overwhelming.

To illustrate, I am going to share some details about a resequencing project we collaborated on. We came to this project late, so much of the data had been collected, and there were problems, lots of doo. Using Finch, however, we could quickly organize and analyze the data, and present information in summaries with drill downs to the details to help troubleshoot and explain observations that were seen in the lab.

10,686 sequence reads: forward / reverse sequences from 39 amplicons from 137 individuals

The question being asked in this project was: are there new variants in a gene that are related to phenotypes observed in a specialized population? This is the kind of question medical researchers ask frequently. Typically they have a unique collection of samples that come from a well understood population of individuals. Resequencing is used to interrogate the samples for rare variants, or genotypes.

In this process, we purify DNA from sample material (blood), and use PCR with exon-specific primers to amplify small regions of DNA within the gene. The PCR primers have regions called universal adaptors. Our sequencing primers will bind to those regions. Each PCR product, called an amplicon, is sequenced twice, once from each strand, to give double coverage of the bases.

When we do the math, we will have to track the DNA for 137 samples and 5343 amplicons. Each amplicon is sequenced at a minimum twice, to give us 10,686 reads. From a physical materials point of view, that means 137 tubes with sample, 56 96-well plates for PCR, and 112 96-well plates for sequencing. In a 384-well format we could have used 14 plates for PCR and 28 plates for sequencing. For a genome center this level of work is trivial, but for a small lab this is significant work and things can happen. Indeed, since not all the work is done in a single lab, the process can be even more complex. And you need to think about how you would lay this out - 96 does not divide by 39 very well.
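
The plate arithmetic is easy to reproduce; the Python sketch below simply restates the assumptions given above (137 samples, 39 amplicons each, two sequencing reads per amplicon).

    import math

    samples, amplicons_per_sample, reads_per_amplicon = 137, 39, 2

    amplicons = samples * amplicons_per_sample    # 5343
    reads = amplicons * reads_per_amplicon        # 10686

    for wells in (96, 384):
        pcr_plates = math.ceil(amplicons / wells)
        seq_plates = math.ceil(reads / wells)
        print(f"{wells}-well format: {pcr_plates} PCR plates, {seq_plates} sequencing plates")
    # 96-well format: 56 PCR plates, 112 sequencing plates
    # 384-well format: 14 PCR plates, 28 sequencing plates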

From a data perspective, we can use sequence quality values to identify potential laboratory and biological issues. The figure below summarizes 4608 reads. Each pair of rows is one sample (forward / reverse sequence pairs, alternating gray and white - 48 total). Each column is an amplicon. Each cell in the table represents a single read from an amplicon and sample. Color is used to indicate quality. In this analysis, quality is defined as the ratio of Q20 to read length (Q20/rL), which works very well for PCR amplicons. The better the data, the closer this ratio is to one. In the table below, green indicates Q20/rL values between 0.60 and 1.00, blue indicates values between 0.30 and 0.59, and red indicates Q20/rL values less than 0.29. The summary shows patterns that, as we will learn next week, reveal lab failures and biological issues. See if you can figure them out.
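
The color coding can be written down directly. The sketch below is illustrative Python; the example Q20 counts and read lengths are invented, and the boundary handling is an approximation of the ranges listed above.

    def quality_color(q20, read_length):
        ratio = q20 / read_length
        if ratio >= 0.60:
            return "green"   # good data
        if ratio >= 0.30:
            return "blue"    # intermediate quality
        return "red"         # failed or very poor read

    print(quality_color(520, 610))   # green
    print(quality_color(250, 610))   # blue
    print(quality_color(40, 610))    # red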

Wednesday, April 2, 2008

Working with Workflows

Genetic analysis workflows involve both complex laboratory and data analysis and manipulation procedures. A good workflow management system not only tracks processes, but simplifies the work.

In my last post, I introduced the concept of workflows in describing the issues one needs to think about as they prepare their lab for Next Gen sequencing. To better understand these challenges, we can learn from previous experience with Sanger sequencing in particular and genetic assays in general.

As we know, DNA sequencing serves many purposes. New genomes and genes in the environment are characterized and identified by de novo sequencing. Gene expression can be assessed by measuring Expressed Sequence Tags (ESTs), and DNA variation and structure can be investigated by resequencing regions of known genomes. We also know that gene expression and genetic variation can be studied with multiple technologies such as hybridization, fragment analysis, and direct genotyping, and it is desirable to use multiple methods to confirm results. Within each of these general applications and technology platforms, specific laboratory and bioinformatics workflows are used to prepare samples, determine data quality, study biology, and predict biological outcomes.

The process begins in the laboratory.

Recently I came across a Wikipedia article on DNA sequencing that had a simple diagram showing the flow of materials from samples to data. I liked this diagram, so I reproduced it, with modifications. We begin with the sample. A sample is a general term that describes a biological material. Sometimes, like when you are at the doctor, these are called specimens. Since biology is all around and in us, samples come from anything that we can extract DNA or RNA from. Blood, organ tissue, hair, leaves, bananas, oysters, cultured cells, feces, you-can-imagine-what-else, can all be samples for genetic analysis. I know a guy who uses a .22 to collect the apical meristems from trees to study poplar genetics. Samples come from anywhere.

With our samples in hand, we can perform genetic analyses. What we do next depends on what we want to learn. If we want to sequence a genome we're going to prepare a DNA library by randomly shearing the genomic DNA and cloning the fragments into sequencing vectors. The purified cloned DNA templates are sequenced and the data we obtain are assembled into larger sequences (contigs) until, hopefully, we have a complete genome. In resequencing and other genetic assays, DNA templates are prepared from sample DNA by amplifying specific regions of a genome with PCR. The PCR products, amplicons, are sequenced and the resulting data are compared to a reference sequence to identify differences. Gene expression (EST and hybridization) analysis follows similar patterns except that RNA is purified from samples and then converted to cDNA using RT-PCR (Reverse Transcriptase PCR, not Real Time PCR - that's a genetic assay).

From a workflow point of view, we can see how the physical materials change throughout the process. Sample material is converted to DNA or RNA (nucleic acids), and the nucleic acids are further manipulated to create templates that are used for the analytical reaction (DNA sequencing, fragment analysis, RealTime-PCR, ...). As the materials flow through the lab, they're manipulated in a variety of containers. A process may begin with a sample in a tube, use a petri plate to isolate bacterial colonies, 96-well plates to purify DNA and perform reactions, and 384-well plates to collect sequence data. The movement of the materials must be tracked, along with their hierarchical relationships. A sample may have many templates that are analyzed, or a template may have multiple analyses. When we do this a lot we need a way to see where our samples are in their particular processes. We need a workflow management system, like FinchLab.
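
The hierarchy a system like this has to capture can be sketched as nested records; the Python below is purely illustrative, with invented names, and is not FinchLab's data model.

    # One sample can yield several templates, and each template can have
    # several reactions, each tied to its current container and well.
    sample = {
        "name": "Oyster-07",
        "templates": [
            {
                "name": "Oyster-07.amplicon-12",
                "container": {"type": "96-well plate", "barcode": "PCR-0042", "well": "C05"},
                "reactions": [
                    {"type": "sequencing", "direction": "forward", "status": "data collected"},
                    {"type": "sequencing", "direction": "reverse", "status": "ready"},
                ],
            },
        ],
    }

    # Walking the hierarchy answers "where is my sample, and what has happened to it?"
    for template in sample["templates"]:
        loc = template["container"]
        print(template["name"], "->", loc["barcode"], loc["well"],
              "|", [r["status"] for r in template["reactions"]])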