Tuesday, August 26, 2008

Maq in the Literature

Kudos to Heng Li and the team at the Sanger Center. Today Genome Research published their paper on Maq. Maq (Mapping and Assembly with Quality) is an algorithm, developed at the Sanger Center, for assembling Next Gen reads against a reference sequence. MassGenomics sums up why they like Maq, and we could not agree more. I also agree that Maq is a better name than mapASS.

One of the things we like best is how versatile the program is for Next Gen applications. Whether you are performing Tag Profiling, ChIP-Seq, RNA-Seq (transcriptome analysis), resequencing, or other applications, its output contains a wide variety of useful information, as we will show in coming posts. If you want to know right now, give us a call and we'll show you why Geospiza, Sanger, Washington University, and many others think Maq is a great place to start working with Next Gen data.

Wednesday, August 20, 2008

Next Gen DNA Sequencing Is Not Sequencing DNA

In the old days, we used DNA sequencing primarily to learn about the sequence and structure of a cloned gene. As the technology and throughput improved, DNA sequencing became a tool for investigating entire genomes. Today, with the exception of de novo sequencing, Next Gen sequencing has changed the way we use DNA sequences. We're no longer looking for new DNA sequences. We're using Next Gen technologies to perform quantitative assays with DNA sequences as the data points. This is a different way of thinking about the data and it impacts how we think about our experiments, data analysis, and IT systems.

In de novo sequencing, the DNA sequence of a new genome, or of genes from the environment, is elucidated. De novo sequencing ventures into the unknown. Each new genome brings new challenges with respect to interspersed repeats, large segmental duplications, polyploidy, and interchromosomal variation. The high-redundancy samples obtained from Next Gen technology lower the cost and speed this process because less time is required to collect additional data to fill in gaps and finish the work.

The other ultra high throughput DNA sequencing applications, on the other hand, focus on collecting sequences from DNA or RNA molecules for which we already have genomic data. Generally called "resequencing," these applications involve collecting sequence reads and aligning them to genomic reference data. Experimental information is obtained by tabulating the frequency, positional information, and variation of the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or population are then compared in different ways to make discoveries and draw conclusions.
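The tabulation and comparison steps described above can be sketched in a few lines of Python. The tuple-based alignment records and the treated/control samples are invented for illustration; a real pipeline would parse these positions from an aligner's output files:

```python
from collections import Counter

def position_counts(alignments):
    """Tally how many reads align to each (chromosome, position)."""
    return Counter(alignments)

def compare_samples(treated, control):
    """Report positions whose read counts differ between two samples."""
    diffs = {}
    for pos in set(treated) | set(control):
        t, c = treated.get(pos, 0), control.get(pos, 0)
        if t != c:
            diffs[pos] = (t, c)
    return diffs

# Toy alignment positions for two hypothetical samples.
treated = position_counts([("chr1", 100), ("chr1", 100), ("chr2", 500)])
control = position_counts([("chr1", 100), ("chr2", 500), ("chr2", 500)])
diffs = compare_samples(treated, control)
# each differing position maps to its (treated, control) read counts
```

Real experiments would, of course, apply statistics to such tables rather than raw count differences, but the counting principle is the same.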

DNA sequences are information rich data points

EST (expressed sequence tag) sequencing was one of the first applications to use sequence data in a quantitative way. In EST applications, mRNA from cells was isolated, converted to cDNA, cloned, and sequenced. The data from an EST library provided both new and quantitative information. Because each read came from a single molecule of mRNA, a set of ESTs could be assembled and counted to learn about gene expression. The composition and number of distinct mRNAs from different kinds of tissues could be compared and used to identify genes that were expressed at different time points during development, in different tissues, and in different disease states, such as cancer. The term "tag" was coined to indicate that ESTs could also be used to identify the genomic location of mRNA molecules. Although the information from EST libraries was informative, lower cost methods such as microarray hybridization and real-time PCR assays replaced EST sequencing over time, as more genomic information became available.

Another quantitative use of sequencing has been to assess allele frequency and identify new variants. These assays are commonly known as "resequencing" since they involve sequencing a known region of genomic DNA in a large number of individuals. Since the regions of DNA under investigation are often related to health or disease, the NIH has proposed that these assays be called "Medical Sequencing." The suggested change also serves to avoid giving the public the impression that resequencing is being carried out to correct mistakes.

Unlike many assay systems (hybridization, enzyme activity, protein binding ...) where an event or complex interaction is measured and described by a single data value, a quantitative assay based on DNA sequences yields a greater variety of information. In a technique analogous to using an EST library, an RNA library can be sequenced, and the expression of many genes can be measured at once, by counting the number of reads that align to a given position or reference. If the library is prepared from DNA, a count of the aligned reads could measure the copy number of a gene. The composition of the read data itself can be informative. Mismatches in aligned reads can help discern alleles of a gene, or members of a gene family. In a variation assay, reads can both assess the frequency of a SNP and discover new variation. DNA sequences could be used in quantitative assays to some extent with Sanger sequencing, but the cost and labor requirements prevented widespread adoption.
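As a rough sketch of the variation assay mentioned above, the snippet below estimates allele frequencies at a single reference position from the read bases observed there; the input string is a made-up example, not real data:

```python
from collections import Counter

def allele_frequencies(bases):
    """Estimate allele frequencies at one reference position from the
    bases reported by the aligned reads covering it."""
    counts = Counter(b.upper() for b in bases)
    depth = sum(counts.values())
    return {base: n / depth for base, n in counts.items()}

# Ten reads cover the site: seven match the reference 'A',
# three carry a 'G' variant.
freqs = allele_frequencies("AAAAAAAGGG")  # {'A': 0.7, 'G': 0.3}
```

A production variant caller would also weigh base quality values and alignment quality, but the frequency calculation starts from exactly this kind of per-position tally.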

Next Gen adds a global perspective and new challenges

The power of Next Gen experiments comes from sequencing DNA libraries in a massively parallel fashion. Traditionally, a DNA library was used to clone genes. The library was prepared by isolating and fragmenting genomic DNA, ligating the pieces to a plasmid vector, transforming bacteria with the ligation products, and growing colonies of bacteria on plates with antibiotics. The plasmid vector would allow a transformed bacterial cell to grow in the presence of an antibiotic so that transformed cells could be separated from other cells. The transformed cells would then be screened for the presence of a DNA insert or gene of interest through additional selection, colorimetric assay (e.g. blue / white), or blotting. Over time, these basic procedures were refined and scaled up in factory-style production to enable high throughput shotgun sequencing and EST sequencing. A significant effort and cost in Sanger sequencing came from the work needed to prepare and track large numbers of clones, or PCR products, for data linking and later retrieval to close gaps or confirm results.

In Next Gen sequencing, DNA libraries are prepared, but the DNA is not cloned. Instead, other techniques are used to "separate," amplify, and sequence individual molecules. The molecules are then sequenced all at once, in parallel, to yield large global data sets in which each read represents a sequence from an individual molecule. The frequency of occurrence of a read in the population of reads can now be used to measure the concentration of individual DNA molecules. Sequencing DNA libraries in this fashion significantly lowers costs, and makes previously cost-prohibitive experiments possible. It also changes how we need to think about and perform our experiments.

The first change is that preparing the DNA library is the experiment. Tag profiling, RNA-seq, small RNA, ChIP-seq, DNAse hypersensitivity, methylation, and other assays all have specific ways in which DNA libraries are prepared. Starting materials and fragmentation methods define the experiment and how the resulting datasets will be analyzed and interpreted. The second change is that large numbers of clones no longer need to be prepared, tracked, and stored. This reduces the number of people needed to process samples, and reduces the need for robotics, large numbers of thermocyclers, and other laboratory equipment. Work that used to require a factory setting can now be done in a single laboratory, or a mailroom if you believe the ads.

Attention to detail counts

Even though Next Gen sequencing gives us the technical capability to ask detailed and quantitative questions about gene structure and expression, successful experiments demand that we pay close attention to the details. Obtaining data that are free of confounding artifacts and accurately represent the molecules in a sample demands good technique and a focus on detail. DNA libraries no longer involve cloning, but their preparation does require multiple steps performed over multiple days. During this process, different kinds of data, ranging from gel images to discrete data values, may be collected and used later for troubleshooting. Tracking the experimental details requires a system that can be configured to collect information from any number and kind of process. The system also needs to be able to link data to the samples, and convert the information from millions of sequence data points into tables, graphics, and other representations that match the context of the experiment and give a global view of how things are working. FinchLab is that kind of system.

Friday, August 8, 2008

ChIP-ing Away at Analysis

ChIP-Seq is becoming a popular way to study the interactions between proteins and DNA. This new technology is made possible by Next Gen sequencing techniques and sophisticated tools for data management and analysis. Next Gen DNA sequencing provides the power to collect the large amounts of data required. FinchLab is the software system that is needed to track the lab steps, initiate analysis, and see your results.

In recent posts, we stressed the point that unlike Sanger sequencing, Next Gen sequencing demands that data collection and analysis be tightly coupled, and presented our initial approach of analyzing Next Gen data with the Maq program. We also discussed how the different steps (basecalling, alignment, statistical analysis) provide a framework for analyzing Next Gen data and described how these steps belong to three phases: primary, secondary, and tertiary data analysis. Last, we gave an example of how FinchLab can be used to characterize data sets for Tag Profiling experiments. This post expands the discussion to include characterization of data sets for ChIP-Seq.


ChIP (Chromatin Immunoprecipitation) is a technique in which DNA-binding proteins, like transcription factors, can be localized to regions of a DNA molecule. We can use this method to identify which DNA sequences control expression and regulation for diverse genes. In the ChIP procedure, cells are treated with a reversible cross-linking agent to "fix" proteins to other proteins that are nearby, as well as to the chromosomal DNA where they're bound. The DNA is then purified and broken into smaller chunks by digestion or shearing, and antibodies are used to precipitate any protein-DNA complexes that contain their target antigen. After the immunoprecipitation step, unbound DNA fragments are washed away, the bound DNA fragments are released, and their sequences are analyzed to determine which DNA sequences the proteins were bound to. Only a few years ago, this procedure was much more complicated than it is today; for example, the fragments had to be cloned before they could be sequenced. When microarrays became available, a microarray-based technique called ChIP-on-chip made this assay more efficient by allowing a large number of precipitated DNA fragments to be tested in fewer steps.

Now, Next Gen sequencing takes ChIP assays to a new level [1]. In ChIP-seq, the same cross-linking, isolation, immunoprecipitation, and DNA purification steps are carried out. However, instead of hybridizing the resulting DNA fragments to a DNA array, the last step involves adding adaptors and sequencing the individual DNA fragments in parallel. When compared to microarrays, ChIP-seq experiments are less expensive, require fewer hands-on steps, and benefit from the lack of hybridization artifacts that plague microarrays. Further, because ChIP-seq experiments produce sequence data, they allow researchers to interrogate the entire genome; the experimental results are no longer limited to the probes on the microarray. ChIP-Seq data are better at distinguishing similar sites and collecting information about point mutations that may give insights into gene expression. No wonder ChIP-Seq is growing in popularity.


To perform a ChIP-seq experiment, you need to have a Next Gen sequencing instrument. You will also need the ability to run an alignment program and work with the resulting data to get your results. This is easier said than done. Once the alignment program runs, you might also have to run additional programs and scripts to translate raw output files into meaningful information. The FinchLab ChIP-seq pipeline, for example, runs Maq to generate the initial output, then runs maq pileup to convert the data to a pileup file. The pileup file is then read by a script to create the HTML report, thumbnail images to see what is happening, and "wig" files that can be viewed in the UCSC Genome Browser. If you do this yourself, you have to learn the nuances of the alignment program, how to run it in different ways to create the data sets, and write the scripts to create the HTML reports, graphs, and wig files.
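To give a feel for the kind of script involved, here is a minimal sketch of the pileup-to-wig conversion step. It assumes each pileup line begins with chromosome, 1-based position, reference base, and read depth; check the maq documentation for the exact column layout of its pileup output, and note that the track name and toy records below are invented:

```python
def pileup_to_wig(pileup_lines, track_name="coverage"):
    """Convert pileup-style depth records into a variableStep
    wiggle track that the UCSC Genome Browser can display."""
    out = ['track type=wiggle_0 name="%s"' % track_name]
    current_chrom = None
    for line in pileup_lines:
        chrom, pos, _refbase, depth = line.split()[:4]
        if chrom != current_chrom:
            # start a new declaration whenever the chromosome changes
            out.append("variableStep chrom=%s" % chrom)
            current_chrom = chrom
        out.append("%s %s" % (pos, depth))
    return "\n".join(out)

# Toy pileup records: chromosome, position, reference base, depth.
pileup = [
    "chr1\t100\tA\t12",
    "chr1\t101\tC\t15",
    "chr2\t50\tG\t3",
]
wig = pileup_to_wig(pileup, "chipseq_depth")
```

The resulting text can be uploaded to the UCSC Genome Browser as a custom track to view read depth along the genome.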

With FinchLab, you can skip those steps. You get the same results by clicking a few links to sort the data, and a few more to select the files, run the pipeline, and view the summarized results. You can also click a single link to send the data to the UCSC genome browser for further exploration.


[1] ChIP-seq: welcome to the new frontier. Nature Methods 4, 613–614 (2007)