
Tuesday, December 4, 2012

Commonly Rare

Rare is the new common. The final month of the year is always a good time to review progress and think about what's next.  In genetics, massively parallel next generation sequencing (NGS) technologies have been a dominating theme, and for good reason.

Unlike the previous high-throughput genetic analysis technologies (Sanger sequencing and microarrays), NGS allows us to explore genomes in far deeper ways and measure functional elements and gene expression in global ways.

What have we learned?

Distribution of rare and common variants. From [1] 
The ENCODE project has produced a picture in which a much greater fraction of the genome may play some functional role than previously understood [1]. However, a larger theme has been observing rare variation and trying to understand its impact on human health and disease. Because the enzymes that replicate DNA and correct errors are not perfect, each time a genome is copied a small number of mutations are introduced, on average between 35 and 80. Since sperm are continuously produced, fathers contribute more mutations than mothers, and the number of new mutations increases with the father's age [2]. While this number is tiny relative to the three-billion-base genome a father contributes, rare diseases and intellectual disorders can result.

A consequence is that the exponentially growing human population has accumulated a very large number of rare genetic variants [3]. Many of these variants can be predicted to affect phenotype, and many more may modify phenotypes in as yet unknown ways [4,5]. We are also learning that variants generally fall into two categories: they are either common to all populations or confined to specific populations (figure). More importantly, for a given gene the number of rare variants can vastly outnumber the number of previously known common variants.

Another consequence of the high abundance of rare variation is its impact on the resources used to measure variation and map disease to genotypes. For example, microarrays, which have been the primary tool of genome-wide association studies, utilize probes developed from a human reference genome sequence. When rare variants are factored in, many probes have issues ranging from "hidden" variation within a probe to a probe simply not being able to measure a variant that is present. Linkage block size is also affected [6]. What this means is that the best arrays going forward will be tuned to specific populations. It also means we need to devote more energy to developing refined reference resources, because the current tools do not adequately account for human diversity [6,7].

What's next?

Rare genetic variation has been understood for some time. What's new is understanding just how extensive these variants are in the human population, a result of the population's recent rapid expansion under very little selective pressure. Hence, linking variation to health and disease is the next big challenge and the cornerstone of personalized medicine, or, as some prefer, precision medicine. Conquering this challenge will require detailed descriptions of phenotypes, in many cases at the molecular level. As the vast majority of variants, benign or pathogenic, lie outside of coding regions, we will need to deeply understand how the functional elements initially defined by ENCODE are affected by rare variation. We will also need to layer in epigenetic modifications.

For the next several years the picture will be complex.

References:

[1] 1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422), 56-65. PMID: 23128226

[2] Kong, A., et al. (2012). Rate of de novo mutations and the importance of father's age to disease risk. Nature, 488(7412), 471-475. DOI: 10.1038/nature11396

[3] Keinan, A., and Clark, A. (2012). Recent Explosive Human Population Growth Has Resulted in an Excess of Rare Genetic Variants. Science, 336(6082), 740-743. DOI: 10.1126/science.1217283

[4] Tennessen, J., et al. (2012). Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes. Science, 337(6090), 64-69. DOI: 10.1126/science.1219240

[5] Nelson, M., et al. (2012). An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People. Science, 337(6090), 100-104. DOI: 10.1126/science.1217876

[6] Rosenfeld, J.A., Mason, C.E., and Smith, T.M. (2012). Limitations of the human reference genome for personalized genomics. PLoS ONE, 7(7). PMID: 22811759

[7] Smith, T.M., and Porter, S.G. (2012). Genomic Inequality. The Scientist.



Thursday, July 12, 2012

Resources for Personalized Medicine Need Work

Yesterday (July 11, 2012), PLoS ONE published an article prepared by my colleagues and me entitled "Limitations of the Human Reference Genome for Personalized Genomics."

This work, supported by Geospiza's SBIR targeting ways to improve mutation detection and annotation, explored some of the resources and assumptions that are used to measure and understand sequence variation. As we know, a key deliverable of the Human Genome Project was to produce a high-quality reference sequence that could be used to annotate genes, develop research tools like genotyping and microarray assays, and provide insights to guide software development. Projects like HapMap used these resources to provide additional understanding of genetic linkage in populations.

Decreasing sequencing costs
Since those early projects, DNA sequencing costs have plummeted. As a result, endeavors such as the 1000 Genomes Project (1KGP) and public contributions from Complete Genomics (CG) have dramatically increased the number of known sequence variants. A question worth asking is how these new data affect the utility of the current resources and assumptions that have guided genomics and genetics for the past six or seven years.

Number of variants by dbSNP build
To address the above question, we evaluated several assay and software tools that were based on the human genome reference sequence in the context of new data contributed by 1KGP and CG. We found a high frequency of confounding issues with microarrays, and many cases where invalid assumptions, encoded in bioinformatics programs, underestimate variability or possibly misidentify the functional effects of mutations. For example, 34% of published array-based GWAS studies for a variety of diseases utilize probes that contain undocumented variation or map to regions of previously unknown structural variation. Similarly, estimates of linkage disequilibrium block size shrink as the number of known variants increases.
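
To make the probe problem concrete, here is a minimal Python sketch (not the analysis pipeline used in the paper) that flags array probes whose genomic footprint overlaps known variant positions. The probe coordinates and variant list are hypothetical inputs chosen only for illustration.

```python
# Hypothetical illustration: flag microarray probes that contain "hidden" variants.
# Probe and variant coordinates are made-up examples, not data from the paper.

probes = {
    # probe_id: (chromosome, start, end), 0-based half-open coordinates
    "probe_0001": ("chr1", 1_000_000, 1_000_050),
    "probe_0002": ("chr1", 2_500_000, 2_500_050),
}

variants = [
    # (chromosome, position) of known SNVs, e.g. from dbSNP or 1KGP
    ("chr1", 1_000_023),
    ("chr2", 500_000),
]

def probes_with_hidden_variants(probes, variants):
    """Return probe IDs whose interval contains at least one known variant."""
    flagged = set()
    for chrom, pos in variants:
        for probe_id, (p_chrom, start, end) in probes.items():
            if chrom == p_chrom and start <= pos < end:
                flagged.add(probe_id)
    return flagged

print(probes_with_hidden_variants(probes, variants))  # {'probe_0001'}
```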

The significance of this work is that it documents what many are anecdotally experiencing. As we continue to learn about the contributing role of rare variation in human disease we need to fully understand how current resources can be used and work to resolve discrepancies in order to create an era of personalized medicine.

Rosenfeld, J.A., Mason, C.E., and Smith, T.M. (2012). Limitations of the Human Reference Genome for Personalized Genomics. PLoS ONE, 7(7). DOI: 10.1371/journal.pone.0040294

Thursday, October 13, 2011

Personalities of Personal Genomes


"People say they want their genetic information, but they don’t." "The speaker's views of data return are frankly repugnant." These were some of the [paraphrased] comments and tweets expressed during Cold Spring Harbor's fourth annual conference entitled "Personal Genomes" held Sep 30 - Oct 2, 2011. The focus of which was to explore the latest technologies and approaches for sequencing genomes, exomes, and transcriptomes in the context of how genome science is, and will be, impacting clinical care. 

The future may be closer than we think

In previous years, the concept of personal genome sequencing as a way to influence medical treatment was a vision. Last year, the reality of the vision was evident through a limited number of examples. This year, several new examples were presented along with the establishment of institutional programs for genomic-based medicine. The driver is the continuing decrease in data collection costs combined with corresponding access to increasing amounts of data. According to Richard Gibbs (Baylor College of Medicine), we will have close to 5,000 genomes completely sequenced by the end of this year, and by the end of 2012, 30,000 complete genome sequences are expected.

The growth of genome sequencing is now significant enough that leading institutions are also beginning to establish guidelines for genomics-based medicine. Hence, an ethics panel discussion was held during the conference. The conversation about how DNA sequence data may be used has been an integral discussion since the beginning of the Genome Project. Indeed, James Watson shared his lament for having to fund ethics research and directly asked the panel if they have done any good. There was a general consensus, from the panel and from audience members who have had their genomes sequenced, that ethics funding has helped by establishing genetic counseling and education practices.

However, as pointed out by some audience members, this ethics panel, like many others, focused too heavily on the risks to individuals and society of having their genomic data. In my view, the discussion would have been more interesting and balanced if the panel had included individuals who are working outside of institutions with new approaches for understanding health. Organizations like 23andMe, PatientsLikeMe, or the Genetic Alliance bring a very different and valuable perspective to the conversation.

Ethics was a fraction of the conference. The remaining talks were organized into six sessions that covered personal cancer genomics, medically actionable genomics, personal genomes, rare diseases, and clinical implementations of personal genomics. The key message from these presentations and posters was that, while genomics-based medical approaches have demonstrated success, much more research needs to be done before such approaches become mainstream.

For example, in the case of cancer genomics, whole genome sequences from tumor and normal cells can give a picture of point mutations and structural rearrangements, but these data need to be accompanied by exome sequences to get the high read depth needed to accurately detect the low levels of rare mutations that may be dysregulating cell growth or conferring resistance to treatment. Yet the resulting profiles of variants are still inadequate to fully understand the functional consequences of the mutations. For this, transcriptome profiling is needed, and that is just the start.

Once the data are collected, they need to be processed in different ways, filtered, and compared within and between samples. Information from many specialized databases will be used in conjunction with statistical analyses to develop insights that can be validated through additional assays and measurements. Finally, a lab seeking to do this work, and return results back to patients, will also need to be certified, minimally to CLIA standards. For many groups this is a significant undertaking, and good partners with experience and strong capabilities, like PerkinElmer, will be needed.

Further Reading

Nature Coverage, Oct 6 issue:
Genomes on prescription


Friday, June 10, 2011

Sneak Peek: NGS Resequencing Applications: Part I – Detecting DNA Variants

Join us next Wed. June 15 for a webinar on resequencing applications.

Description
This webinar will focus on DNA variant detection using Next Generation Sequencing for the applications of targeted and exome resequencing, as well as whole transcriptome sequencing. The presentation will include an overview of each application and its specific data analysis needs and challenges. Topics covered will include secondary analysis (alignments, reference choices, variant detection), with a particular emphasis on DNA variant detection as well as multi-sample comparisons. For in-depth comparisons of variant detection methods, Geospiza’s cloud-based GeneSifter Analysis Edition software will be used to assess sample data from NCBI’s GEO and SRA. The webinar will also include a short presentation on how these tools can be deployed for individual researchers as well as through Geospiza’s Partner Program for NGS sequencing service providers.

Details:
Date and time: Wednesday, June 15, 2011 10:00 am
Pacific Daylight Time (San Francisco, GMT-07:00)
Wednesday, June 15, 2011 1:00 pm 
Eastern Daylight Time (New York, GMT-04:00)
Wednesday, June 15, 2011 6:00 pm
GMT Summer Time (London, GMT+01:00)
Duration: 1 hour

Wednesday, April 6, 2011

Sneak Peek: RNA-Sequencing Applications in Cancer Research: From fastq to differential gene expression, splicing and mutational analysis

Join us next Tuesday, April 12 at 10:00 am PST for a webinar focused on RNA-Seq applications in breast cancer research.

The field of cancer genomics is advancing quickly. News reports from the annual American Association for Cancer Research meeting indicate that whole genome sequencing studies, such as the 50 breast cancer genomes (WashU), are providing more clues about the genes that may be affected in cancer. Meanwhile, the ACLU/Myriad Genetics legal action over genetic testing for breast cancer mutations and disease predisposition continues to move towards the Supreme Court.

Breast cancer, like many other cancers, is complex. Sequencing genomes is one way to interrogate cancer biology. However, genome sequence data in isolation do not tell the complete story. The RNA, representing expressed genes, their isoforms, and non-coding RNA molecules, needs to be measured too. In this webinar, Eric Olson, Geospiza's VP of product development and principal designer of GeneSifter Analysis Edition, will explore the RNA world of breast cancer and present how you can explore existing data to develop new insights.

Abstract
Next Generation Sequencing applications allow biomedical researchers to examine the expression of tens of thousands of genes at once, giving researchers the opportunity to examine expression across entire genomes. RNA sequencing applications such as Tag Profiling, Small RNA, and Whole Transcriptome Analysis can identify and characterize both known and novel transcripts, splice junctions, and non-coding RNAs. These sequencing-based applications also allow for the examination of nucleotide variants. Next Generation Sequencing and these RNA applications allow researchers to examine the cancer transcriptome at an unprecedented level. This presentation will provide an overview of the gene expression data analysis process for these applications, with an emphasis on identification of differentially expressed genes, identification of novel transcripts, and characterization of alternative splicing, as well as variant analysis and small RNA expression. Using data drawn from the GEO data repository and the Short Read Archive, NGS Tag Profiling, Small RNA, and NGS Whole Transcriptome Analysis data will be examined in breast cancer.

You can register at the webex site, or view the slides after the presentation.

Wednesday, November 3, 2010

Samples to Knowledge

Today Geospiza and Ingenuity announced a collaboration to integrate our respective GeneSifter Analysis Edition (GSAE) and Ingenuity Pathway Analysis (IPA) software systems. 

Why is this important?

Geospiza has always been committed to providing our customers the most complete software systems for genetic analysis. Our LIMS, GeneSifter Laboratory Edition (GSLE), and GSAE have worked together to form a comprehensive samples-to-results platform. From core labs, to individual research groups, to large-scale sequencing centers, GSLE is used for collecting sample information, tracking sample processing, and organizing the resulting DNA sequences, microarray files, and other data. Advanced quality reports keep projects on track and within budget.

For many years, GSAE has provided a robust and scalable way to scientifically analyze the data collected for many samples. Complex datasets are reduced and normalized to produce quantitative values that can be compared between samples and within groups of samples. Additionally, GSAE has integrated open resources like the Gene Ontology and KEGG pathways to explore the biology associated with lists of differentially expressed genes. In the case of Next Generation Sequencing, GSAE has had the most comprehensive and integrated support for the entire data analysis workflow, from basic quality assessment to sequence alignment and comparative analysis.

With Ingenuity we will be able to take data-driven biology exploration to a whole new level.  The IPA system is a leading platform for discovering pathways and finding the relevant literature associated with genes and lists of genes that show differential expression in microarray analysis. Ingenuity's approach focuses on combining software curation with expert review to create a state-of-the-art system that gets scientists to actionable information more quickly than conventional methods.  

Through this collaboration two leading companies will be working together to extend their support for NGS applications. GeneSifter's pathway analysis capabilities will increase and IPA's support will extend to NGS. Our customers will benefit by having access to the most advanced tools for turning vast amounts of data into biologically meaningful results to derive new knowledge.

Samples to Results™ becomes Samples to Knowledge™

Monday, May 17, 2010

Journal Club: GeneSifter Aids Stem Cell Research

Last week’s Nature featured an article entitled “Aberrant silencing of imprinted genes on chromosome 12qF1 in mouse induced pluripotent stem cells [1]” in which GeneSifter Analysis Edition (GSAE) was used to compare gene expression between genetically identical mouse embryonic stem (ES) cells and induced pluripotent stem cells (iPSCs).

Stem Cells 

Stem cells are undifferentiated, pluripotent cells that later develop into the specialized cells of tissues and organs. Pluripotent cells can divide essentially without limit, become any kind of cell, and have been found to naturally repair certain tissues. They are the focus of research because of their potential for treating diseases that damage tissues. Initially, stem cells were isolated from embryonic tissues. However, with human cells, this approach is controversial. In 2006, researchers developed ways to “reprogram” somatic cells to become pluripotent cells [2]. In addition to being less controversial, iPSCs have other advantages, but there are open questions as to their therapeutic safety due to potential artifacts introduced during the reprogramming process.

Reprogramming cells to become iPSCs involves the overexpression of a select set of transcription factors by viral transfection, DNA transformation, and other methods. To better understand what happens during reprogramming, researchers have examined gene expression and DNA methylation patterns between ES cells and iPSCs and have noted major differences in mRNA and microRNA expression as well as DNA methylation patterns. As noted in the paper, a problem with previous studies is that they compared cells with different genetic backgrounds. That is, the iPSCs harbor viral transgenes that are not present in the ES cells, and the observed differences could likely be due to factors unrelated to reprogramming. Thus, a goal of this paper's research was to compare genetically identical cells to pinpoint the exact mechanisms of reprogramming. 

GeneSifter in Action 

Comparing genetically similar cells requires that both ES cells and iPSCs have the same transgenes. To accomplish this goal, Stadtfeld and coworkers devised a clever strategy whereby they created a novel inducible transgene cassette and introduced it into mouse ES cells. The modified ES cells were then used to generate cloned mice containing the inducible gene cassette in all of their cells. Somatic cells could be converted to iPSCs by adding the appropriate inducing agents to the tissue culture media.
Even though the ES cells and iPSCs were genetically identical, the ES cells were able to generate live mice whereas the iPSCs could not. To understand why, the team looked at gene expression using microarrays. The mRNA profiles for six iPSC and four ES cell replicates were analyzed in GeneSifter. Unsupervised clustering showed that global gene expression was similar for all cells. When the iPSC and ES cell data were compared using correlation analysis, the scatter plot identified two differentially expressed transcripts corresponding to a non-coding RNA (Gtl2) and a small nucleolar RNA (Rian). The transcripts’ genes map to the imprinted Dlk1-Dio3 gene cluster on mouse chromosome 12qF1. While these genes were strongly repressed in the iPSC clones, the expression of housekeeping and pluripotency genes was unaffected, as demonstrated using GeneSifter’s expression heat maps.
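
For readers who want a feel for this kind of comparison, here is a minimal, hypothetical sketch of the general approach rather than the GeneSifter implementation: compare mean expression between two groups of replicates and flag transcripts that fall far off the diagonal. The expression matrix below is invented for illustration.

```python
# Minimal sketch of a two-group expression comparison using a small,
# made-up expression matrix; not the GeneSifter implementation.
import numpy as np

genes = ["Gtl2", "Rian", "Actb", "Gapdh"]
# rows = genes, columns = replicates (4 ES, 6 iPSC); values are normalized intensities
es   = np.array([[900, 950, 870, 920],
                 [800, 820, 790, 810],
                 [500, 510, 495, 505],
                 [600, 590, 610, 605]], dtype=float)
ipsc = np.array([[ 40,  35,  50,  45,  38,  42],
                 [ 30,  28,  33,  31,  29,  35],
                 [495, 505, 500, 498, 502, 510],
                 [595, 600, 610, 605, 598, 602]], dtype=float)

log2_fc = np.log2(ipsc.mean(axis=1) / es.mean(axis=1))
for gene, fc in zip(genes, log2_fc):
    flag = "  <-- off the diagonal" if abs(fc) > 2 else ""
    print(f"{gene:6s} log2(iPSC/ES) = {fc:6.2f}{flag}")
```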

Subsequent experiments, which looked at gene expression from over 60 iPSC lines produced from different types of cells and at chimeric mice produced from mixtures of iPSCs and stem cells, showed that the gene-silenced iPSCs had limited developmental potential. Because Dlk1-Dio3 cluster imprinting is regulated by methylation, the team examined methylation patterns and found that the Gtl2 allele had acquired an aberrant silent state in the iPSC clones. Finally, knowing that Dlk1-Dio3 cluster imprinting is also regulated by histone acetylation, the authors were able to treat their iPSCs with a histone deacetylase inhibitor and produce live animals from the iPSCs. Producing live animals from iPSCs is a significant milestone for the field.

While histone deacetylase inhibitors have multiple effects, and more work will need to be done, the authors have completed a tour de force of work in this exciting field, and we are thrilled that our software could assist in this important study. 

Further Reading

1. Stadtfeld M., Apostolou E., Akutsu H., Fukuda A., Follett P., Natesan S., Kono T., Shioda T., Hochedlinger K., 2010. "Aberrant silencing of imprinted genes on chromosome 12qF1 in mouse induced pluripotent stem cells." Nature 465, 175-181.

2. Takahashi K., Yamanaka S., 2006. "Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors." Cell 126, 663-676.

Tuesday, May 11, 2010

Journal Club: Decoding Biology

DNA sequences hold the information needed to create proteins and regulate their abundance. Genomics research focuses on deciphering the codes that control these processes by combining DNA sequences with data from assays that measure gene expression and protein interactions. The codes are deciphered when specific sequence elements (motifs) are identified and can later be used to predict outcomes. The recent Nature article “Deciphering the Splicing Code” begins to reveal the codes of alternative splicing.

The genetic codes 

Since the discovery that DNA is a duplex molecule [1] that stores and replicates the information of living systems, the goal of modern biology has been to understand how the blueprint of a living system is encoded in its DNA. The first quest was to learn how DNA's four-letter nucleotide code is translated into the 20-letter amino acid code of proteins. Experiments conducted in the 1960s revealed that different combinations of triplet DNA bases encode specific amino acids, producing the “universal” genetic code, which is nearly identical in all species examined to date [2].

Translating mRNA into protein, however, is a complex process that involves many proteins and ribosomal RNA (rRNA) collectively organized into ribosomes. As the ribosomes read the mRNA sequence, transfer RNA (tRNA) molecules bring individual amino acids to the ribosome, where they are added to a growing polypeptide chain. The universal genetic code explained how tri-nucleotide sequences specify amino acids. It could also be used to elucidate the anti-codon portion of tRNA [3], but it could not explain how the correct amino acid is added to the tRNA. For that, another genetic code needed to be cracked. In this code, first proposed in 1988 [4], multiple sequences within each tRNA molecule, including the anti-codon loop, are recognized by a matched enzyme that joins an amino acid to its appropriate tRNA.

Codes to create diversity

The above codes are involved in the process of translating genetic sequences into protein. Most eukaryotic genes, and a few prokaryotic genes, cannot be translated in a continuous way because the protein coding regions (exons) are interrupted by non-coding regions (introns). When DNA is first transcribed into RNA, all regions are included, and the introns must be excised to form the final messenger RNA (mRNA). This process makes it possible to create many different proteins from a single gene through alternative splicing, in which exons are either differentially removed or portions of exons are joined together. Alternative splicing occurs in developmentally and tissue-specific ways, and many disease-causing mutations disrupt splicing patterns. So, understanding the codes that control splicing is an important research topic.

Some of the splicing codes, such as the exon boundaries, are well known, and others are not. In “Deciphering the Splicing Code,” Barash and colleagues looked at thousands of alternatively spliced exons - and surrounding intron sequences - from 27 mouse tissues to unravel over 1,000 sequence features that could define a new genetic code. Their goal is to build catalogs of motifs that could be used to predict splicing patterns of uncharacterized exons and determine how mutations might affect splicing.

Using data from existing microarray experiments, RNA sequence features compiled from the literature, and other known attributes of RNA structure, Barash and co-workers developed computer models to determine which combinations of features best correlated with experimental observations. The resulting computer program predicted, with reasonable success, whether an exon would be included or excluded in a given tissue based on its surrounding motif sequences. More importantly, the program could be used to identify interaction networks that revealed pairs of motifs frequently observed together.
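
As a rough illustration of the general idea (not the authors' actual model), the sketch below trains a simple logistic regression on hypothetical motif-count features to predict whether an exon is included in a given tissue. The feature columns, tissue indicator, and training examples are all invented for the example.

```python
# Toy illustration of feature-based splicing prediction; the motif features
# and training examples are invented, and this is not the model from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: counts of hypothetical motifs in the exon and flanking introns,
# plus a tissue indicator (0 = brain, 1 = muscle).
X = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 4, 2, 1],
    [1, 3, 1, 1],
    [4, 0, 0, 0],
    [0, 5, 3, 1],
])
# 1 = exon included in the mature transcript, 0 = skipped
y = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Predict inclusion probability for a new exon's feature vector
new_exon = np.array([[1, 2, 1, 1]])
print("P(exon included) =", model.predict_proba(new_exon)[0, 1])
```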

Predicting alternative splicing is at an early stage, but as pointed out by the editorial summary, the approach of Barash and co-workers will be improved by the massive amounts of data being generated by new sequencing technologies and applications like RNA-Seq and various protein binding assays. The real test will be expanding the models to new tissues and to human genomics. In the meantime, if you want to test their models on some of your data or explore new regulatory elements, the Frey lab has developed a web tool that can be accessed at http://genes.toronto.edu/wasp/.

I’m done with seconds, can I have a third? 

As an aside, the authors of the editorial summary dubbed the work the second genetic code. I find this amusing, because this would be the third second genetic code. The aminoacyl tRNA code was also called the second genetic code, but people must have forgotten that, because another second genetic code was proposed in 2001. That genetic code describes how methylated DNA sequences regulate chromatin structure and gene expression. Rather than have a third second genetic code, maybe we should refer to this as the third genetic code or the next generation code.

Further Reading


1. Watson JD, and Crick F (1953). "A structure for deoxyribose nucleic acid". Nature 171: 737–8.

2. http://en.wikipedia.org/wiki/Genetic_code

3. http://nobelprize.org/nobel_prizes/medicine/laureates/1968/holley-lecture.pdf

4. Hou YM, Schimmel P (1988) "A simple structural feature is a major determinant of the identity of a transfer RNA." Nature 333:140-5.

Monday, July 6, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part IV: HDF5 Benefits

Now that we're back from Alaska and done with the 4th of July fireworks, it's time to present the next installment of our series on BioHDF.

HDF highlights
HDF technology is designed for working with large amounts of scientific data and is well suited for Next Generation Sequencing (NGS). Scientific data are characterized by very large datasets that contain discrete numeric values, images, and other data, collected over time from different samples and locations. These data naturally organize into multidimensional arrays. To obtain scientific information and knowledge, we combine these complex datasets in different ways and (or) compare them to other data using multiple computational tools. One difficulty that plagues this work is that the software applications and systems for organizing the data, comparing datasets, and visualizing the results are complicated, resource intensive, and challenging to develop. Many of the development and implementation challenges can be overcome using the HDF5 file format and software library.

Previous posts have covered:
1. An introduction
2. Background of the project
3. Complexities of NGS data analysis and performance advantages offered by the HDF platform.

HDF5 changes how we approach NGS data analysis.

As previously discussed, the NGS data analysis workflow can be broken into three phases. In the first phase (primary analysis) images are converted into short strings of bases. Next, the bases, represented individually or encoded as di-bases (SOLiD), are aligned to reference sequences (secondary analysis) to create derivative data types such as contigs or annotated tables of alignments, that are further analyzed (tertiary analysis) in comparative ways. Quantitative analysis applications, like gene expression, compare the results of secondary analyses between individual samples to measure gene expression and identify mRNA isoforms, or make other observations based on a sample’s origin or treatment.

The alignment phase of the data analysis workflow creates the core information. During this phase, reads are aligned to multiple kinds of reference data to understand sample and data quality, and obtain biological information. The general approach is to align reads to sets of sequence libraries (reference data). Each library contains a set of sequences that are annotated and organized to provide specific information.

Quality control measures can be added at this point. One way to measure data quality is to ask how many reads were obtained from constructs without inserts. Aligning the read data to a set of primers (individually and joined in different ways) that were used in the experiment allows us to measure the number of reads that match and how well they match. A higher quality dataset will have a larger proportion of sequences matching our sample and a smaller proportion of sequences that only match the primers. Similarly, different biological questions can be asked using libraries constructed of sequences that have biological meaning.

Aligning reads to sequence libraries is the easy part. The challenge is analyzing the alignments. Because the read datasets in NGS assays are large, organizing alignment data into forms we can query is hard. The problem is simplified by setting up multistage alignment processes as a set of filters. That is, reads that match one library are excluded from the next alignment. Differential questions are then asked by counting the numbers of reads that match each library. With this approach, each set of alignments is independent of the other alignments, and a program only needs to analyze one set of alignments at a time. Filter-based alignment is also used to distinguish reads with perfect matches from those with one or more mismatches.
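
To make the filter idea concrete, here is a minimal, hypothetical Python sketch of a filter-based counting cascade. The matching step is a crude substring check standing in for a real aligner, and the library contents are invented, so the sketch only illustrates the bookkeeping.

```python
# Hypothetical filter-based alignment cascade: reads that "hit" one library
# are removed before the next library is searched. The matching step is a
# simple substring check standing in for a real aligner.

libraries = {
    "primers": ["ACGTACGT"],
    "rRNA":    ["GGGCCCAAATTT"],
    "refseq":  ["ATGGCGTACGTTAGC"],
}

reads = ["ACGTACGTAA", "GGGCCCAAATTTCC", "ATGGCGTACGT", "TTTTTTTTTT"]

def matches(read, references):
    """Stand-in for an aligner: does the read's prefix occur in any reference?"""
    return any(read[:8] in ref or ref in read for ref in references)

remaining = list(reads)
counts = {}
for name, refs in libraries.items():          # order matters in a filter cascade
    hits = [r for r in remaining if matches(r, refs)]
    counts[name] = len(hits)
    remaining = [r for r in remaining if r not in hits]   # filter before next stage
counts["unmatched"] = len(remaining)

print(counts)   # {'primers': 1, 'rRNA': 1, 'refseq': 1, 'unmatched': 1}
```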

Still, filter-based alignment approaches have several problems. When new sequence libraries are introduced, the entire multistage alignment process must be repeated to update results. Next, information about reads that have multiple matches in different libraries, or perfect matches and non-perfect matches within a library are lost. Finally, because alignment formats between programs differ and good methods for organizing alignment data do not exist, it is hard to compare alignments between multiple samples. This last issue also creates challenges for linking alignments to the original sequence data and formatting information for other tools.

As previously noted, solving the above problems requires that alignment data be organized in ways that facilitate computation. HDF5 provides the foundation to organize and store both read and alignment data to enable different kinds of data comparisons. This ability is demonstrated by the following two examples.

In the first example (left), reads from different sequencing platforms (SOLiD, Illumina, 454) were stored in HDF5. Illumina RNA-Seq reads from three different samples were aligned to the human genome and annotations from a UCSC GFF (genome file format) file were applied to define gene boundaries. The example shows the alignment data organized into three HDF5 files, one per sample, but in reality the data could have been stored in a single file or files organized in other ways. One of HDF's strengths is that the HDF5 I/O library can query multiple files as if they were a single file, providing the ability to create the high-level data organizations that are the most appropriate for a particular application or use case. With reads and alignments structured in these files, it is a simple matter to integrate data to view base (color) compositions for reads from different sequencing platforms, compare alternative splicing between samples, and select a subset of alignments from a specific genomic region, or gene, in a "wig" format for viewing in a tool like the UCSC genome browser.

The second example (right) focuses on how organizing alignment data in HDF5 can change how differential alignment problems are approached. When data are organized according to a model that defines granularity and relationships, it becomes easier to compute all alignments between reads and multiple reference sources than to design and implement a differential alignment process. In this case, a set of reads (obtained from cDNA) are aligned to primers, QC data (ribosomal RNA [rRNA] and mitochondrial DNA [mtDNA]), miRBase, RefSeq transcripts, the human genome, and a library of exon junctions. During alignment, up to three mismatches are tolerated between a read and its hit. Alignment data are stored in HDF5 and, because the data were not filtered, a greater variety of questions can be asked. Subtractive questions mimic the differential pipeline where alignments are used to filter reads from subsequent steps. At the same time, we can also ask "biological" questions about the number of reads that came from rRNA or mtDNA, or from genes in the genome or exon junctions. And for these questions, we can examine the match quality between each read and its matching sequence in the reference data sources, without having to reprocess the same data multiple times.
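
A minimal h5py sketch of this idea, under an invented layout (one dataset of per-read, per-library best mismatch counts), shows how subtractive and biological questions can both be answered from the same stored alignments. This is an illustration only, not the BioHDF data model itself.

```python
# Illustrative only: a made-up HDF5 layout for per-read alignment summaries,
# not the actual BioHDF data model.
import h5py
import numpy as np

libraries = ["primers", "rRNA_mtDNA", "miRBase", "refseq", "genome", "exon_junctions"]
n_reads = 1000

# best mismatch count per read per library; 255 means "no hit"
rng = np.random.default_rng(0)
mismatches = rng.choice([0, 1, 2, 3, 255], size=(n_reads, len(libraries)),
                        p=[0.2, 0.1, 0.05, 0.05, 0.6])

with h5py.File("alignments.h5", "w") as f:
    f.create_dataset("reads/mismatches", data=mismatches, compression="gzip")
    f["reads/mismatches"].attrs["libraries"] = ",".join(libraries)

with h5py.File("alignments.h5", "r") as f:
    m = f["reads/mismatches"][...]
    hit = m <= 3                                  # any alignment within 3 mismatches
    # "Biological" question: how many reads align to rRNA/mtDNA at all?
    print("rRNA/mtDNA reads:", int(hit[:, 1].sum()))
    # Subtractive question: reads hitting RefSeq but none of the earlier QC libraries
    survivors = ~hit[:, :3].any(axis=1) & hit[:, 3]
    print("RefSeq-only reads after QC filters:", int(survivors.sum()))
    # Match-quality question: perfect vs. mismatched genome alignments
    print("perfect genome hits:", int((m[:, 4] == 0).sum()))
```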

The above examples demonstrate the benefits of being able to organize data into structures that are amenable to computation. When data are properly structured, new approaches that expand the ways in which data are analyzed can be implemented. HDF5 and its library of software routines move the development process from activities associated with optimizing the low level infrastructures needed to support such systems to designing and testing different data models and exploiting their features.

The final post of this series will cover why we chose to work with HDF5 technology.

Sunday, March 8, 2009

Bloginar: Next Gen Laboratory Systems for Core Facilities

Geospiza kicked off February by attending the AGBT and ABRF conferences. As part of our participation at ABRF, we presented a scenario, in our poster, where a core lab provides Next Generation Sequencing (NGS) transcriptome analysis services. This story shows how GeneSifter Lab and Analysis Edition’s capabilities overcome the challenges of implementing NGS in a core lab environment.

Like the last post, which covered our AGBT poster, the following poster map will guide the discussion.


As this poster overlaps the previous one in terms of providing information about RNA assays and analyzing the data, our main points below will focus on how GeneSifter Lab Edition solves challenges related to laboratory and business processes associated with setting up a new lab for NGS or bringing NGS into an existing microarray or Sanger sequencing lab.

Section 1 contains the abstract, an introduction to the core laboratory, and background information on different kinds of transcription profiling experiments.

The general challenge for a core lab lies in the need to run a business that offers a wide variety of scientific services for which samples (physical materials) are converted to data and information that have biological meaning. Different services often require different lab processes to produce different kinds of data. To facilitate and direct lab work, each service requires specialized information and instructions for samples that will be processed. Before work is started, the lab must review the samples and verify that the information has been correctly delivered. Samples are then routed through different procedures to prepare them for data collection. In the last steps, data are collected, reviewed, and the results are delivered back to clients. At the end of the day (typically monthly), orders are reviewed and invoices are prepared either directly or by updating accounting systems.

In the case of NGS, we are learning that the entire data collection and delivery process gets more complicated. When compared to Sanger sequencing, genotyping, or other assays that are run in 96-well formats, sample preparation is more complex. NGS requires that DNA libraries be prepared, and different steps of the process need to be measured and tracked in detail. Also, complicated bioinformatics workflows are needed to understand the data from both a quality control and biological meaning context. Moreover, NGS requires a substantial investment in information technology.

Section 2 walks through the ways in which GeneSifter Lab Edition helps to simplify the NGS laboratory operation.

Order Forms

In the first step, an order is placed. Screenshots show how GeneSifter can be configured for different services. Labs can define specialized form fields using a variety of user interface elements like check boxes, radio buttons, pull-down menus, and text entry fields. Fields can be required or optional, and special rules, such as ranges for values, can be applied to individual fields within specific forms. Orders can also be configured to take files as attachments to track data about samples, like gel images. To handle that special “for lab use only” information, fields in forms can be designated as laboratory use only. Such fields are hidden from the customer's view and are filled in later by lab personnel as the orders are processed. The advantage of GeneSifter’s order system is that the pertinent information is captured electronically in the same system that will be used to track sample processing and organize data. Indecipherable paper forms are eliminated, along with the problem of finding information scattered across multiple computers.

Web forms do create a special kind of data entry challenge. Specifically, when there is a lot of information to enter for a lot of samples, filling in numerous form fields on a web page can be a serious pain. GeneSifter solves this problem in two ways:

First, all forms can have “Easy Fill” controls that provide column highlighting (for fast tab-and-type data entry), auto fill-downs, and auto fill-downs with number increments, so one can easily “copy” common items into all cells of a column or add an incrementing number to all values in a column. When these controls are combined with the “Range Selector,” a powerful web-based user interface makes it easy to enter large numbers of values quickly and in flexible ways.

Second, sometimes the data to be entered are already in an Excel spreadsheet. To solve this problem, each form contains a specialized Excel spreadsheet validator. The form can be downloaded as an Excel template, and the rules previously assigned to each field when the form was created are used to check data when they are uploaded. This process spots problems with data items and reports them at upload time, when they are easy to fix, rather than later, when information is harder to find. This feature eliminates endless cycles of contacting clients to get the correct information.
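
As a rough, hypothetical sketch of what this kind of field-rule validation involves (not GeneSifter's actual implementation), the example below checks uploaded rows against per-field rules and reports problems with row numbers; the field names and rules are invented.

```python
# Hypothetical sketch of validating uploaded spreadsheet rows against
# per-field rules; not GeneSifter's implementation.

rules = {
    "sample_name":   {"required": True},
    "concentration": {"required": True, "min": 1.0, "max": 500.0},   # ng/uL
    "organism":      {"required": False, "choices": {"human", "mouse"}},
}

rows = [
    {"sample_name": "S1", "concentration": "250", "organism": "human"},
    {"sample_name": "",   "concentration": "900", "organism": "rat"},
]

def validate(rows, rules):
    problems = []
    for i, row in enumerate(rows, start=1):
        for field, rule in rules.items():
            value = (row.get(field) or "").strip()
            if rule.get("required") and not value:
                problems.append(f"row {i}: {field} is required")
                continue
            if value and ("min" in rule or "max" in rule):
                try:
                    num = float(value)
                except ValueError:
                    problems.append(f"row {i}: {field} must be a number")
                    continue
                if num < rule.get("min", num) or num > rule.get("max", num):
                    problems.append(f"row {i}: {field} out of range")
            if value and "choices" in rule and value not in rule["choices"]:
                problems.append(f"row {i}: {field} '{value}' not allowed")
    return problems

for problem in validate(rows, rules):
    print(problem)
```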

Laboratory Processing

Once order data are entered, the next step is to process orders. The middle of section 2 describes this process using an RNA-Seq assay as an example. Like other NGS assays, the RNA-Seq protocol has many steps involving RNA purification, fragmentation, random primed conversion into cDNA, and DNA library preparation of the resulting cDNA for sequencing. During the process, the lab needs to collect data on RNA and DNA concentration as well as determine the integrity of the molecules throughout the process. If a lab runs different kinds of assays they will have to manage multiple procedures that may have different requirements for ordering of steps and laboratory data that need to be collected.

By now it is probably not a surprise to learn that GeneSifter Lab Edition has a way to meet this challenge too. To start, workflows (lab procedures) can be created for any kind of process with any number of steps. The lab defines the number of steps and their order and which steps are required (like the order forms). Having the ability to mix required and optional steps in a workflow gives a lab the ultimate flexibility to support those “we always do it this way, except the times we don’t” situations. For each step the lab can also define whether or not any additional data needs to be collected along the way. Numbers, text, and attachments are all supported so you can have your Nanodrop and Bioanalyzer too.

Next, an important feature of GeneSifter workflows is that a sample can move from one workflow to another. This modular approach means that separate workflows can be created for RNA preparation, cDNA conversion, and sequencing library preparation. If a lab has multiple NGS platforms, or a combination of NGS and microarrays, they might find that a common RNA preparation procedure is used, but the processes diverge when the RNA is converted into forms for collecting data. For example, aliquots of the same RNA preparation may be assayed and compared on multiple platforms. In this case a common RNA preparation protocol is followed, but sub-samples are taken through different procedures, like a microarray and NGS assay, and their relationship to the “parent” sample must be tracked. This kind of scenario is easy to set up and execute in GeneSifter Lab Edition.

Finally, one of GeneSifter’s greatest advantages is that a customized system with all of the forms, fields, Excel import features, and modular workflows can be set up by lab operators without any programming. Achieving similar levels of customization with traditional LIMS products takes months or years, with initial and recurring costs of six or more figures.

Collecting Data

The last step of the process is collecting the data, reviewing it, and making sequences and results available to clients. Multiple screenshots illustrate how this works in GeneSifter Lab Edition. For each kind of data collection platform, a “run” object is created. The run holds the information about reactions (the samples ready to run) and where they will be placed in the container that will be loaded into the data collection instrument. In this context, the container describes 96- or 384-well plates, glass slides with divided areas called lanes, regions, chambers, or microarray chips. All of these formats are supported, and in some cases specialized files (sample sheets, plate records) are created and loaded into instrument collection software to inform the instrument about sample placement and run conditions for individual samples.
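
For illustration, here is a minimal, hypothetical example of generating the kind of sample sheet file an instrument might consume. The column names and layout are invented, since real instrument formats vary by platform, and this is not GeneSifter's output format.

```python
# Hypothetical sample sheet generation for a flow-cell style container;
# column names and layout are invented, not a real instrument format.
import csv

run = {
    "run_name": "RUN_2009_03_01",
    "reactions": [
        {"sample": "LibA_cDNA", "lane": 1, "protocol": "RNA-Seq"},
        {"sample": "LibB_cDNA", "lane": 2, "protocol": "RNA-Seq"},
        {"sample": "PhiX_control", "lane": 8, "protocol": "QC"},
    ],
}

with open(f"{run['run_name']}_samplesheet.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["lane", "sample", "protocol"])
    writer.writeheader()
    for reaction in sorted(run["reactions"], key=lambda r: r["lane"]):
        writer.writerow({k: reaction[k] for k in ("lane", "sample", "protocol")})
```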

During the run, samples are converted to data. This process, different for each kind of data collection platform, produces variable numbers and kinds of files that are organized in completely different ways. Using tools that work with GeneSifter, raw data and tracking information are entered into the database to simplify access to the data at a later time. The database also associates sample names and other information with data files, eliminating the need to rename files with complex tracking schemes. The last steps of the process involve reviewing quality information and deciding whether to release data to clients or repeat certain steps of the process. When data are released, each client receives an email directing them to their data.

The lab updates the orders and optionally creates invoices for services. GeneSifter Lab Edition can be used to manage those business functions as well. We’ll cover GeneSifter’s pricing and invoicing tools some other time; be assured they are as complete as the other parts of the system.

NGS requires more than simple data delivery

Section 3 covers issues related to the computational infrastructure needed to support NGS and the data analysis aspects of the NGS workflow. In this scenario, our core lab also provides data analysis services to convert those multi-million read files into something that can be used to study biology. Much of this was covered in the previous post, so it will not be repeated here.

I will summarize by making the final point that Geospiza’s GeneSifter products cover all aspects of setting up a lab for NGS. From sample preparation, to collecting data, to storing and distributing results, to running complex bioinformatics workflows and presenting information in ways that yield scientifically meaningful results, a comprehensive solution is offered. GeneSifter products can be delivered as hosted solutions to lower costs. Our hosted, Software-as-a-Service solutions allow groups to start inexpensively and manage costs as their needs scale. More importantly, unlike in-house IT systems, which require significant planning and implementation time to remodel (or build) server rooms and install computers, GeneSifter products get you started as soon as you decide to sign up.

Wednesday, March 4, 2009

Bloginar: The Next Generation Dilemma: Large Scale Data Analysis

Previous posts shared some of the things we learned at the AGBT and ABRF meetings in early February. Now it is time to share the work we presented, starting with the AGBT poster, “The Next Generation Dilemma: Large Scale Data Analysis.”

The goal of the poster was to provide a general introduction to the power of Next Generation Sequencing (NGS) and a framework for data analysis. Hence, the abstract described the general NGS data analysis process, its issues, and what we are doing for one kind of transcription profiling, RNA-Seq. Between then and now we learned a few things... And the project grew.

The map below guides my “bloginar” poster presentation. In keeping with the general theme of the abstract we focused on transcription analysis, but instead of focusing exclusively on RNA-Seq, the project expanded to compare three kinds of transcription profiling: RNA-Seq, Tag Profiling, and Small RNA Analysis. A link to the poster is provided at the end.

Section 1 provides a general introduction to NGS by discussing the ways NGS is being used to study different aspects of molecular biology. It also covers how the data are analyzed in three phases (primary, secondary, tertiary) to convert raw data into biologically meaningful information. The three-phase model has emerged as a common framework to describe the process of converting image data into primary sequence data (reads) and then turning the reads into information that can be used in comparative analyses. Secondary analysis is the phase where reads are aligned to reference sequences to get gene names, positions, and (or) frequency information that can be used to measure changes, like gene expression, between samples.

The remaining sections of the poster use examples from transcription analysis to illustrate and address the multiple challenges (listed below) that must be overcome to efficiently use NGS.
  • High end infrastructures are needed to manage and work with extremely large data sets
  • Complex, multistep analysis procedures are required to produce meaningful information
  • Multiple reference data are needed to annotate and verify data and sample quality
  • Datasets must be visualized in multiple ways
  • Numerous Internet resources must be used to fill in additional details
  • Multiple datasets must be comparatively analyzed to gain knowledge
Section 2 describes the three different kinds of transcription profiling experiments. This section provides additional background on the methods and what they measure. For example, RNA-Seq and Tag Profiling are commonly used to measure gene expression. In RNA-Seq, DNA libraries are prepared by randomly amplifying short regions of DNA from cDNA. The sequences that are produced will generally cover the entire region of the transcripts that were originally isolated. Hence, it is possible to get information about alternative splicing and biased allelic expression. In contrast, Tag Profiling focuses on creating DNA libraries from discrete points within the RNA molecules. With Tag Profiling, one can quickly measure relative gene expression, but cannot get information about alternative splicing and allelic expression. The table in section 2 discusses these and other issues one must consider when running the different assays.

Sections 3, 4, and 5 outline three transcriptome scenarios (RNA-Seq, Tag Profiling, and Small RNA, respectively) using real data examples (references provided in the poster). Each scenario follows a common workflow involving the preparation of DNA libraries from RNA samples, followed by secondary analysis, followed by tertiary analysis of the data in GeneSifter Analysis Edition.

For RNA-Seq, two datasets corresponding to mouse embryonic stem (ES) cells and embryoid body (EB) cells were investigated. DNA libraries were produced from each cell line. Sequences were collected from each library and compared to the RefSeq (NCBI) database according to the pipeline shown. The screen captures (middle of the panel) show how the individual reads map to each transcript, along with the total numbers of hits summarized by chromosome. The process is repeated twice, once for each cell line, and the two sets of alignments are converted to gene lists for comparative analysis in GeneSifter Analysis Edition to observe differential expression (bottom of the panel).
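
To sketch the core of this comparison in a simplified way (not the GeneSifter pipeline), the example below counts reads per transcript for two samples and reports a log2 ratio. The transcript IDs and counts are invented, and no replicate handling or statistics are applied as they would be in a real analysis.

```python
# Simplified sketch of comparing per-transcript read counts between two
# RNA-Seq samples; counts are invented and no statistical testing is done.
import math

# reads aligned per RefSeq transcript (hypothetical)
es_counts = {"NM_008084": 1200, "NM_009735": 300, "NM_013556": 45}
eb_counts = {"NM_008084": 1150, "NM_009735": 30,  "NM_013556": 400}

total_es, total_eb = sum(es_counts.values()), sum(eb_counts.values())

for tx in sorted(es_counts):
    # scale to reads-per-million so library size differences don't dominate
    es_rpm = 1e6 * es_counts[tx] / total_es
    eb_rpm = 1e6 * eb_counts[tx] / total_eb
    log2_ratio = math.log2((eb_rpm + 1) / (es_rpm + 1))
    print(f"{tx}: log2(EB/ES) = {log2_ratio:5.2f}")
```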

The Tag Profiling panel examines data from a recently published experiment (a reference is provided in the poster) in which gene expression was studied in transgenic mice. I’ll leave out the details of the paper and only point out how this example shows the differences between Tag Profiling and RNA-Seq data. Because Tag Profiling collects data from specific 3’ sites in RNA, the aligned data (middle of the panel) show alignments as single “spikes” toward the 3’ end of transcripts. Occasionally, multiple peaks are observed. The question is whether the additional peaks are the result of isoforms (alternative polyA sites) or of incomplete restriction enzyme digests, and how this might be sorted out. Like RNA-Seq, the bottom panel shows the comparative analysis of replicate samples from the wild type (WT) and transgenic (TG) mice.

Data from a small RNA analysis experiment are analyzed in the third panel. Unlike RNA-Seq and Tag Profiling, this secondary analysis has more comparisons of the reads to different sets of reference sequences. The purpose is to identify and filter out common artifacts observed in small RNA preparations. The pipeline we used, and the data produced, are shown in the middle of the panel. Histogram plots of read length distribution, determined from alignments to different reference sources, are created because an important feature of small RNAs is that they are small. Distributions clustered around 22 nt indicate a good library. Finally, data are linked to additional reports and databases, like miRBase (Sanger Center), to explore results further. In the example shown, the first hit was to a small RNA that had been observed in opossums; now we have a human counterpart. In total, four samples were studied. Like RNA-Seq and Tag Profiling, we can observe the relative expression of each small RNA by analyzing the datasets together (hierarchical clustering, bottom).
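
As a small, hypothetical illustration of that QC step, the snippet below tallies read lengths from a list of small RNA reads; in a library dominated by mature miRNAs, the counts should cluster near 22 nt. The reads themselves are invented examples.

```python
# Hypothetical read-length QC for a small RNA library: a good library should
# show lengths clustered around ~22 nt. Reads here are invented examples.
from collections import Counter

reads = [
    "TAGCTTATCAGACTGATGTTGA",            # 22 nt, miRNA-like
    "TGAGGTAGTAGGTTGTATAGTT",            # 22 nt
    "TAGCTTATCAGACTGATG",                # 18 nt, possible degradation product
    "TGAGGTAGTAGGTTGTATAGTTAGCTGCTG",    # 30 nt, possible adapter read-through
]

length_counts = Counter(len(read) for read in reads)
for length in sorted(length_counts):
    print(f"{length:2d} nt | {'#' * length_counts[length]} ({length_counts[length]})")
```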

Section 6 presents some of the challenges of scale issues that accompany NGS, and how we are addressing these issues with HDF5 technology. This will be a topic of many more posts in the future.

We close the poster by addressing the challenges listed above with the final points:
  • High performance data management systems are being developed through the BioHDF project and GeneSifter system architectures.
  • The examples show how each application and sequencing platform requires a different data analysis workflow (pipeline). GeneSifter provides a platform to develop and make bioinformatics pipelines and data readily available to communities of biologists.
  • The transcriptome is complex; different libraries of sequence data can filter known sequences (e.g. rRNA) and discover new elements (miRNAs) and isoforms of expressed genes.
  • Within a dataset, read maps, tables, and histogram plots are needed to summarize and understand the kinds of sequences present and how they relate to an experiment.
  • Links to Entrez Gene, the UCSC genome browser, and miRBase show how additional information can be integrated into the application framework and used.
  • Next Gen transcriptomics assays are similar to microarray assays in many ways, hence software systems like Geospiza’s GeneSifter are useful for comparative analysis.
You can also get the file, AGBT_2009.pdf

Sunday, February 15, 2009

Three Themes from ABRF and AGBT Part I: The Laboratory Challenge

It's been an exciting week on the road at the AGBT and ABRF conferences. From the many presentations and discussions it is clear that the current and future next generation DNA sequencing (NGS) technologies are changing the way we think about genomics and molecular biology. It is also clear that successfully using these technologies impacts research and core laboratories in three significant areas:
  1. The Laboratory: Running successful experiments requires careful attention to detail.
  2. Bioinformatics: Every presentation called out bioinformatics as a major bottleneck. The data are hard to work with and different NGS experiments require different specialized bioinformatics workflows (pipelines).
  3. Information Technology (IT): The bioinformatics bottleneck is exacerbated by IT issues involving data storage, computation, and data transfer bandwidth.

We kicked off ABRF by participating in the Next Gen DNA Sequencing workshop on Saturday (Feb. 7). It was extremely well attended with presentations on experiences in setting up labs for Next Gen sequencing, preparing DNA libraries for sequencing, and dealing with the IT and bioinformatics.

I had the opportunity to provide the “overview” talk. In that presentation, “From Reads to Datasets, Why Next Gen is not Sanger Sequencing,” I focused on the kinds of things you can do with NGS technology, its power, and the high-level issues that groups are facing today when implementing these systems. I also introduced one of our research projects on developing scalable infrastructures for Next Gen bioinformatics using HDF5, along with high-performing, dynamic software interfaces. Three themes resurfaced again and again throughout the day: one must pay attention to laboratory details, bioinformatics is a bottleneck, and don't underestimate the impact of NGS systems on IT.

In this post, I'll discuss the laboratory details and visit the other themes in posts to come.

Laboratory Management

To better understand the impact of NGS on the lab, we can compare it to Sanger sequencing. In the table below, different categories, ranging from the kinds of samples, to their preparation, to the data, are considered to show how NGS differs from Sanger sequencing. Sequencing samples, for example, are very different between Sanger and NGS. In Sanger sequencing, one typically works with clones or PCR amplicons. Each sample (clone or PCR product) produces a single sequence read. Overall, sequencing systems are robust, so the biggest challenge for labs has been tracking the samples as they move from tube to plate or between wells within plates.

In contrast, NGS experiments involve sequencing DNA libraries and each sample produces millions of reads. Presently, only a few samples are sequenced at a time so the sample tracking issues, when compared to Sanger, are greatly reduced. Indeed, one of the significant advantages and cost savings of NGS is to eliminate the need for cloning or PCR amplification in preparing templates to sequence.

Directly sequencing DNA libraries is a key ability and a major factor that makes NGS so powerful. It also directly contributes to the bioinformatics complexity (more on that in the next post). Each one of the millions of reads produced from a sample corresponds to an individual molecule present in the DNA library. Thus, the overall quality of the data and the things you can learn are a direct function of the library.



Producing good libraries requires that you have a good handle on many factors. To begin, you will need to track RNA and DNA concentrations at different steps of the process. You also need to know the “quality” of the molecules in the sample. For example, RNA assays give the best results when RNA is carefully prepared and free of RNases. In RNA-Seq, the best results are obtained when the RNA is fragmented prior to cDNA synthesis. To understand the quality of the starting RNA, the fragmentation, and the cDNA synthesis steps, tools like agarose gels or Bioanalyzer traces are used to evaluate fragment lengths and determine overall sample quality. Other assays and sequencing projects have similar processes. Throughout both conferences, it was stressed that regardless of whether you are sequencing genomes or small RNAs, performing RNA-Seq, or running other “tag and count” kinds of experiments, you need to pay attention to the details of the process. Tools like the NanoDrop or qPCR procedures need to be used routinely to measure RNA or DNA concentration. Tools like gels and the Bioanalyzer are used to measure sample quality. And, in many cases, both kinds of tools are used.
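To make the bookkeeping concrete, here is a minimal Python sketch of the kind of QC check described above: each sample's concentration and fragment-length readings are compared against expected ranges before the library moves forward. The sample values and thresholds are illustrative only, not recommended settings.

samples = [
    # (sample_id, RNA concentration in ng/uL from a NanoDrop-style reading,
    #  mean fragment length in bp from a Bioanalyzer-style trace)
    ("S001", 185.0, 210),
    ("S002", 12.5, 480),
    ("S003", 220.0, 195),
]

MIN_CONC_NG_UL = 50.0            # hypothetical minimum concentration
FRAGMENT_RANGE_BP = (150, 350)   # hypothetical acceptable fragment size range

for sample_id, conc, frag_len in samples:
    problems = []
    if conc < MIN_CONC_NG_UL:
        problems.append(f"low concentration ({conc} ng/uL)")
    if not (FRAGMENT_RANGE_BP[0] <= frag_len <= FRAGMENT_RANGE_BP[1]):
        problems.append(f"fragment length out of range ({frag_len} bp)")
    status = "OK" if not problems else "CHECK: " + "; ".join(problems)
    print(f"{sample_id}: {status}")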

Through many conversations, it became clear that Bioanalyzer images, NanoDrop reports, and other lab data quickly accumulate during these kinds of experiments. While an NGS experiment is in progress, these data are fairly accessible and the links between data quality and the collected data are easy to see. It only takes a few weeks, however, for these lab data to disperse. They find their way into paper notebooks or unorganized folders on multiple computers. When the results from one sample need to be compared to another, a new problem appears: it becomes harder and harder to find the lab data that correspond to each sample.

To summarize, NGS technology makes it possible to interrogate large ensembles of individual RNA or DNA molecules. Different questions can be asked by preparing the ensembles (libraries) in different ways involving complex procedures. To ensure that the resulting data are useful, the libraries need to be of high and known quality. Quality is measured with multiple tools at different points of the process to produce multiple forms of laboratory data. Traditional methods such as laboratory notebooks, files on computers, and post-it notes, however, make these data hard to find when the time comes to compare results between samples.

Fortunately, the GeneSifter Lab Edition solves these challenges. The Lab Edition of Geospiza’s software platform provides a comprehensive laboratory information management system (LIMS) for NGS and other kinds of genetic analysis assays, experiments, and projects. Using web-based interfaces, laboratories can define protocols (laboratory workflows) with any number of steps. Steps may be ordered and required to ensure that procedures are correctly followed. Within each step, the laboratory can define and collect different kinds of custom data (Nanodrop values, Bioanalyzer traces, gel images, ...). Laboratories using the GeneSifter Lab Edition can produce more reliable information because they can track the details of their library preparation and link key laboratory data to sequencing results.
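To illustrate the idea (this is not the GeneSifter Lab Edition API, just a sketch of the concept), a protocol can be modeled as an ordered list of steps, each of which may be required and may collect its own custom data:

from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    required: bool = True
    data: dict = field(default_factory=dict)   # custom values collected at this step

@dataclass
class Protocol:
    name: str
    steps: list

rna_seq_prep = Protocol(
    name="RNA-Seq library preparation",
    steps=[
        Step("Measure RNA concentration"),
        Step("Fragment RNA"),
        Step("cDNA synthesis"),
        Step("Check library on Bioanalyzer"),
    ],
)

# Record data as the lab completes each step for a given sample.
rna_seq_prep.steps[0].data["nanodrop_ng_per_ul"] = 185.0
rna_seq_prep.steps[3].data["bioanalyzer_trace"] = "S001_trace.png"  # hypothetical file

# Steps are worked in order; a required step with no data collected yet
# blocks further progress through the protocol.
for step in rna_seq_prep.steps:
    done = bool(step.data)
    print(f"{step.name}: {'complete' if done else 'pending'}")
    if step.required and not done:
        break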

Monday, October 6, 2008

Sneak Peek: Genetic Analysis From Capillary Electrophoresis to SOLiD

On October 7, 2008 Geospiza hosted a webinar featuring the FinchLab, the only software product to track the entire genetic analysis process, from sample preparation, through processing to analyzed results.

If you are as disappointed about missing it as we are about you missing it, no worries. You can get the presentation here.

If you are interested in:
  • Learning about Next Gen sequencing applications
  • Seeing what makes the Applied Biosystems SOLiD system powerful for transcriptome analysis, ChIP-Seq, resequencing experiments, and other applications
  • Understanding the flow of data and information as samples are converted into results
  • Overcoming the significant data management challenges that accompany Next Gen technologies
  • Setting up Next Gen sequencing in your core lab
  • Creating a new lab with Next Gen technologies
This webinar is for you!

In the webinar, we talked about the general applications of Next Gen sequencing and focused on using SOLiD to perform Digital Gene Expression experiments by highlighting mRNA Tag Profiling and whole transcriptome analysis. Throughout the talk we gave specific examples about collecting and analyzing SOLiD data and showed how the Geospiza FinchLab solves challenges related to laboratory setup and managing Next Gen data and analysis workflows.

Wednesday, August 20, 2008

Next Gen DNA Sequencing Is Not Sequencing DNA

In the old days, we used DNA sequencing primarily to learn about the sequence and structure of a cloned gene. As the technology and throughput improved, DNA sequencing became a tool for investigating entire genomes. Today, with the exception of de novo sequencing, Next Gen sequencing has changed the way we use DNA sequences. We're no longer looking for new DNA sequences. We're using Next Gen technologies to perform quantitative assays with DNA sequences as the data points. This is a different way of thinking about the data and it impacts how we think about our experiments, data analysis, and IT systems.

In de novo sequencing, the DNA sequence of a new genome, or of genes from the environment, is elucidated. De novo sequencing ventures into the unknown. Each new genome brings new challenges with respect to interspersed repeats, large segmental gene duplications, polyploidy, and interchromosomal variation. The highly redundant data obtained from Next Gen technology lower the cost and speed up this process because less time is required to collect the additional data needed to fill in gaps and finish the work.

The other ultra-high-throughput DNA sequencing applications, on the other hand, focus on collecting sequences from DNA or RNA molecules for which we already have genomic data. Generally called "resequencing," these applications involve collecting sequence reads and aligning them to genomic reference data. Experimental information is obtained by tabulating the frequency, positional information, and variation of the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or population are compared in different ways to make discoveries and draw conclusions.
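A minimal sketch of that tabulation step, assuming reads have already been aligned (here represented simply as start positions on a toy reference), might look like the Python below; a real pipeline would parse aligner output such as SAM/BAM files instead.

from collections import defaultdict, Counter

reference = "ACGTACGTACGTACGT"
aligned_reads = [  # (start position on the reference, read sequence)
    (0, "ACGTACGT"),
    (4, "ACGTACGT"),
    (4, "ACGAACGT"),   # carries a mismatch at reference position 7
    (8, "ACGTACGT"),
]

# Tally the bases observed at each reference position.
base_counts = defaultdict(Counter)
for start, read in aligned_reads:
    for offset, base in enumerate(read):
        base_counts[start + offset][base] += 1

# Report positions where some reads disagree with the reference base.
for pos in sorted(base_counts):
    counts = base_counts[pos]
    coverage = sum(counts.values())
    ref_base = reference[pos]
    non_ref = coverage - counts[ref_base]
    if non_ref:
        print(f"pos {pos}: ref {ref_base}, coverage {coverage}, "
              f"non-reference bases {non_ref} ({non_ref / coverage:.0%})")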

DNA sequences are information rich data points

EST (expressed sequence tag) sequencing was one of the first applications to use sequence data in a quantitative way. In EST applications, mRNA from cells was isolated, converted to cDNA, cloned, and sequenced. The data from an EST library provided both new and quantitative information. Because each read came from a single molecule of mRNA, a set of ESTs could be assembled and counted to learn about gene expression. The composition and number of distinct mRNAs from different kinds of tissues could be compared and used to identify genes that were expressed at different time points during development, in different tissues, and in different disease states, such as cancer. The term "tag" was invented to indicate that ESTs could also be used to identify the genomic location of mRNA molecules. Although the information from EST libraries has been informative, lower-cost methods such as microarray hybridization and real-time PCR assays replaced EST sequencing over time, as more genomic information became available.

Another quantitative use of sequencing has been to assess allele frequency and identify new variants. These assays are commonly known as "resequencing" since they involve sequencing a known region of genomic DNA in a large number of individuals. Since the regions of DNA under investigation are often related to health or disease, the NIH has proposed that these assays be called "Medical Sequencing." The suggested change also serves to avoid giving the public the impression that resequencing is being carried out to correct mistakes.

Unlike many assay systems (hybridization, enzyme activity, protein binding ...) where an event or complex interaction is measured and described by a single data value, a quantitative assay based on DNA sequences yields a greater variety of information. In a technique analogous to using an EST library, an RNA library can be sequenced, and the expression of many genes can be measured at once, by counting the number of reads that align to a given position or reference. If the library is prepared from DNA, a count of the aligned reads can measure the copy number of a gene. The composition of the read data itself can be informative. Mismatches in aligned reads can help discern alleles of a gene, or members of a gene family. In a variation assay, reads can both assess the frequency of a SNP and discover new variation. DNA sequences could be used in quantitative assays to some extent with Sanger sequencing, but the cost and labor requirements prevented widespread adoption.
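As a concrete example of the counting idea, the sketch below tallies reads per reference once each read has been assigned to a gene; the read-to-gene assignments are hypothetical placeholders for real aligner output.

from collections import Counter

# (read_id, reference the read aligned to) -- hypothetical assignments
alignments = [
    ("read_0001", "GENE_A"),
    ("read_0002", "GENE_A"),
    ("read_0003", "GENE_B"),
    ("read_0004", "GENE_A"),
    ("read_0005", "GENE_C"),
]

# Counting aligned reads per reference gives a simple expression (or, for
# DNA libraries, copy number) measurement for each reference.
read_counts = Counter(ref for _, ref in alignments)

for gene, n in read_counts.most_common():
    print(f"{gene}: {n} reads")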

Next Gen adds a global perspective and new challenges

The power of Next Gen experiments comes from sequencing DNA libraries in a massively parallel fashion. Traditionally, a DNA library was used to clone genes. The library was prepared by isolating and fragmenting genomic DNA, ligating the pieces to a plasmid vector, transforming bacteria with the ligation products, and growing colonies of bacteria on plates with antibiotics. The plasmid vector would allow a transformed bacterial cell to grow in the presence of an antibiotic so that transformed cells could be separated from other cells. The transformed cells would then be screened for the presence of a DNA insert or gene of interest through additional selection, colorimetric assay (e.g. blue / white), or blotting. Over time, these basic procedures were refined and scaled up in factory style production to enable high throughput shotgun sequencing and EST sequencing. A significant effort and cost in Sanger sequencing came from the work needed to prepare and track large numbers of clones, or PCR-products, for data linking and later retrieval to close gaps or confirm results.

In Next Gen sequencing, DNA libraries are prepared, but the DNA is not cloned. Instead, other techniques are used to "separate," amplify, and sequence individual molecules. The molecules are then sequenced all at once, in parallel, to yield large global data sets in which each read represents the sequence of an individual molecule. The frequency of occurrence of a read in the population of reads can now be used to measure the concentration of individual DNA molecules. Sequencing DNA libraries in this fashion significantly lowers costs and makes previously cost-prohibitive experiments possible. It also changes how we need to think about and perform our experiments.
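A minimal sketch of "read frequency as molecule concentration": count how many times each distinct sequence occurs in the read set and normalize to reads per million so libraries of different depths can be compared. The reads shown are hypothetical stand-ins for real run data.

from collections import Counter

reads = [
    "TTAGCCAGTAGGTAGCA",
    "GGCATTACAGGTTCAGA",
    "TTAGCCAGTAGGTAGCA",
    "TTAGCCAGTAGGTAGCA",
    "CCATGGATTCAGGACCA",
]

# Each distinct sequence is a "tag"; its count reflects the relative
# abundance of that molecule in the library.
tag_counts = Counter(reads)
total = len(reads)

for tag, n in tag_counts.most_common():
    print(f"{tag}: {n} reads, {n / total * 1e6:.0f} per million")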

The first change is that preparing the DNA library is the experiment. Tag profiling, RNA-Seq, small RNA, ChIP-Seq, DNase hypersensitivity, methylation, and other assays all have specific ways in which DNA libraries are prepared. Starting materials and fragmentation methods define the experiment and how the resulting datasets will be analyzed and interpreted. The second change is that large numbers of clones no longer need to be prepared, tracked, and stored. This reduces the number of people needed to process samples, and reduces the need for robotics, large numbers of thermocyclers, and other laboratory equipment. Work that used to require a factory setting can now be done in a single laboratory, or a mailroom if you believe the ads.

Attention to details counts

Even though Next Gen sequencing gives us the technical capability to ask detailed and quantitative questions about gene structure and expression, successful experiments demand that we pay close attention to the details. Obtaining data that are free of confounding artifacts and accurately represent the molecules in a sample demands good technique and a focus on detail. DNA libraries no longer involve cloning, but their preparation does require multiple steps performed over multiple days. During this process, different kinds of data, ranging from gel images to discrete data values, may be collected and used later for troubleshooting. Tracking the experimental details requires a system that can be configured to collect information from any number and kind of process. The system also needs to be able to link data to the samples, and convert the information from millions of sequence data points to tables, graphics, and other representations that match the context of the experiment and give a global view of how things are working. FinchLab is that kind of system.
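As a rough illustration of that linking step (not a FinchLab schema; the field names and values are made up), lab QC records and per-sample sequencing summaries can be joined into a single table for cross-sample comparison:

import csv

lab_qc = {   # sample_id -> lab measurements collected during library prep
    "S001": {"nanodrop_ng_per_ul": 185.0, "bioanalyzer_peak_bp": 210},
    "S002": {"nanodrop_ng_per_ul": 12.5, "bioanalyzer_peak_bp": 480},
}

run_summary = {   # sample_id -> summary of the sequencing results
    "S001": {"total_reads": 8_200_000, "aligned_reads": 7_400_000},
    "S002": {"total_reads": 1_100_000, "aligned_reads": 310_000},
}

# Write one row per sample so library prep data and run results can be
# compared side by side.
with open("sample_summary.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["sample", "ng_per_ul", "peak_bp", "total_reads", "pct_aligned"])
    for sample in sorted(lab_qc):
        qc, run = lab_qc[sample], run_summary[sample]
        writer.writerow([
            sample,
            qc["nanodrop_ng_per_ul"],
            qc["bioanalyzer_peak_bp"],
            run["total_reads"],
            round(100 * run["aligned_reads"] / run["total_reads"], 1),
        ])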