
Wednesday, September 29, 2010

A Genomics Genealogy

Deep sequencing technologies have radically changed how we study biology. Deciding what technology and software to use can be daunting. Choices become easier when the relationships between different DNA sequencing applications are understood.


A brief history 

DNA sequencing grew from our desire to understand how the instructions for the biochemistry of life are encoded in an organism’s DNA. If we know the precise ordering and organization of an organism’s DNA sequence, we can presumably unlock a code that reveals these instructions. Accomplishing this goal required the creation of a new field, molecular biology, and new technologies to sequence genes.

The first sequencing methods were arduous. They combined nuclease digestion with thin-layer chromatography to measure di- and trinucleotides that could be puzzled together. Later, Maxam and Gilbert replaced enzymatic DNA degradation with a chemical fragmentation method that enabled the reading of ordered bases from 32P-labeled fragments separated by electrophoresis.

The Sanger method, which used dideoxynucleotide triphosphates to create ensembles of DNA molecules terminated at each base, soon replaced Maxam-Gilbert sequencing. The next innovation was to color code DNA with fluorescent dyes so that molecules could be interrogated with a laser and camera coupled to a computer. This innovation automated “high-throughput” DNA sequencing systems, initially with polyacrylamide gels and later with capillary electrophoresis, and made it possible to sequence the human and other genomes. It also created the first transcriptome analysis method, expressed sequence tag (EST) sequencing.

Despite 20 years of advances, however, the high-throughput sequencing methods were not high-enough-throughput to realistically interrogate DNA and RNA molecules in creative ways. Big questions (genomes, ESTs, meta-genomes) required large, factory-like approaches to automate sample preparation and collect sequences because a fundamental problem had yet to be solved: each sequence was obtained from an individually purified DNA clone or PCR product.

Real high-throughput is massively parallel throughput 

The next-generation DNA sequencing (NGS) technologies free researchers from the need to clone or purify every molecule. They all share the common innovation that DNA sequencing is performed in a massively parallel format: a library, or ensemble of millions of DNA molecules, is sequenced simultaneously. Data collection costs are dramatically decreased through miniaturization and by eliminating the need for warehouses of colony pickers, prep robots, sequencing instruments, and large teams of people.

The new problems are dealing with the data that are produced and the rising cost of computation. As NGS opens new possibilities to measure DNA and RNA in novel ways, each application requires a specific laboratory procedure that must be coupled to a specific analysis methodology.

Sequencing genealogy is defined by the questions 

In an evolutionary model, the history of cloning, restriction site mapping, and Sanger sequencing forms the trunk of the genomics application tree (top figure), from which branches develop as new applications emerge.

NGS has driven the evolution of three main sequencing branches: De Novo, Functional Genomics, and Variation Assays. The De Novo, or Exploratory, branch contains three subbranches: new genomes (projects that seek to determine the complete genome sequence of an organism), meta-genomes (projects in which DNA fragments are sequenced from environmental samples), and meta-transcriptomes (projects in which cDNA fragments are sequenced from environmental samples).


The Functional Genomics branch is growing fast. In these experiments, different collections of RNA or DNA molecules from an organism, tissue, or cells are isolated and sequenced to measure gene expression and how it is regulated. Three subbranches describe the different kinds of functional genomics: Expression, Regulation, and EpiGenomics. Each of these subbranches can be further divided into specific assay groups (DGE, RNA-Seq, small RNA, etc.), which can be subdivided further into specialized procedures (e.g., RNA-Seq with strandedness preserved) defined by laboratory protocols, kits, and instruments. When the experiments are refined and made reproducible, they become assays.

Variation Assays form the third main branch of the tree. Genomic sequences are compared within and between populations to link genotype and phenotype. In special cases like cancer and immunology research, variation assays are used to observe changes within an organism’s somatic genomes over time. Today, variation, or resequencing, assays measure nucleotide substitutions and small insertions and deletions in whole genomes and exomes. If linked sequencing strategies (mate-pairs, paired-ends) are used, larger structural changes, including copy number variations, can also be measured.
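To make the basic comparison concrete, a toy substitution caller might tally the bases aligned over each reference position and report positions where a well-supported consensus disagrees with the reference. The sketch below uses hypothetical inputs; production callers also model base quality, mapping quality, and ploidy.

```python
def call_substitutions(reference, pileups, min_depth=10, min_frac=0.8):
    """Report positions where aligned bases disagree with the reference.

    'reference' is a string indexed by 0-based position; 'pileups' maps
    position -> list of bases observed at that position. A toy caller:
    real tools weigh base quality, mapping quality, and ploidy.
    """
    variants = []
    for pos, bases in sorted(pileups.items()):
        if len(bases) < min_depth:
            continue  # too little coverage to call anything
        top = max(set(bases), key=bases.count)
        if top != reference[pos] and bases.count(top) / len(bases) >= min_frac:
            variants.append((pos, reference[pos], top))  # (position, ref, alt)
    return variants

# Example: call_substitutions(ref_seq, {1042: ["T"] * 12}) -> [(1042, "C", "T")]
# (assuming ref_seq[1042] == "C")
```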

Why is this important?

As a software provider with both deep lab and analysis experience, we [Geospiza] are often asked which instrument platform is best or how our software stacks up against other available options. The answer, of course, depends on what you want to do. De Novo applications benefit from the long reads offered by platforms like 454. Many of the assay-based applications demand ultra-deep sequencing with very high numbers of sequences (reads), as provided by the short-read platforms (Illumina, SOLiD). New single-molecule sequencing platforms like PacBio's are targeting a wide range of applications, but have so far been demonstrated best for long-read uses and novel methylation assays.

From an informatics perspective, the exploratory and assay-based branches have distinct software requirements. Exploratory applications require that reads be assembled into contigs, which must be further ordered into scaffolds to reach a complete sequence. In meta-genomics or meta-transcriptomics applications, data are assembled to obtain gene sequences. These projects are further complicated by orthologous and paralogous sequences, and by highly expressed genes that overrepresent certain sequences. In these situations, specialized hardware or complex data-reduction strategies are needed to make assembly practical. Once data are assembled, they are functionally annotated in a second computational phase using tools like BLAST.
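As a rough sketch of how these two phases chain together, the fragment below drives a generic assembler (Velvet) and BLAST+ from Python. The file names, k-mer size, and tool choices are placeholders rather than a prescribed pipeline, and both tools are assumed to be installed and on the PATH.

```python
import subprocess

# Phase 1: assemble reads into contigs (hash length 31 chosen arbitrarily).
subprocess.run(["velveth", "asm_dir", "31", "-fasta", "reads.fa"], check=True)
subprocess.run(["velvetg", "asm_dir"], check=True)

# Phase 2: functionally annotate the contigs with BLAST (tabular output),
# assuming a formatted 'nt' database is available locally.
subprocess.run(
    ["blastn", "-query", "asm_dir/contigs.fa", "-db", "nt",
     "-outfmt", "6", "-out", "contigs_vs_nt.tsv"],
    check=True,
)
```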

Assay-based data analysis also has two distinct phases, but they differ significantly from De Novo sequencing. The first phase involves aligning (or mapping) reads to reference data sources and then reducing the aligned data into quantitative values. At least one reference is required, and the better it is annotated, the more informative the initial results will be. Alignment differs from assembly in that reads are separately compared to a reference rather than amongst themselves. Alignment processing capacity can be easily scaled with multiple inexpensive computers, whereas assembly processing cannot.
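To illustrate the reduction step, the sketch below tallies mapped reads per reference sequence straight from a SAM file. It is deliberately minimal: the file name is hypothetical, and real pipelines also handle multi-mapping reads, strandedness, and per-feature (gene or exon) assignment.

```python
from collections import Counter

def count_reads_per_reference(sam_path):
    """Tally mapped reads per reference sequence in a SAM file."""
    counts = Counter()
    with open(sam_path) as sam:
        for line in sam:
            if line.startswith("@"):  # skip header lines
                continue
            fields = line.rstrip("\n").split("\t")
            flag, rname = int(fields[1]), fields[2]
            if flag & 0x4 or rname == "*":  # skip unmapped reads
                continue
            counts[rname] += 1
    return counts

# Example: counts = count_reads_per_reference("sample1.sam")
```

Because each read is processed independently, work like this can be partitioned across many machines, which is why alignment scales out so much more easily than assembly.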

The second phase of Assay-based sequencing is to produce a discrete output as defined by a diagnostic application, or to compare the quantitative values computed from the alignments across several samples obtained from different individuals and/or treatments relative to controls. This phase requires statistical tools to normalize data, filter false positives and negatives, and measure differences. Assay-based applications become more informative when large numbers of samples and replicates are included in a study.
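A minimal sketch of this phase, assuming per-gene counts like those produced above: scale each sample to reads per million and compute per-gene log2 fold changes. The pseudocount ('floor') and the absence of replicate-aware statistics are simplifications; real analyses add significance testing and false-positive filtering.

```python
import math

def rpm_normalize(counts):
    """Scale raw per-gene counts to reads per million mapped reads."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: 1e6 * c / total for gene, c in counts.items()}

def log2_fold_changes(control, treated, floor=0.5):
    """Per-gene log2(treated/control); 'floor' avoids division by zero."""
    genes = set(control) | set(treated)
    return {
        g: math.log2((treated.get(g, 0) + floor) / (control.get(g, 0) + floor))
        for g in genes
    }

# Example:
# fc = log2_fold_changes(rpm_normalize(ctrl_counts), rpm_normalize(case_counts))
```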

Connecting the dots 

While the sequencing applications can be grouped and summarized in different ways, they are also interrelated. For example, De Novo projects are open-ended and exploratory, but their end product, a well-annotated reference sequence, is the foundation for Functional Genomics and Variation applications. Variation analysis is only useful if we can assign function to specific genotypes. Functional assignments come, in part, from previous experiments and genomic annotations, but are increasingly being produced by sequencing assays, so the new challenge is integrating the data obtained from different assays into coherent datasets that can link many attributes to a set of genotypes.

NGS clearly opens new possibilities for studying and characterizing biological systems. Different applications require different sequencing platforms, laboratory procedures, and software systems that can organize analysis tools and automate data processing. On this last point, as researchers evaluate their projects and their options for success, they need to identify informatics groups that have deep experience, available solutions, and strong capabilities to meet the next challenges. Geospiza is one such group.

Further Reading

DNA Sequencing History

Gilbert W, Maxam A (1973) The nucleotide sequence of the lac operator. Proc Natl Acad Sci U S A 70:3581

Maxam AM, Gilbert W (1977) A new method for sequencing DNA. Proc Natl Acad Sci U S A 74:560

Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74:5463-7

Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SB, Hood LE (1986) Fluorescence detection in automated DNA sequence analysis. Nature 321:674-9

Adams MD, Soares MB, Kerlavage AR, Fields C, Venter JC (1993) Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4:373-80

International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860-921

Venter JC, Adams MD, Myers EW, et al. (2001) The sequence of the human genome. Science 291:1304-51

FinchTalks

Wednesday, July 14, 2010

Increasing the Scale of Deep Sequencing Data Analysis with BioHDF

Last month, at the Department of Energy's Sequencing, Finishing and Analysis in the Future meeting, I presented Geospiza's product development work and how BioHDF is contributing to scalable infrastructures. The abstract, a link to the presentation, and acknowledgments are posted below.

Abstract

Next Generation DNA Sequencing (NGS) technologies are powerful tools for rapidly sequencing genomes and studying functional genomics. Presently, the value of NGS technology has largely been demonstrated on individual sample analyses. The full potential of NGS will be realized when it can be used in multisample experiments that involve different measurements and include replicates and controls to make valid statistical comparisons. Arguably, improvements in current technology, and soon-to-be-available “third” generation systems, will make it possible to simultaneously measure hundreds to thousands of individual samples in single experiments to study transcription, alternative splicing, and how sequences vary between individuals and within expressed genes. However, several bioinformatics systems challenges must be overcome to effectively manage both the volumes of data being produced and the complexity of processing the numerous datasets that will be generated.

Future bioinformatics applications need to be developed on common standard infrastructures that can reduce overall data storage, increase data processing performance, integrate information from multiple sources, and are self-describing. HDF technologies meet all of these requirements, have a long history, and are widely used in data-intensive science communities. They consist of general data file formats, software libraries, and tools for manipulating the data. Compared to emerging standards such as the SAM/BAM formats, HDF5-based systems demonstrate improved I/O performance and improved methods to reduce data storage. HDF5 is also more extensible and can support multiple data indexes and store multiple data types. For these reasons, HDF5 and its BioHDF implementation are well qualified as standards for implementing data models in binary formats to support the next generation of bioinformatics applications. Through this presentation we will demonstrate BioHDF's latest features in NGS applications that target transcription analysis and resequencing.
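As a rough illustration of the storage pattern, the sketch below uses generic h5py (not the BioHDF API itself) to write chunked, compressed alignment columns with self-describing metadata; the dataset names, fields, and compression choice are ours for the example.

```python
import h5py
import numpy as np

# Toy aligned-read records: reference name, position, mapping quality.
refs = np.array([b"chr1", b"chr1", b"chr2"])
positions = np.array([10452, 99871, 3021], dtype=np.int64)
mapq = np.array([60, 37, 60], dtype=np.uint8)

with h5py.File("alignments.h5", "w") as f:
    grp = f.create_group("aligned_reads")
    # Chunked, gzip-compressed datasets reduce storage while still
    # allowing fast slicing of read subsets.
    grp.create_dataset("ref", data=refs, compression="gzip", chunks=True)
    grp.create_dataset("pos", data=positions, compression="gzip", chunks=True)
    grp.create_dataset("mapq", data=mapq, compression="gzip", chunks=True)
    grp.attrs["sample"] = "example_sample"  # self-describing metadata

# Reading back a slice later:
# with h5py.File("alignments.h5") as f:
#     first_two = f["aligned_reads/pos"][:2]
```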

SciVee Video


Acknowledgments

Contributing Authors: Todd Smith (1), Christopher E Mason (2), Paul Zumbo (2), Mike Folk (3), Dana Robinson (3), Mark Welsh (1), Eric Smith (1), N. Eric Olson (1)

1. Geospiza, Inc., 100 West Harrison N. Tower 330, Seattle, WA 98119
2. Department of Physiology and Biophysics, Weill Cornell Medical College, 1305 York Ave., New York, NY 10021
3. The HDF Group, 1901 S. First St., Champaign, IL 61820

Funding: NIH: STTR HG003792

Friday, March 19, 2010

RNA Deep Sequencing - Beyond Proof of Concept

ABRF 2010 begins this weekend. In addition to my LIMS presentation on Sunday, I will present our poster featuring data analysis of sequences from "Sex-specific and lineage-specific alternative splicing in primates" (Blekhman et al.) in GeneSifter Analysis Edition.

The poster number is RP-3. Stop by and see how we learned that not all samples are what they seem to be ...

Abstract 

Next Generation DNA Sequencing (NGS) technologies are powerful tools for rapidly sequencing genomes and studying functional genomics. Presently, the value of NGS technology has largely been demonstrated on individual sample analyses (1-3). The full potential of NGS will be realized when it can be used in multisample experiments that involve different measurements and include replicates and controls to make valid statistical comparisons. Arguably, improvements in current technology, and soon-to-be-available “third” generation systems, will make it possible to simultaneously measure hundreds to thousands of individual samples in single experiments to study transcription, alternative splicing, and how sequences vary between individuals and within expressed genes. However, several bioinformatics systems challenges must be overcome to effectively manage both the volumes of data being produced and the complexity of processing the numerous datasets that will be generated.

In this poster we present a system used to verify and further characterize previously published data from a gene expression study that includes both replicates and experimental values, comparing sex- and lineage-specific alternative splicing in primates (4). This system, developed on a high-performance computing architecture, stores and organizes the data, aligns millions of reads to different reference sequences, identifies and removes artifacts, executes comparative and statistical analyses, and finally links results to pathway and ontological information for making discoveries and confirming hypotheses. The supporting infrastructure includes intuitive user interfaces for organizing data, executing analytical operations, viewing summarized reports, and navigating through details in the results, and can be operated easily by biologists.

1. Marioni JC, et al. (2008) Genome Res.

2. Ramsköld D, et al. (2009) PLoS Comput Biol.

3. Pleasance ED, et al. (2010) Nature.

4. Blekhman R, et al. (2009) Genome Res.

Wednesday, February 3, 2010

Sneak Peek: Data Analysis Methods for Whole Transcriptome Sequencing Applications – Challenges and Solutions

RNA sequencing is one of the most popular Next Generation Sequencing (NGS) applications. Next Thursday, February 11, at 10:00 A.M. PST (1:00 P.M. EST), we kick off our 2010 webinar series with a presentation designed to help you understand whole transcriptome data analysis and what can be learned from these experiments. In addition, we will show off some of our latest tools and interfaces that can be used to discover new RNAs, new splice forms of transcripts, and alleles of expressed genes.

Summary

RNA sequencing applications such as Whole Transcriptome Analysis, Tag Profiling, and Small RNA Analysis allow whole-genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 500 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample (gene-level expression summaries, exon usage, splice junctions, single nucleotide variants, insertions, and deletions), these applications are also ideal for the identification of novel RNAs as well as novel splicing events.

This presentation will provide an overview of Whole Transcriptome data analysis workflows with emphasis on calculating gene and exon level expression values as well as identifying splice junctions and variants from short read data. Comparisons of multiple groups to identify differential gene expression as well as differential splicing will also be discussed. Using data drawn from the GEO data repository and Short Read Archive (SRA), analysis examples will be presented for both Illumina’s GA and Lifetech’s SOLiD instruments.

Register Today!

Thursday, December 31, 2009

2009 Review

The end of the year is a good time to reflect, review accomplishments, and think about the year to come. 2009 was a good year for Geospiza’s customers, with many exciting accomplishments for the company. Highlights are reviewed below.

Two products form a complete genetic analysis system


Geospiza’s two core products, GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE), help laboratories do their work and scientists analyze their data. GSLE is the LIMS (Laboratory Information Management System) that laboratories, from service labs to high-throughput data production centers, use to collect information about samples, track and manage laboratory procedures, organize and process data, and deliver data and results back to researchers. GSLE supports traditional DNA sequencing (Sanger), fragment analysis, genotyping, microarrays, Next Generation Sequencing (NGS) and other technologies.

In 2008, Geospiza released the third version of the platform (back then it was known as FinchLab). This version launched a new way of providing LIMS solutions. Traditional LIMS systems require extensive programming and customization to meet a laboratory’s specific requirements. They include a very general framework designed to support a wide range of activities. Their advantage is that they are highly customizable. However, this advantage comes at the expense of very high acquisition costs accompanied by lengthy requirements planning and programming before they become operational.

In contrast, GSLE contains default settings that support genetic analysis out-of-the-box, while allowing laboratories to customize operations without programmer support. Default settings in GSLE support DNA sequencing, microarray, and genotyping services. The GSLE abstraction layer supports extensive configuration to meet specific needs as they arise. Through this design, the costs of acquiring and operating a high-quality, advanced LIMS system are significantly reduced.

Throughout 2009, hundreds of features were added to GSLE to increase support for instruments and data types, and to improve how laboratory procedures (workflows) are created, managed, and shared. Enhancements were made to features like experiment ordering, organization, and billing. We also added new application programming interfaces (APIs) to enable integration with enterprise software. Specific highlights included:
  • Extending microarray support to include sample sheet generation and automate uploading files
  • Improving NGS file and data browsing to simplify the process of searching and viewing the thousands of files produced in Next Gen sequencing runs
  • Making downloads of very large, gigabase-scale NGS data files robust and easy
  • Adding worksets to group DNA and RNA samples in customized ways that facilitate laboratory processing
  • Creating APIs to utilize external password servers and programmatically receive data using GSLE form objects
  • Enhancing ways for groups to add HTML to pages to customize their look and feel
In addition to the above features, we’ve also completed development on methods to multiplex NGS samples and track MIDs (molecular identifiers and molecular barcodes), enter laboratory data like OD values and bead counts in batches, create orders with multiple plates, and access SQL queries through an API. Look for these great features and more in the early part of 2010.

GSAE

As noted, GSAE is Geospiza’s data analysis product. While GSLE is capable of running advanced data analysis pipelines, the primary focus of data analysis in GSLE is quality control; thus its data analyses and presentation focus on single samples. GSAE provides the infrastructure and tools to compare results between samples. In the case of NGS, GSAE also provides more reports and data interactions. GSAE began as a web-based microarray data analysis platform, making it well suited for NGS-based gene expression assays. Over 2009, many new features were added to extend its utility to NGS data analysis, with a focus on whole transcriptome analysis. Highlights included:
  • Developing data analysis pipelines for RNA-Seq, Small RNA, ChIP-Seq, and other kinds of NGS assays
  • Adding tools to visualize and discover alternatively spliced transcripts in gene expression assays
  • Extending expression analysis tools to include interactive volcano plots and unbalanced two-way ANOVA computations
  • Increasing NGS transcriptome analysis capabilities to include variation detection and visualization
The above features fulfill the requirements needed to make a platform complete for both NGS- and microarray-based gene expression analysis. The addition of variation detection and visualization also lays the groundwork for GSAE to extend its market leadership to resequencing data analysis.

Geospiza Research

In 2009 Geospiza won two research awards in the form of Phase II STTR and Phase I SBIR grants. The STTR project is researching new ways to organize, compress, and access NGS data by adapting HDF technologies to bioinformatics. Through this work we are developing a robust data management infrastructure that supports our NGS sequencing analysis pipelines and interactive user interfaces. The second award targets NGS-based variation detection. This work began in the last quarter of the year, but is already delivering new ways to identify and visualize variants in RNA-Seq and whole transcriptome analysis.

To learn more about our progress in 2009, visit our news page. It includes our press releases and reports in the news, publications citing our software, and webinars where we have presented our latest and greatest.

As we close 2009, we especially want to thank our customers and collaborators for their support in making the year successful and we look forward to an exciting year ahead in 2010.

Wednesday, September 23, 2009

GeneSifter in Current Protocols

This month we are pleased to report Geospiza's publication of the first standard protocols for analyzing Next Generation Sequencing (NGS) data. The publication, appearing in the September issue of Current Protocols, addresses how to analyze data from both microarray and NGS experiments. The abstract and links to the paper and our press release are provided below.

Abstract

Transcription profiling with microarrays has become a standard procedure for comparing the levels of gene expression between pairs of samples, or multiple samples following different experimental treatments. New technologies, collectively known as next-generation DNA sequencing methods, are also starting to be used for transcriptome analysis. These technologies, with their low background, large capacity for data collection, and dynamic range, provide a powerful and complementary tool to the assays that formerly relied on microarrays. In this chapter, we describe two protocols for working with microarray data from pairs of samples and samples treated with multiple conditions, and discuss alternative protocols for carrying out similar analyses with next-generation DNA sequencing data from two different instrument platforms (Illumina GA and Applied Biosystems SOLiD).

In the chapter we cover the following protocols:
  • Basic Protocol 1: Comparing Gene Expression from Paired Sample Data Obtained from Microarray Experiments
  • Alternate Protocol 1: Compare Gene Expression from Paired Samples Obtained from Transcriptome Profiling Assays by Next-Generation DNA Sequencing
  • Basic Protocol 2: Comparing Gene Expression from Microarray Experiments with Multiple Conditions
  • Alternate Protocol 2: Compare Gene Expression from Next-Generation DNA Sequencing Data Obtained from Multiple Conditions

Links

To view the abstract, contents, figures, and literature cited online visit: Curr. Protoc. Bioinform. 27:7.14.1-7.14.34

To view the press release visit: Geospiza Team Publishes First Standard Protocol for Next Gen Data Analysis

Saturday, September 12, 2009

Sneak Peek: Sequencing the Transcriptome: RNA Applications for Next Generation Sequencing

Join us this coming Wednesday, September 16, 2009, at 10:00 am Pacific Daylight Time (San Francisco, GMT-07:00) for a webinar on whole transcriptome analysis. In the presentation you will learn how GeneSifter Analysis Edition can be used to identify novel RNAs and novel splice events within known RNAs.

Abstract:

Next Generation Sequencing applications such as RNA-Seq, Tag Profiling, Whole Transcriptome Sequencing and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample, these applications are also ideal for the identification of novel RNAs and novel splicing events for known RNAs.

This presentation will provide an overview of the RNA applications using data from the NCBI's GEO database and Short Read Archive with an emphasis on converting raw data into biologically meaningful datasets. Data analysis examples will focus on methods for identifying differentially expressed genes, novel genes, differential splicing and 5’ and 3’ variation in miRNAs.

To register, please visit the event page.