Monday, June 22, 2009

Sneak Peek: RNA-Seq - Global Profiling of Gene Activity and Alternative Splicing

Join us June 30 at 10:00 am PDT. Eric Olson, Geospiza's VP of Product Development, will present a webinar on using RNA-Seq to measure gene expression and discover alternatively spliced messages using GeneSifter Analysis Edition.

Abstract

Next Generation Sequencing applications such as RNA-Seq, Tag Profiling and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample, these applications are also ideal for the identification of novel RNAs and novel splicing events for known RNAs.

This presentation will provide an overview of the RNA-Seq data analysis process with emphasis on calculating gene and exon level expression values as well as identifying splice junctions from short read data. Comparisons of multiple groups to identify differential gene expression as well as differential splicing will also be discussed. Using data drawn from the GEO data repository and Short Read Archive (SRA), analysis examples will be presented for both Illumina’s GA and ABI’s SOLiD instruments.
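To make the gene-level calculation concrete, here is a minimal Python sketch of an RPKM-style expression value (reads per kilobase of exon model per million mapped reads) computed from read counts. The counts, gene lengths, and function shown are invented for illustration and are not GeneSifter's actual implementation.

# Minimal sketch: RPKM-style gene-level expression from read counts.
# Counts and exon-model lengths below are hypothetical examples.

def rpkm(gene_counts, gene_lengths_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads."""
    values = {}
    for gene, count in gene_counts.items():
        kb = gene_lengths_bp[gene] / 1000.0
        millions = total_mapped_reads / 1e6
        values[gene] = count / (kb * millions)
    return values

counts = {"TP53": 1200, "GAPDH": 45000}     # hypothetical aligned-read counts
lengths = {"TP53": 2580, "GAPDH": 1310}     # hypothetical exon-model lengths (bp)
print(rpkm(counts, lengths, total_mapped_reads=9300000))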

To register, visit the Geospiza WebEx event page.

Monday, June 15, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part III: The HDF5 Advantage

The Next Generation DNA Sequencing (NGS) bioinformatics bottleneck is related to the complexity of working with the data, analysis programs, and numerous output files that are produced as the data are converted from images to final results. Current systems lack well-organized data models and corresponding infrastructures for storing data and analysis information, resulting in significant levels of data processing, reprocessing, and data copying redundancy. Such systems can be improved with data management technologies like HDF5.

In this third installment of the bloginar, results from our initial work with HDF5 are presented. Previous posts have provided an introduction to the series and background on NGS.

Working with NGS data

With the exception of de novo sequencing, in which novel genomes or transcriptomes are analyzed, many NGS applications can be thought of as quantitative assays in which DNA sequences are highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data, and quantitative information is obtained by tabulating the frequency, positional information, and variation of the bases between sequences in the alignments. Data tables from samples that differ by experimental treatment, environment, or population are then compared in different ways to make additional discoveries and draw conclusions. Whether the assay is to measure gene expression, study small RNAs, understand gene regulation, or quantify genetic variation, a similar process is followed (sketched in code after the list):
  1. A large dataset of sequence reads is collected in a massively parallel format.
  2. The reads are aligned to reference data.
  3. Alignment data, stored in multiple kinds of output files, are parsed, reformatted, and computationally organized to create assay-specific reports.
  4. The reports are reviewed and decisions are made for how to work with the next sample, experiment, or assay.
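The flow of data through these four steps can be sketched with toy, in-memory data. The exact-substring matching below is only a stand-in for a real short-read aligner, and the sequences are invented.

# Schematic, runnable sketch of the four-step process using toy data.

from collections import defaultdict

reads = ["ACGT", "CGTA", "TTTT"]                        # step 1: toy reads
reference = {"chr1": "AACGTACGTAA", "chr2": "TTTTTTTT"}

def align(reads, reference):                            # step 2 (stand-in aligner)
    for read in reads:
        for name, seq in reference.items():
            pos = seq.find(read)
            if pos >= 0:
                yield (read, name, pos)

def tabulate(alignments):                               # step 3: assay-specific table
    table = defaultdict(list)
    for read, name, pos in alignments:
        table[name].append(pos)
    return table

report = tabulate(align(reads, reference))              # step 4: review the report
for name, positions in sorted(report.items()):
    print(name, len(positions), "aligned reads at", positions)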
Current practices, in which multiple programs create different kinds of information formatted in different ways, make this process difficult even for single samples. Achieving our main goal of comparing analyses across multiple samples at a time is harder still. Presently, the four steps listed above must be repeated for each sample, and multiple reports, which list expression values for single genes or describe positional frequencies of read density, must be combined in some fashion to create new views that summarize or compute the differences and similarities between datasets. In the case of gene expression, for example, volcano plots can be used to compare the observed changes in gene expression with the likelihood that those changes are statistically significant. For a given gene, one might also want to drill into details that show how the reads align to the gene’s reference sequence. Further, the alignments for that gene from different samples need to be compared to see whether there is evidence of alternative splicing or other interesting features.
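For the volcano-plot view mentioned above, a minimal example would combine each gene's log2 fold change with the significance of that change. The handful of values below are invented purely to show the layout.

# Toy volcano plot: log2 fold change vs. -log10 p-value per gene.

import math
import matplotlib.pyplot as plt

# gene -> (log2 fold change, p-value); values are invented for illustration
genes = {"geneA": (2.5, 0.0005), "geneB": (-1.8, 0.02), "geneC": (0.1, 0.9)}

x = [fc for fc, p in genes.values()]
y = [-math.log10(p) for fc, p in genes.values()]

plt.scatter(x, y)
plt.xlabel("log2 fold change")
plt.ylabel("-log10 p-value")
plt.title("Volcano plot (toy data)")
plt.show()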

Creating software that lets one view NGS results in single- and multi-sample contexts, drill into multiple levels of detail, and work quickly and smoothly, and that makes it possible for IT administrators and PIs to predict development and research costs, requires storing raw data and corresponding analysis results in structures that support the computational needs of the problems being addressed. To accomplish this goal, we can either develop a brand new infrastructure to support our technical requirements and build new software to support our applications, or we can build software on an existing infrastructure and benefit from the experience gained solving similar problems in other scientific fields.

Geospiza is following the latter path and is using an open-source technology, HDF5 (hierarchical data format), to develop highly scalable bioinformatics applications. Moreover, as we have examined past practices and considered our present and future challenges, we have concluded that technologies like HDF5 have great benefits for the bioinformatics field. Toward this goal, Geospiza has initiated a research program collaborating with The HDF Group to develop extensions to HDF5 that meet specific requirements for genetic analysis.

HDF5 Advantages

Introducing a new technology into an infrastructure requires work. Existing tools need to be refactored and new development practices must be learned. Switching to a new technology has direct development costs associated with refactoring tools and learning new environments, as well as a time lag between learning the system and producing new features. Justifying such a commitment demands a return on investment; the new technology must offer clear advantages over current practices, such as improved system performance and/or new capabilities that are not easily possible with existing approaches. HDF5 offers both.

With HDF5 technology, we will be able to create better performing NGS data storage and high performance data processing systems, and approach data analysis problems differently.

We'll consider system performance first. Current NGS systems store reads and associated data primarily in text-based flat files. The vast majority of alignment programs also store data in text-based flat files, creating the myriad of challenges described earlier. When these data are, instead, stored in HDF5, a number of improvements can be achieved. Because the HDF5 software library and file format can store data as compressed “chunks,” we can reduce storage requirements and access subsets of data more efficiently. For example, read data can be stored in arrays, making it possible to quickly compute values like nucleotide frequency statistics for each base position in the reads across an entire multimillion-read dataset.
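As a minimal illustration (using h5py and numpy, with a toy dataset rather than the multimillion-read datasets discussed below), fixed-length reads can be stored as a compressed, chunked two-dimensional array of bytes and per-position nucleotide frequencies computed directly from that array. The file and dataset names are placeholders, not BioHDF's actual layout.

# Sketch: store fixed-length reads as a compressed 2D array in HDF5 and
# compute per-position base frequencies across the whole dataset.

import numpy as np
import h5py

reads = np.array([list(s) for s in ["ACGT", "ACGG", "TCGT"]], dtype="S1")

with h5py.File("reads.h5", "w") as f:
    f.create_dataset("reads", data=reads, chunks=True, compression="gzip")

with h5py.File("reads.h5", "r") as f:
    data = f["reads"][...]
    for base in (b"A", b"C", b"G", b"T"):
        # fraction of reads carrying this base at each position
        print(base.decode(), (data == base).mean(axis=0))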

In the example presented, 9.3 million Illumina GA reads were stored in HDF5 as a compressed two-dimensional array, resulting in a fourfold reduction in size compared to the original FASTA-formatted file. When the reads were aligned to a human genome reference, the flat file system grew from 609 MB to 1033 MB. The HDF5-based system increased in size by 230 MB, to a total of 374 MB for all data and indices combined. In this simple example, the storage benefits of HDF5 are clear.

We can also demonstrate the benefits of improving the efficiency of accessing data. A common bioinformatics scenario is to align a set of sequences (queries) to a set of reference sequences (subjects) and then examine how the query sequences compare to a subject sequence within a specific range. Software routines accomplish this operation by taking the name (or ID) of a subject sequence along with the beginning and ending positions of the desired range(s). This information is used first to search the set of alignments for the query sequences whose alignment positions fall within that range, and then to retrieve the matching reads from the dataset of query sequences. When the data are stored in a non-indexed flat file, the entire file must be read to find the matching sequences; this takes, on average, half the time needed to read the entire file. In contrast, indexed data can be accessed in a significantly shorter time. The savings derive from two features: 1. a smaller amount of data needs to be read to conduct the search, and 2. structured indices make searches more efficient.
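As a sketch of this idea (not the BioHDF implementation), the example below stores alignment start positions for one reference, sorted by position, in HDF5 and uses a binary search to pull only the records in a requested range. The file layout, dataset names, and data are invented, and a production system would avoid loading the whole position column into memory before searching.

# Sketch: indexed range query over sorted alignment positions in HDF5.

import numpy as np
import h5py

# Invented data: one million alignment start positions on "chr5", sorted
starts = np.sort(np.random.randint(0, 180000000, size=1000000))
read_ids = np.arange(starts.size)

with h5py.File("alignments.h5", "w") as f:
    grp = f.create_group("chr5")
    grp.create_dataset("start", data=starts, compression="gzip")
    grp.create_dataset("read_id", data=read_ids, compression="gzip")

def reads_in_range(path, chrom, begin, end):
    """Return the read IDs whose alignments start within [begin, end]."""
    with h5py.File(path, "r") as f:
        positions = f[chrom]["start"][...]        # sketch only: loads the column
        lo = np.searchsorted(positions, begin, side="left")
        hi = np.searchsorted(positions, end, side="right")
        return f[chrom]["read_id"][lo:hi]         # slice only the matching records

print(len(reads_in_range("alignments.h5", "chr5", 1000000, 1050000)))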

In our example, the 9.3 million reads produced many millions of alignments when the data were compared to the human genome. We tested the performance of retrieving read alignment information from different kinds of file systems by accessing the alignments from successively smaller regions of chromosome 5. The entire chromosome contained roughly one million alignments. Retrieving the reads from the entire chromosome was slightly more efficient in HDF5 than retrieving the same data from the flat file system. However, as fewer reads were retrieved from smaller regions, the HDF5-based system demonstrated significantly better performance. For HDF5, the time to retrieve reads decreased as a function of the amount of data being retrieved, down to 15 ms, the minimum overhead of running the program that accesses the HDF5 file. Compared to the minimum access time for the flat file (735 ms), this is a roughly 50-fold improvement. As datasets continue to grow, the overhead for the HDF5 system will remain at 15 ms, whereas the overhead for flat file systems will continue to increase.

The demonstrated performance advantages are not unique to HDF5. Similar results can be achieved by creating a data model to store the reads and alignments and implementing the model in a binary file format with indices that allow random access to the stored data. A significant advantage of HDF5 is that the software to implement the data models, build multiple kinds of indices, compress data in chunks, and read and write the data to and from the file has already been built, debugged, and refined through more than 20 years of development. Hence, one of the most significant performance advantages of using the HDF platform is the savings in development time. Reproducing a similar, perhaps more specialized, system would require many months (even years) to develop, test, document, and refine the low-level software needed to make it well-performing, highly scalable, and broadly usable. In our experience with HDF5, we’ve been able to learn the system, implement our data models, and develop the application code in a matter of weeks.

Consequently, we spend more of our time solving the interesting challenges associated with analyzing millions of NGS reads from hundreds or thousands of samples to measure gene expression, identify alternatively spliced and small RNAs, study regulation, calculate sequence variation, and link summarized data to its underlying details, and a much smaller fraction of our time optimizing low-level infrastructures.

Additional examples of how HDF5 is changing our thinking will be presented next.

Thursday, June 11, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part II: Background

Next Generation DNA Sequencing (NGS) technologies will continue to produce ever increasing amounts of complex data. In the old world of electrophoresis-based (Sanger) sequencing, we could effectively manage projects and data growth by relying largely on improvements in computer hardware. NGS breaks this paradigm; we now must think more deeply about software and hardware systems as a whole solution. Bringing technologies like HDF5 into the bioinformatics world provides a path to such solutions. In this installment of the bloginar (started with the previous post), the power and challenges associated with NGS data analysis are discussed.

The Power of Next Generation DNA Sequencing

Over the past couple of years, hundreds of journal articles demonstrating the power of NGS have been published. These articles have been accompanied by numerous press reports communicating the community’s excitement, and selected quotes from them convey the transformational nature of these technologies. Experiments that, only a short time ago, required a genome center-styled approach can now be done by a single lab.

Transcriptome profiling and digital gene expression are two examples that illustrate the power of NGS. For a few thousand dollars, a single run of a SOLiD or Illumina sequencer can produce more ESTs (Expressed Sequence Tags) than exist in all of dbEST. Even more striking, the data in dbEST were collected over many years, and hundreds of millions of dollars were invested in obtaining the sequences. It is also important to understand that the data in dbEST represent only a relatively small number of molecules from a broad range of organisms and tissues. While dbEST has been useful for providing a glimpse into the complex nature of gene expression and, in the case of human, mouse, and a few other organisms, for mapping a reasonable number of genes, the resource cannot provide the deeper details and insights into alternative splicing, antisense transcription, small RNA expression, and allelic variation that can only be achieved when a cell’s complement of RNA is sampled deeply. As recent literature demonstrates, each time an NGS experiment is conducted, we seem to get a new view of the complexity of gene expression and how genomes are transcribed. A similar story repeats when NGS is compared to other gene expression measurement systems like microarrays.

Of course, with the excitement about NGS come frightening reports about how to handle the data. Just as one can find great quotes in the news and scientific literature about the power of NGS, one can find similar quotes describing the challenges associated with the increase in data production. Metaphors of fire hoses spraying data often accompany graphs showing exponential growth in data production and accumulation. When these growth curves are compared to Moore’s law, we learn that data are accumulating faster than hardware performance is improving. So, those who plan to wait for data management and analysis problems to be solved by better computers will have to wait a long time.

Analyzing data is harder than it looks

A significant portion of the NGS data problem discussion focuses on how to store the large amounts of data. While that is a major concern, there are even greater challenges in analyzing the data; this is often referred to as the “bioinformatics bottleneck.” To understand the reason for the bottleneck, we need to understand NGS data analysis. It is common to divide the high-level NGS analysis workflow, which converts images into bases and bases into results, into three phases. In the first phase (primary analysis), millions of individual images are analyzed to create files containing millions of sequence reads. The resulting reads are then converted to biologically meaningful information, either by aligning the reads to reference data or by assembling the reads into contigs (de novo sequencing). This second phase (secondary analysis) is application specific; each kind of sequencing experiment or assay requires a specific workflow in which reads are compared to different reference sources in multi-step, iterative processes. The third phase (tertiary analysis) consists of analyzing multiple datasets of information derived from the second phase in order to annotate genomes, measure gene expression, observe linked variants, or uncover epigenetic patterns.

The bioinformatics bottleneck stems primarily from the complexity of secondary analysis. Secondary analyses use several different alignment programs to compare reads to reference data. Each alignment program produces multiple output files, each containing a subset of information that must be interpreted to understand what the data mean. The output files are large in their own right. A few million reads, for example, can produce many tables of alignment data consisting of many millions of rows. It is not possible to get meaning out of such files without additional software to “read” the data. And, no two alignment programs produce data in a common format. To summarize, secondary analysis utilizes multiple programs, each program produces a varied number of files with data in different formats, and the files are too large and complex to be understood by reading the data by eye. More software is needed.
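As a small illustration of the format problem, the sketch below normalizes two invented tab-delimited alignment layouts into a single record type; real aligner outputs differ in many more ways than this.

# Sketch: two hypothetical aligner formats, one normalizing step.

from collections import namedtuple

Alignment = namedtuple("Alignment", "read_id reference position strand")

def parse_format_a(line):
    # hypothetical layout: read_id<TAB>reference<TAB>position<TAB>strand
    read_id, ref, pos, strand = line.rstrip("\n").split("\t")
    return Alignment(read_id, ref, int(pos), strand)

def parse_format_b(line):
    # hypothetical layout: reference:position<TAB>strand<TAB>read_id
    loc, strand, read_id = line.rstrip("\n").split("\t")
    ref, pos = loc.split(":")
    return Alignment(read_id, ref, int(pos), strand)

for parser, line in [(parse_format_a, "read1\tchr5\t100234\t+"),
                     (parse_format_b, "chr5:100234\t+\tread1")]:
    print(parser(line))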

From a computational and software engineering standpoint, there is room for improvement. When the common secondary analysis processes are examined further, we see that the many large, hard-to-read files are only the tip of the iceberg. The lack of well-organized data models and infrastructure for storing structured information results in significant levels of data processing and data copying redundancy, because common information must be represented in different contexts. The problem is that, with data throughput scaling faster than Moore’s law, maintaining systems with this kind of complexity and redundancy becomes unpredictably costly in terms of system performance, storage needs, and software development time. Ultimately, it limits what we can do with the data and the questions that can be asked.

The next post discusses how this problem can be addressed.

Sunday, June 7, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part I: Introduction

At the end of May, the DOE's Los Alamos National Laboratory hosted its 4th annual Sequencing, Finishing and Analysis meeting in Santa Fe, New Mexico. We participated in the conference by presenting our work using HDF5 to develop scalable software for Next Generation DNA Sequencing (NGS) analysis.

Over the next few posts I will share the slides from the presentation. This post begins with the abstract.

Abstract


“If the data problem is not addressed, ABI’s SOLiD, 454’s GS FLX, Illumina’s GAII or any of the other deep sequencing platforms will be destined to sit in their air-conditioned rooms like a Stradivarius without a bow” was the closing statement of the lead Nature Biotechnology editorial “Prepare for the deluge” (Oct. 2008). The oft-stated challenges focus on the obvious problems of storing and analyzing data. However, the problems are much deeper than these short descriptions portray. True, researchers are ill-prepared to confront the challenges of inadequate IT infrastructures, but a greater challenge is the lack of easy-to-use, well-performing software systems and interfaces that would allow researchers to work with data in multiple ways to summarize information and drill down into supporting details.

Meeting the above challenge requires well-performing software frameworks and underlying data management tools that store and organize data in better ways than complex mixtures of flat files and relational databases. Geospiza and The HDF Group are collaborating to develop open-source, portable, scalable bioinformatics technologies based on HDF5 (Hierarchical Data Format – http://www.hdfgroup.org). We call these extensible, domain-specific data technologies “BioHDF.” BioHDF will implement a data model that supports primary DNA sequence information (reads, quality values, meta data) and the results from sequence alignment and variation detection algorithms. BioHDF will extend HDF5 data structures and library routines with new features (indexes, additional compression, graph layouts) to support the high-performance data storage and computation requirements of Next Gen Sequencing.
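As a rough sketch of what such a data model might look like in HDF5 (illustrative only, not the actual BioHDF specification), reads, quality values, and alignment results can be organized into groups and compressed datasets:

# Sketch of a hypothetical HDF5 layout for reads, qualities, and alignments.

import numpy as np
import h5py

with h5py.File("biohdf_sketch.h5", "w") as f:
    reads = f.create_group("reads")
    reads.create_dataset("sequence",
                         data=np.array([list("ACGT"), list("TTGA")], dtype="S1"),
                         compression="gzip")
    reads.create_dataset("quality",
                         data=np.array([[30, 30, 28, 25], [32, 31, 20, 18]],
                                       dtype=np.uint8),
                         compression="gzip")
    reads.attrs["instrument"] = "Illumina GA"   # example metadata attribute

    aln = f.create_group("alignments/human_reference")
    aln.create_dataset("read_index", data=np.array([0, 1], dtype=np.uint32))
    aln.create_dataset("position", data=np.array([100234, 55012], dtype=np.uint32))
    aln.create_dataset("mismatches", data=np.array([0, 1], dtype=np.uint8))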

For close to 20 years, HDF data formats and software infrastructure have been used to manage and access high volume complex data in hundreds of applications, from flight testing to global climate research. The BioHDF effort is leveraging these strengths. We will show data from small RNA and gene expression analyses that demonstrate HDF5’s value for reducing the space, time, bandwidth, and development costs associated with working with Next Gen Sequence data.

The next posts will cover background on the power and challenges of NGS data analysis and our initial results using HDF5.

Monday, June 1, 2009

Sneak Peek: Gene Expression Profiling in Cancer Research: from Raw Data to Biological Significance

Join us this Wednesday at 10:00 am Pacific time for a webinar in which we present a comparative analysis of microarray and next generation DNA sequencing using data sets from skin cancer gene expression studies.

Abstract

Current gene expression technologies allow biomedical researchers to examine the expression of tens of thousands of genes at once, giving them the opportunity to examine expression across an entire genome where previously they could only look at a handful of genes at a time.

This presentation will provide an overview of the gene expression data analysis process with emphasis on comparison statistics, correction for multiple testing and determining the biological significance of the results. Using data drawn from the GEO data repository, gene expression in skin cancers will be examined using both Affymetrix GeneChip and Illumina GA Tag Profiling data.
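As a minimal illustration of the comparison statistics and multiple-testing correction mentioned above, the sketch below runs a per-gene t-test between two groups and applies a Benjamini-Hochberg false discovery rate adjustment. The data are random numbers, and this is not GeneSifter's actual algorithm.

# Per-gene t-test plus Benjamini-Hochberg FDR adjustment on toy data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(8.0, 1.0, size=(1000, 4))      # 1000 genes x 4 samples
treated = rng.normal(8.2, 1.0, size=(1000, 4))

_, pvals = stats.ttest_ind(control, treated, axis=1)

# Benjamini-Hochberg: sort p-values, scale by m/rank, enforce monotonicity
order = np.argsort(pvals)
ranked = pvals[order] * len(pvals) / (np.arange(len(pvals)) + 1)
adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
fdr = np.empty_like(adjusted)
fdr[order] = np.clip(adjusted, 0, 1)

print("genes with FDR < 0.05:", int((fdr < 0.05).sum()))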

To register, visit the GeneSifter meeting center.