Tuesday, April 13, 2010

Bloginar: Standardizing Bioinformatics with BioHDF (HDF5)

Yesterday we (The HDF Group and Geospiza) released the BioHDF prototype software.  To mark the occasion, and demonstrate some of BioHDF’s capabilities and advantages, I share the poster we presented at this year’s AGBT (Advances in Genome Biology and Technology) conference.

The following map guides the presentation. The poster has a title and four main sections, which cover background information, specific aspects of the general Next Generation Sequencing (NGS) workflow, and HDF5’s advantages for working with large amounts of NGS data.
 
Section 1.  The first section introduces HDF5 (Hierarchical Data Format) as a software platform for working with scientific data.  The introduction begins with the abstract and lists five specific challenges created by NGS: 1) high-end computing infrastructures are needed to work with NGS data, 2) NGS data analysis involves complex multi-step processes, 3) these processes compare NGS data to multiple reference sequence databases, 4) the resulting datasets of alignments must be visualized in multiple ways, and 5) scientific knowledge is gained when many datasets are compared.

Next, choices for managing NGS data are compared in a four-category table.  These include text and binary formats. While text formats (delimited and XML) have been popular for bioinformatics, they do not scale well, and binary formats are gaining popularity. The current bioinformatics binary formats are listed (bottom left) along with a description of their limitations.

The introduction closes with a description of HDF5 and its advantages for supporting NGS data management and analysis. Specifically, HDF5 is a platform for managing scientific data. Such data are typically complex and consist of images, large multi-dimensional arrays, and metadata. HDF5 has been used for over 20 years in other data-intensive fields; it is robust, portable, and tuned for high performance computing. Thus HDF5 is well suited for NGS. Indeed, groups ranging from academic researchers to NGS instrument vendors and software companies are recognizing the value of HDF5.

Section 2. This section illustrates how HDF5 facilitates primary data analysis. First we are reminded that NGS data are analyzed in three phases: primary analysis, secondary analysis, and tertiary analysis. Primary analysis is the step that converts images to reads consisting of basecalls (or colors, or flowgrams) and quality values. In secondary analysis, reads are aligned to reference data (mapped) or amongst themselves (assembled). In many NGS assays, secondary analysis produces tables of alignments that must be compared to one another, in tertiary analysis, to gain scientific insights.

The remaining portion of section 2 shows how Illumina GA and SOLiD primary data (reads and quality values) can be stored in BioHDF and later reviewed using the BioHDF tools and scripts.  The resulting quality graphs are organized into three groups (left to right) to show base composition plots, quality value (QV) distribution graphs, and other summaries.

Base composition plots show the count of each base (or color) that occurs at a given position in the read. These plots are used to assess overall randomness of a library and observe systematic nucleotide incorporation errors or biases.
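
The counting behind a base composition plot can be sketched in a few lines of Python. This is only an illustration, not the BioHDF tools themselves: the function name `base_composition` and the toy reads are assumptions, and the reads are held in a plain list rather than read from a BioHDF file.

```python
from collections import Counter

def base_composition(reads):
    """Return one Counter per read position, counting each base (or color) seen there."""
    if not reads:
        return []
    longest = max(len(read) for read in reads)
    counts = [Counter() for _ in range(longest)]
    for read in reads:
        for position, base in enumerate(read):
            counts[position][base] += 1
    return counts

# Toy reads (hypothetical data, for illustration only)
reads = ["ACGT", "ACGA", "TCGT", "ACNT"]
composition = base_composition(reads)
for position, counter in enumerate(composition, start=1):
    print(position, dict(sorted(counter.items())))
```

Plotting each base's count against position gives the curves described above; in a random library the curves stay roughly flat, while a systematic bias shows up as a spike or dip at particular positions.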

Quality value plots show the distribution of QVs at each base position within the ensemble of reads. As each NGS run produces many millions of reads, it is worthwhile summarizing QVs in multiple ways. The first plots, from the top, show the average QV per base with error bars indicating QVs that are within one standard deviation of the mean. Next, box and whisker plots show the overall quality distribution (median, lower and upper quartile, minimum and maximum values) at each position. These plots are followed by “error” plots which show the total count of QVs below certain thresholds (red, QV < 10; green, QV < 20; blue, QV < 30). The final two sets of plots show the number of QVs at each position for all observed values and the number of bases having each quality value.
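
The per-position summaries behind these plots can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `qv_summary`, the use of the population standard deviation, the "inclusive" quartile method, and the toy quality values are all my choices, not details taken from the BioHDF scripts. It also assumes all reads have the same length.

```python
import statistics

def qv_summary(quality_rows, thresholds=(10, 20, 30)):
    """Summarize quality values per base position across equal-length reads."""
    summaries = []
    for column in zip(*quality_rows):  # one column per base position
        q1, median, q3 = statistics.quantiles(column, n=4, method="inclusive")
        summaries.append({
            "mean": statistics.mean(column),     # center of the error-bar plot
            "stdev": statistics.pstdev(column),  # half-length of the error bar
            "min": min(column),                  # box-and-whisker values
            "q1": q1,
            "median": median,
            "q3": q3,
            "max": max(column),
            # "error" plot: number of QVs below each threshold
            "below": {t: sum(1 for q in column if q < t) for t in thresholds},
        })
    return summaries

# Four toy reads of length three (hypothetical quality values)
quality_rows = [[30, 25, 8], [28, 22, 15], [32, 18, 5], [30, 35, 12]]
summaries = qv_summary(quality_rows)
for position, summary in enumerate(summaries, start=1):
    print(position, summary["mean"], summary["below"])
```

With millions of reads one would compute these values in a streaming or chunked fashion rather than holding every row in memory, but the quantities plotted are the same.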

The final group of plots shows overall dataset complexity, GC content (base space only), average QV/read, and %GC vs average QV (base space only).  Dataset complexity is computed by determining the number of times a given read exactly matches other reads in the dataset. In some experiments, too many identical reads indicate a problem like PCR bias. In other cases, like tag profiling, many identical reads are expected from highly expressed genes. Errors in the data can artificially increase complexity.
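
As a rough sketch of the complexity and GC calculations described above (the function names and toy reads are hypothetical, and exact string matching stands in for whatever the BioHDF scripts actually do):

```python
from collections import Counter

def duplication_histogram(reads):
    """Map 'times a sequence occurs' to 'number of distinct sequences with that count'."""
    occurrences = Counter(reads)           # exact-match count per sequence
    return Counter(occurrences.values())   # histogram of those counts

def gc_fraction(read):
    """Fraction of G and C bases in a base-space read (0.0 for an empty read)."""
    if not read:
        return 0.0
    return sum(1 for base in read.upper() if base in "GC") / len(read)

# Toy dataset (hypothetical): one sequence seen three times, two seen once
reads = ["ACGT", "ACGT", "GGCC", "ACGT", "TTAA"]
histogram = duplication_histogram(reads)
print(dict(histogram))
print([gc_fraction(read) for read in reads])
```

A single sequencing error makes a read distinct from its true duplicates, which is why errors inflate the apparent number of unique reads.
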
Section 3.  Primary data analysis gives us a picture of how well the samples were prepared or how well the instrument ran, with some indication of sample quality. Secondary and tertiary analyses tell us more about sample quality and, more importantly, provide biological insights. The third section focuses on secondary and tertiary analysis and begins with a brief cartoon showing a high-level data analysis workflow using BioHDF to store primary data, alignment results, and annotations. BioHDF tools are used to query these data, and other software within GeneSifter is used to compare data between samples and display the data in interactive reports to examine the details from single or multiple samples.

The left side of this section illustrates what is possible with single samples. Beginning with a simple table that indicates how many reads align to each reference sequence, we can drill into multiple reports that provide increasing detail about the alignments. For example, the gene list report (second from top) uses gene model annotations to summarize the alignments for all genes identified in the dataset. Each gene is displayed as a thumbnail graphic that can be clicked to see greater detail, which is shown in the third plot. The Integrated Gene View not only shows the density of reads across the gene's genomic region, but also shows evidence of splice junctions, and identified single base differences (SNVs) and small insertions and deletions (indels). Navigation controls provide ways to zoom into and out of the current view of data, and move to new locations. Additionally, when possible, the read density plot is accompanied by an Entrez gene model and dbSNP data so that data can be observed in a context of known information. Tables that describe the observed variants follow. Clicking on a variant drills into the alignment viewer to show the reads encompassing the point of variation.

The right side illustrates multi-sample analysis in GeneSifter. In assays like RNA-Seq, alignment tables are converted to gene expression values that can be compared between samples. Volcano (top) and other plots are used to visualize the differences between the datasets. Since each point in the volcano plot represents the difference in expression for a gene between two samples (or conditions), we can click on that point to view the expression details for that gene (middle) in the different samples. In the case of RNA-Seq, we can also obtain expression values for the individual exons within the gene, making it possible to observe differential exon levels in conjunction with overall gene expression levels (middle). Clicking the appropriate link in the exon expression bar graph takes us to the alignment details for the samples being analyzed (bottom); in this example we have two case and two control replicates. As in the single sample Integrated Gene Views, annotations are displayed with alignment data. When navigation buttons are clicked, all of the displayed genes move together so that you can explore the gene's details and surrounding neighborhood for multiple samples in a comparative fashion.
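
For readers unfamiliar with volcano plots, each point is conventionally placed at the log2 fold change (x) and the negative log10 of a p-value (y). The sketch below assumes that convention; the function name, the pseudocount, and the example numbers are hypothetical, and it says nothing about how GeneSifter computes expression values or p-values.

```python
import math

def volcano_point(expr_a, expr_b, p_value, pseudocount=1.0):
    """Return (log2 fold change, -log10 p-value) for one gene.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a gene has no expression in one sample.
    """
    log2_fold_change = math.log2((expr_b + pseudocount) / (expr_a + pseudocount))
    return log2_fold_change, -math.log10(p_value)

# Hypothetical gene: expression 3 in sample A, 15 in sample B, p = 0.001
x, y = volcano_point(3.0, 15.0, 0.001)
print(x, y)
```

Genes far to the left or right and high on the plot are the ones that differ most, and most significantly, between the two conditions.
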
Section 4.  The poster closes with details about BioHDF.  First, the data model is described. An advantage of the BioHDF model is that read data are organized non-redundantly. Other formats, like BAM, tend to store reads with alignments, so if a read has multiple alignments in a genome, or is aligned to multiple reference sequences, it gets stored multiple times. This may seem trivial, but anything that can happen a million times becomes noticeable. This fact is demonstrated in the table in the second panel, “High Performance Computing Advantages.”  Other HDF5 advantages are listed below the performance stats table.  Most notable is HDF5’s ability to easily support multiple indexing schemes like nested containment lists (NClists). NClists solve the problem of efficiently accessing reads from alignments that may be contained in other alignments, which I will save for a later post.
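
The storage argument can be made concrete with a toy comparison. This sketch is not the BioHDF schema or the BAM layout; the tuple layout, names, and data are hypothetical and only contrast "store each read once and point to it" with "copy the read into every alignment record."

```python
# Non-redundant layout: each read is stored once; alignments refer to it by index.
reads = ["ACGTACGT", "TTGACCAA"]
alignments = [
    # (read_index, reference_name, start_position) -- hypothetical records
    (0, "chr1", 100),
    (0, "chr2", 5000),
    (0, "chr7", 42),
    (1, "chr1", 250),
]

def bases_stored_once(reads):
    """Bases stored when every read is kept exactly once."""
    return sum(len(read) for read in reads)

def bases_stored_per_alignment(reads, alignments):
    """Bases stored when each alignment record carries its own copy of the read."""
    return sum(len(reads[read_index]) for read_index, _, _ in alignments)

print(bases_stored_once(reads))                       # 16
print(bases_stored_per_alignment(reads, alignments))  # 32
```

In this toy case the per-alignment layout doubles the stored sequence; with millions of multiply-aligned reads, plus a quality value for every base, the difference grows accordingly.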

Finally, the poster is summarized with a number of take-home points. These reiterate that NGS is driving the need to use binary file formats to manage NGS data and analysis results, and that HDF5 provides an attractive solution because of its long history and development efforts that specifically target scientific programming requirements. In our hands, HDF5 has helped make GeneSifter a highly scalable and interactive web application with less development effort than would have been needed to implement other technologies.

If you are a software developer and are interested in BioHDF, please visit www.biohdf.org.  If you do not want to program and instead want a way to easily analyze your NGS data to make new discoveries, please contact us.
