
Tuesday, April 13, 2010

Bloginar: Standardizing Bioinformatics with BioHDF (HDF5)

Yesterday we (The HDF Group and Geospiza) released the BioHDF prototype software.  To mark the occasion, and to demonstrate some of BioHDF's capabilities and advantages, I am sharing the poster we presented at this year's AGBT (Advances in Genome Biology and Technology) conference.

The following map guides the presentation. The poster has a title and four main sections, which cover background information, specific aspects of the general Next Generation Sequencing (NGS) workflow, and HDF5’s advantages for working with large amounts of NGS data.
 
Section 1.  The first section introduces HDF5 (Hierarchical Data Format) as a software platform for working with scientific data.  The introduction begins with the abstract and lists five specific challenges created by NGS: 1) high-end computing infrastructures are needed to work with NGS data, 2) NGS data analysis involves complex, multi-step processes, 3) NGS data must be compared to multiple reference sequence databases, 4) the resulting datasets of alignments must be visualized in multiple ways, and 5) scientific knowledge is gained when many datasets are compared.

Next, choices for managing NGS data are compared in a four-category table.  These include text and binary formats. While text formats (delimited and XML) have been popular in bioinformatics, they do not scale well, and binary formats are gaining in popularity. The current bioinformatics binary formats are listed (bottom left) along with a description of their limitations.

The introduction closes with a description of HDF5 and its advantages for supporting NGS data management and analysis. Specifically, HDF5 is a platform for managing scientific data. Such data are typically complex and consist of images, large multi-dimensional arrays, and metadata. HDF5 has been used for over 20 years in other data-intensive fields; it is robust, portable, and tuned for high-performance computing. Thus HDF5 is well suited for NGS. Indeed, groups ranging from academic researchers to NGS instrument vendors and software companies are recognizing the value of HDF5.
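To make the data model concrete, here is a minimal sketch of HDF5's hierarchical organization using the h5py Python library. The file, group, and dataset names are illustrative only, not BioHDF's layout:

```python
# A minimal h5py sketch of HDF5's hierarchical model: groups nest like
# directories, datasets hold typed arrays, and attributes attach metadata.
# Names here are illustrative, not BioHDF's actual schema.
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    run = f.create_group("runs/run_001")      # intermediate groups are created
    run.attrs["instrument"] = "Illumina GA"   # metadata travels with the data
    run.create_dataset(
        "reads",
        data=np.array([b"ACGTACGTACGT", b"TTGCATGCAAGG"]),
        compression="gzip",                   # transparent chunked compression
    )
    run.create_dataset(
        "quality",
        data=np.array([[30, 31, 28, 25], [33, 30, 27, 20]], dtype=np.uint8),
    )
```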
Section 2. This section illustrates how HDF5 facilitates primary data analysis. First we are reminded that NGS data are analyzed in three phases: primary analysis, secondary analysis, and tertiary analysis. Primary analysis is the step that converts images to reads consisting of basecalls (or colors, or flowgrams) and quality values. In secondary analysis, reads are aligned to reference data (mapped) or amongst themselves (assembled). In many NGS assays, secondary analysis produces tables of alignments that must be compared to one another, in tertiary analysis, to gain scientific insights.

The remaining portion of section 2 shows how Illumina GA and SOLiD primary data (reads and quality values) can be stored in BioHDF and later reviewed using the BioHDF tools and scripts.  The resulting quality graphs are organized into three groups (left to right) to show base composition plots, quality value (QV) distribution graphs, and other summaries.

Base composition plots show the count of each base (or color) that occurs at a given position in the read. These plots are used to assess the overall randomness of a library and to observe systematic nucleotide incorporation errors or biases.
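The counting behind these plots is straightforward. A minimal Python sketch, assuming reads are plain base strings:

```python
from collections import Counter

def base_composition(reads):
    """Count each base (A, C, G, T, N) at every read position."""
    length = max(len(r) for r in reads)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for i, base in enumerate(read):
            counts[i][base] += 1
    return counts

# base_composition(["ACGT", "ACGA"])[3] -> Counter({"T": 1, "A": 1})
```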

Quality value plots show the distribution of QVs at each base position within the ensemble of reads. As each NGS run produces many millions of reads, it is worthwhile to summarize QVs in multiple ways. The first plots, from the top, show the average QV per base with error bars indicating QVs within one standard deviation of the mean. Next, box-and-whisker plots show the overall quality distribution (median, lower and upper quartiles, minimum and maximum values) at each position. These plots are followed by “error” plots, which show the total count of QVs below certain thresholds (red, QV < 10; green, QV < 20; blue, QV < 30). The final two sets of plots show the number of QVs at each position for all observed values and the number of bases having each quality value.
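A sketch of how such per-position summaries can be computed, assuming the QVs have already been decoded into a reads-by-positions array:

```python
import numpy as np

def qv_summaries(qvs):
    """Per-position QV mean, standard deviation, and sub-threshold counts.

    qvs: 2-D array with one row per read and one column per base position.
    """
    qvs = np.asarray(qvs)
    mean = qvs.mean(axis=0)
    std = qvs.std(axis=0)
    # Counts behind the "error" plots: bases below QV 10, 20, and 30.
    below = {t: (qvs < t).sum(axis=0) for t in (10, 20, 30)}
    return mean, std, below
```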

The final group of plots shows overall dataset complexity, GC content (base space only), average QV/read, and %GC vs. average QV (base space only).  Dataset complexity is computed by determining the number of times a given read exactly matches other reads in the dataset. In some experiments, too many identical reads indicate a problem such as PCR bias. In other cases, like tag profiling, many identical reads are expected from highly expressed genes. Errors in the data can artificially inflate the apparent complexity.
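A minimal sketch of that exact-match counting, treating reads as plain strings:

```python
from collections import Counter

def duplication_profile(reads):
    """Map copy number -> how many distinct reads occur that many times."""
    return dict(sorted(Counter(Counter(reads).values()).items()))

# A profile dominated by copy number 1 suggests a complex library; a heavy
# tail of high copy numbers suggests PCR bias (or, in tag profiling,
# highly expressed genes).
```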
Section 3.  Primary data analysis gives us a picture of how well the samples were prepared and how well the instrument ran, with some indication of sample quality. Secondary and tertiary analysis tell us more about sample quality and, more importantly, provide biological insights. The third section focuses on secondary and tertiary analysis and begins with a brief cartoon showing a high-level data analysis workflow that uses BioHDF to store primary data, alignment results, and annotations. BioHDF tools are used to query these data, and other software within GeneSifter is used to compare data between samples and display them in interactive reports that examine the details from single or multiple samples.

The left side of this section illustrates what is possible with single samples. Beginning with a simple table that indicates how many reads align to each reference sequence, we can drill into multiple reports that provide increasing detail about the alignments. For example, the gene list report (second from top) uses gene model annotations to summarize the alignments for all genes identified in the dataset. Each gene is displayed as a thumbnail graphic that can be clicked to see greater detail, which is shown in the third plot. The Integrated Gene View not only shows the density of reads across the gene's genomic region, but also shows evidence of splice junctions, single-base differences (SNVs), and small insertions and deletions (indels). Navigation controls provide ways to zoom into and out of the current view of the data and to move to new locations. Additionally, when possible, the read density plot is accompanied by an Entrez gene model and dbSNP data so that the data can be observed in the context of known information. Tables that describe the observed variants follow. Clicking on a variant drills into the alignment viewer to show the reads encompassing the point of variation.

The right side illustrates multi-sample analysis in GeneSifter. In assays like RNA-Seq, alignment tables are converted to gene expression values that can be compared between samples. Volcano (top) and other plots are used to visualize the differences between the datasets. Since each point in the volcano plot represents the difference in expression for a gene between two samples (or conditions), we can click on that point to view the expression details for that gene in the different samples (middle). In the case of RNA-Seq, we can also obtain expression values for the individual exons within the gene, making it possible to observe differential exon levels in conjunction with overall gene expression levels (middle). Clicking the appropriate link in the exon expression bar graph takes us to the alignment details for the samples being analyzed (bottom); in this example, we have two case and two control replicates. Like the single-sample Integrated Gene Views, annotations are displayed with alignment data. When navigation buttons are clicked, all of the displayed genes move together so that you can explore a gene's details and surrounding neighborhood for multiple samples in a comparative fashion.
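The poster does not show GeneSifter's plotting code, but the construction of a volcano plot itself is simple. A hedged matplotlib sketch, assuming per-gene fold changes and p-values have already been computed:

```python
import numpy as np
import matplotlib.pyplot as plt

def volcano(log2_fold_change, p_values):
    """Plot per-gene effect size against statistical significance."""
    plt.scatter(log2_fold_change, -np.log10(p_values), s=4)
    plt.xlabel("log2 fold change (case vs. control)")
    plt.ylabel("-log10 p-value")
    plt.show()
```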
Section 4.  The poster closes with details about BioHDF.  First, the data model is described. An advantage of the BioHDF model is that read data are organized non-redundantly. Other formats, like BAM, tend to store reads with alignments, so if a read has multiple alignments in a genome, or is aligned to multiple reference sequences, it gets stored multiple times. This may seem trivial, but anything that can happen a million times becomes noticeable. This fact is demonstrated in the table in the second panel, “High Performance Computing Advantages.”  Other HDF5 advantages are listed below the performance stats table.  Most notable is HDF5's ability to easily support multiple indexing schemes, such as nested containment lists (NClists). NClists solve the problem of efficiently accessing reads from alignments that may be contained in other alignments, a topic I will save for a later post.
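The following toy h5py sketch illustrates the non-redundant idea (it is not BioHDF's actual schema): each unique read is stored once, and every alignment refers back to its read by index, so a read with many genomic hits still occupies a single slot.

```python
import h5py
import numpy as np

with h5py.File("alignments.h5", "w") as f:
    # Unique reads stored exactly once.
    f.create_dataset("reads/sequence",
                     data=np.array([b"ACGTACGT", b"TTGCATGC"]))
    # Alignments reference reads by index; read 0 aligns twice but is
    # never duplicated in storage.
    aln = f.create_group("alignments/chr1")
    aln.create_dataset("read_index", data=np.array([0, 0, 1], dtype=np.uint64))
    aln.create_dataset("position", data=np.array([100, 5250, 730], dtype=np.uint64))
```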

Finally, the poster is summarized with a number of take-home points. These reiterate that NGS is driving the need for binary file formats to manage NGS data and analysis results, and that HDF5 provides an attractive solution because of its long history and development efforts that specifically target scientific programming requirements. In our hands, HDF5 has helped make GeneSifter a highly scalable, interactive web application with less development effort than other technologies would have required.

If you are a software developer and are interested in BioHDF, please visit www.biohdf.org.  If you do not want to program and instead want an easy way to analyze your NGS data to make new discoveries, please contact us.

Sunday, December 6, 2009

Expeditiously Exponential: Genome Standards in a New Era

One of the hot topics of 2009 has been the exponential growth of genomics and other data and how this growth will impact data use and sharing. The journal Science explored these issues in its October policy forum. In early November, I discussed the first article, which was devoted to data sharing and data standards. The second article, listed under the category “Genomics,” focuses on how genomic standards need to evolve with new sequencing technologies.

Drafting By

The premise of the article “Genome Project Standards in a New Era of Sequencing” was to begin a conversation about how to define standards for sequence data quality in this new era of ultra-high-throughput DNA sequencing. One of the “easy” things to do with Next Generation Sequencing (NGS) technologies is to create draft genome sequences. A draft genome sequence is defined as a collection of contig sequences that result from one, or a few, assemblies of large numbers of smaller DNA sequences called reads. In traditional Sanger sequencing, a read was between 400 and 800 bases in length and came from a single clone, or sub-clone, of a large DNA fragment. NGS reads come from individual molecules in a DNA library and vary between 36 and 800 bases in length depending on the sequencing platform being used (454, Illumina, SOLiD, or Helicos).

A single NGS run can now produce enough data to create a draft assembly for many kinds of organisms with smaller genomes, such as viruses, bacteria, and fungi. This makes it possible to create many draft genomes quickly and inexpensively. Indeed, the article was accompanied by a figure showing that the growth of draft sequences currently exceeds the growth of finished sequences by a significant amount. If this trend continues, the ratio of draft to finished sequences will grow exponentially into the foreseeable future.

Drafty Standards

The primary purpose of a complete genome sequence is to serve as a reference for other kinds of experiments. A well-annotated reference is accompanied by a catalog of genes and their functions, as well as an ordering of the genes, regulatory regions, and the sequences needed for evolutionary comparisons that further elucidate genomic structure and function. A problem with draft sequences is that they can contain a large number of errors, ranging from incorrect base calls to more problematic mis-assemblies that place bases or groups of bases in the wrong order. Because these errors and holes make some sequences more drafty than others, draft sequences are less able to fulfill their purpose as reference data.

If we can describe the draftiness of a genome sequence, we may be able to weigh its fitness for a specific purpose. The article went on to tackle this problem by recommending a series of qualitative terms that describe levels of draft sequence. Beginning with the Standard Draft, an assembly of contigs of unfiltered data from one or more sequencing platforms, the terms move through High-Quality Draft, Improved High-Quality Draft, Annotation-Directed Improvement, and Noncontiguous Finished, to Finished. Finished sequence is defined as having less than 1 error per 100,000 bases, with each genomic unit (chromosomes or plasmids that are capable of replication) assembled into a single contig with a minimal number of exceptions. The individuals proposing these standards are a well-respected group in the genome community and are working with the database and sequence ontology groups to incorporate these new descriptions into data submissions and annotations for data that may be used by others.

Given the high cost and lengthy time required to finish genomic sequences, finishing every genome to a high standard is impractical. If we are going to work with genomes that are finished to varying degrees, systematic ways to describe the quality of the data are needed. These policy recommendations are a good start, but more needs to be done to make the proposed standards useful.

First, standards need to be quantitative. Qualitative descriptions are less useful because they create downstream challenges when reference data are used in automated data processing and interpretation pipelines. As the number of available genomes grows into the thousands and tens of thousands, subjective standards make the data more and more cumbersome and difficult to review. Moreover, without quantitative assessment, how will one know when a genome has an average error rate of 1 in 100,000 bases? The authors intentionally avoided recommending numeric thresholds in the proposed standards because instrumentation and sequencing methodologies are changing rapidly. This may be true, but future discussions should nevertheless focus on quantitative descriptions for that very reason. It is precisely because data collection methods and instrumentation are changing rapidly that we need measures we can compare. This is the new world.
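One reason quantitative standards are within reach: Phred-style quality values already encode error probabilities (p = 10^(-Q/10)), so an expected error rate can be estimated directly. A sketch under that standard assumption:

```python
def expected_error_rate(quality_values):
    """Estimate expected errors per base from Phred QVs (p = 10**(-Q/10))."""
    expected_errors = sum(10 ** (-q / 10) for q in quality_values)
    return expected_errors / len(quality_values)

# A consensus averaging QV 50 implies roughly 1 expected error per
# 100,000 bases, the threshold used for "Finished" sequence.
```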

Second, the article fails to address how the different standards might be applied in a practical sense. For example, what can I expect to do with a finished genome that I cannot do with a nearly finished genome? What is a standard draft useful for? How should I trust my results, and what might I expect to do to verify a finding? While the article does a good job describing the quality attributes of the data that genome centers might produce, the proposed standards would have broader impact if they more specifically set expectations for what can be done with the data.

Without this understanding, we still won't know when our data are good enough.

Sunday, November 8, 2009

Expeditiously Exponential: Data Sharing and Standardization

We can all agree that our ability to produce genomics and other kinds of data is increasing at exponential rates. Less clear are the consequences for how these data will be shared and ultimately used. These topics were explored in last month's (Oct. 9, 2009) policy forum feature in the journal Science.

The first article, listed under the category "megascience," dealt with issues around sharing 'omics data. The challenge is that systems biology research demands that data from many kinds of instrument platforms (DNA sequencing, mass spectrometry, flow cytometry, microscopy, and others) be combined in different ways to produce a complete picture of a biological system. Today, each platform generates its own kind of "big" data that, to be useful, must be computationally processed and transformed into standard outputs. Moreover, the data are often collected by different research groups focused on particular aspects of a common problem. Hence, the full utility of the data being produced can only be realized when the data are made open and shared throughout the scientific community. The article listed past efforts in developing sharing policies, and its central table included 12 data sharing policies that are already in effect.

Sharing data solves only half of the problem; the other half is being able to use the data once they are shared. This requires that data be structured and annotated in ways that make them understandable to a wide range of research groups. Such standards typically include minimum information checklists that define specific annotations and which data should be kept from different platforms. The data and metadata are stored in structured documents that reflect a community's view about what is important to know about how data were collected and the samples the data were collected from. The problem is that annotation standards are developed by diverse groups and, like the data, are expanding. This expansion creates new challenges for making data interoperable, the very problem standards try to address.
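To make the checklist idea concrete, here is a toy validation sketch; the field names are invented for illustration and are not drawn from any particular community standard:

```python
# Invented fields standing in for a minimum information checklist.
REQUIRED_FIELDS = {"organism", "platform", "library_protocol", "run_date"}

def missing_fields(record):
    """Return the checklist fields a submission still lacks."""
    return REQUIRED_FIELDS - record.keys()

missing_fields({"organism": "E. coli", "platform": "Illumina GA"})
# -> {"library_protocol", "run_date"}
```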

The article closed with high-level recommendations for enforcing policy through funding and publication requirements and acknowledged that full compliance requires that general concerns with pre-publication data use and patient information be addressed. More importantly, the article acknowledged that meeting data sharing and formatting standards has economic implications. That is, researchers need time-efficient data management systems, the right kinds of tools, and informatics expertise to meet standards. We also need to develop the right kind of global infrastructure to support data sharing.

Fortunately, complying with data standards is an area where Geospiza can help. First, our software systems rely on open, scientifically valid tools and technologies. In DNA sequencing, we support community-developed alignment algorithms. The statistical analysis tools in GeneSifter Analysis Edition utilize R and BioConductor to compare gene expression data from both microarrays and DNA sequencing. Further, we participate in the community by contributing additional open-source tools and standards through efforts like the BioHDF project. Second, the GeneSifter Analysis and Laboratory platforms provide the time-efficient data management solutions needed to move data through their complete life cycle: from collection, to intermediate analysis, to publishing files in standard formats.

By lowering researchers' economic barriers to meeting data sharing and annotation standards, GeneSifter keeps the focus on doing good science with the data.