Sunday, December 6, 2009

Expeditiously Exponential: Genome Standards in a New Era

One of the hot topics of 2009 has been the exponential growth in genomics and other data and how this growth will impact data use and sharing. The journal Science explored these issues in its policy forum in Oct. In early November, I discussed the first article, which was devoted to sharing data and data standards. The second article, listed under the category “Genomics,” focuses on how genomic standards need to evolve with new sequencing technologies.

Drafting By

The premise of the article “Genome Project Standards in a New Era of Sequencing” was to begin a conversation about how to define standards for sequence data quality in this new era of ultra-high throughput DNA sequencing. One of the “easy” things to do with Next Generation Sequencing (NGS) technologies is create draft genome sequences. A draft genomic sequence is defined as a collection of contig sequences that result from one, or a few, assemblies of large numbers of smaller DNA sequences called reads. In traditional Sanger sequencing a read was between 400 and 800 bases in length and came from a single clone, or sub-clone of a large DNA fragment. NGS reads, come from individual molecules in a DNA library and vary between 36 and 800 bases in length depending on the sequencing platform being used (454, Illumina, SOLiD, or Helicos).

A single NGS run can now produce enough data to create a draft assembly for many kinds of organisms with smaller genomes such as viruses, bacteria, and fungi. This makes it possible to create many draft genomes quickly and inexpensively. Indeed the article was accompanied by a figure showing that the current growth of draft sequences exceeds the growth of finished sequences by a significant amount. If this trend continues, the ratio of draft to finished sequences will grow exponentially into the foreseeable future.

Drafty Standards

The primary purpose for a complete genome sequence is to serve as a reference for other kinds of experiments. A well annotated reference is accompanied by a catalog of genes and their functions, as well as an ordering of the genes, regulatory regions, and the sequences needed for evolutionary comparisons that further elucidate genomic structure and function. A problem with draft sequences is that they can contain a large numbers of errors that range from incorrect base calls to more problematic mis-assemblies that place bases or groups of bases in the wrong order. Because, these holes leave some sequences are more drafty than others, they are less useful in fulfilling their purpose as reference data.

If we can describe the draftiness of a genome sequence we may be able to weight its fitness for a specific purpose. The article went on to tackle this problem by recommending a series of qualitative descriptions that describe levels of draft sequences. Beginning with the Standard Draft, or an assembly of contigs of unfiltered data from one or more sequencing platforms, the terms move through High-Quality Draft, to Improved High-Quality Draft, to Annotation-Directed Improvement, to Noncontiguous Finished, to Finished. Finished sequence is defined as less than 1 error per 100,000 bases and each genomic unit (chromosomes or plasmids that are capable of replication) is assembled into a single contig with a minimal number of exceptions. The individuals proposing these standards are a well respected group in the genome community and are working with the database groups and sequence ontology groups to incorporate these new descriptions into data submissions and annotations for data that may be used by others.

Given the high cost and lengthy time required to finish genomic sequences, finishing every genome to a high standard is impractical. If we are going to work with genomes that are finished to varying degrees, systematic ways to describe the quality of the data are needed . This policy recommendations are a good start, but more needs to be done to make the proposed standards useful.

First, standards need to be quantitative. Qualitative descriptions are less useful because they create downstream challenges when reference data are used in automated data processing and interpretation pipelines. As the numbers of available genomes grow into the thousands and tens of thousands, subjective standards make the data more and more cumbersome and difficult to review. Moreover without quantitative assessment, how will one know when they have an average error rate of 1 in 100,000 bases? The authors intentionally avoided recommending numeric thresholds in the proposed standards because the instrumentation and sequencing methodologies are changing rapidly. This may be true, but future discussions nevertheless should focus on quantitative descriptions for that very reason. It is because data collection methods and instrumentation are changing rapidly that we need measures we can compare. This is the new world.

Second, the article fails to address how the different standards might be applied in a practical sense. For example, what can I expect to do with a finished genome that I cannot do with a nearly finished genome? What is a standard draft useful for? How should I trust my results and what might I expect to do to verify a finding? While the article does a good job describing the quality attributes of the data that genome centers might produce, the proposed standards would have broader impact if they could more specifically set expectations of what could be done with data.

Without this understanding, we still won't know when when our data are good enough.

No comments: