Thursday, January 17, 2013

Bio Databases 2013

I seem to have committed to an annual ritual of summarizing the Nucleic Acids Research (NAR) Database Issue [1]. I do this because it is important to understand and emphasize the increasing role of data analysis in modern biology and remind us about the challenges that persist in turning data into knowledge.

Sometimes I hear individuals say they are building a database of all knowledge. To them I say good luck! The reality is that new knowledge is developed from unique insights that are derived from specialized aggregations of information. Hence, as more data become available through decreasing data collection costs, the number of resources and tools used to organize, analyze, and annotate data and information increases. Interestingly, data costs fall because technical improvements increase production, which grows exponentially, whereas the number of databases grows linearly. Collecting data is the easy part.

How many are there?

Databases live in the wild and thus are hard to count. Reading the introduction to the database issue, one would think 88 new databases were added (cited), but if you compare the number tracked by NAR in 2012 (1380) to the number in 2013 (1512), you get 132. Moreover, the databases tracked by NAR are contributed by their authors, and some authors don't bother. For example, SeattleSNPs, home of the SeattleSeq Annotation and the important Genome Variant Servers*, is not listed in NAR. Nevertheless, the NAR registry continues to increase by about 100 databases per year.

What's new?

Last year, I noted that the new databases did not reflect any discernible pattern in terms of how the field of biology was changing. Rather, the new databases reflected increasing specialization and complexity. That trend continues, but this year Fernández-Suárez and Galperin note the emergence of new databases for studying human disease. Altogether, eight databases were cited in the introduction. Several others are listed in a table highlighting the new databases. While databases specializing in human genetics are not new, the past year saw an increased emphasis on understanding the relationship between genotype and phenotype as we advance our understanding of rare variation and population genetics.

As noted, many databases support human genomics research. If you visit the NAR Database Summary Category List and expand the list of databases under the Human Genes and Diseases category, you find four subcategories (General human genetics, General polymorphism, Cancer gene, and Gene-, system-, or disease-specific databases) listing approximately 174 databases. I say approximately because, as noted above, databases are hard to count. Curiously, just above Human Genes and Diseases is a category called Human and Vertebrate Genomes. Databases are hard to classify, too.

What's useful?

It is clear that the growing number of databases reflects an increasing level of specialization. Also likely is a high degree of redundancy. Ten microRNA databases (found by virtue of starting with "miR") cover general and specific topics, including miRNAs that are predicted from sequence or literature, verified by experiment as existing or having a target, possibly pathogenic, or present in different organisms. It would be interesting to see which of these databases hold the same data, but that is hard because some sites make all of their data available and some make their data searchable only. Even in the former case, the data must first be put into a common format before comparisons can be made. Hence, access and interoperability issues persist.
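As a rough illustration of the comparison described above, the sketch below puts miRNA identifiers from two exports into a common form and measures how much they overlap. The identifier lists, the `normalize` rule, and the function names are assumptions made for illustration. They do not describe the export format of any real miRNA database, and lowercasing is a simplification that real identifier schemes may not tolerate.

```python
def normalize(identifier: str) -> str:
    """Map an identifier to a common form (assumed rule: trim and lowercase)."""
    return identifier.strip().lower()


def overlap(a, b):
    """Return (shared identifiers, Jaccard index) for two identifier lists."""
    set_a = {normalize(x) for x in a}
    set_b = {normalize(x) for x in b}
    shared = set_a & set_b
    union = set_a | set_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Hypothetical exports from two databases; not real database contents.
db_one = ["hsa-miR-21", "hsa-miR-155 ", "hsa-let-7a"]
db_two = ["HSA-MIR-21", "hsa-miR-155", "hsa-miR-122"]

shared, jaccard = overlap(db_one, db_two)
print(sorted(shared), jaccard)
```

The hard part in practice is the `normalize` step, since each site chooses its own naming and file layout. Sites that are searchable only cannot be compared this way at all.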

Databases also persist. Fernández-Suárez and Galperin commented on efforts to curate the NAR collection. The annual attrition rate is less than 5%, and greater than 90% of the databases are functional as determined by their response to webbots. Some have merged into other projects. What is not known is the quality of the information. In other words, how are databases verified for accuracy or maintained to reflect our changing state of knowledge? As databases become increasingly used in medical sequencing, caveat emptor changes to caveat venditor, and validation will be a critical component of design and maintenance. Perhaps future issues of the NAR database update will comment on these challenges.
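To make the webbot check concrete, here is a minimal sketch of how a "functional" fraction might be computed. It is not the procedure NAR uses. The rule that any 2xx or 3xx HTTP status counts as functional, the function names, and the placeholder URLs are all assumptions.

```python
import urllib.error
import urllib.request


def is_functional(status):
    """Treat any 2xx or 3xx HTTP status as functional (assumed rule)."""
    return status is not None and 200 <= status < 400


def probe(url, timeout=10):
    """Return the HTTP status for url, or None if the site cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def functional_fraction(statuses):
    """Fraction of databases whose recorded status counts as functional."""
    if not statuses:
        return 0.0
    return sum(is_functional(s) for s in statuses.values()) / len(statuses)


# Hypothetical recorded results with placeholder URLs, so no network is needed.
recorded = {
    "http://db-one.example": 200,
    "http://db-two.example": 404,
    "http://db-three.example": None,
    "http://db-four.example": 301,
}
print(functional_fraction(recorded))
```

In a real survey, `recorded` would be filled by calling `probe` on each registered URL. A check like this only shows that a server answers. It says nothing about whether the content is accurate or current, which is the open question raised above.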

[1] Fernández-Suárez XM and Galperin MY (2013). The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Research, 41 (D1). PMID: 23203983

* The SeattleSeq and Genome Variant Server links will break at the next update because the URLs contain the respective database version numbers.
