Tuesday, January 24, 2012
Thursday, January 5, 2012
Let's get 2012 started with an update on the growth of biological databases. About this time last year, I summarized Nucleic Acids Research's (NAR) annual database issue, in which authors submit papers describing new databases and updates to existing ones. In that post, I predicted that we would see between 60 and 120 new databases in 2011. This year's update included 92 new databases [1].
How have things changed?
Overall, the number of databases tracked by NAR has grown from 1330 in 2011 to 1380 in 2012 (left hand figure). Since 92 new databases were added but the total rose by only 50, 42 must have been dropped. Interestingly, the new database list page shows only 90 databases. I counted them on my fingers and toes, more than twice, and also copied the list into Excel to verify my original count. I wonder what the two unaccounted-for new databases are? As for the 42 that disappeared, those are not listed at all.
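For anyone who would rather not use fingers and toes, the bookkeeping can be sketched in a few lines of Python. This is only a minimal sketch: the four counts are the ones quoted above, while the variable names and the `implied_dropped` helper are my own, made up for illustration.

```python
# Counts quoted in this post; all names below are illustrative assumptions.
total_2011 = 1330    # databases tracked by NAR in 2011
total_2012 = 1380    # databases tracked by NAR in 2012
new_reported = 92    # new databases reported for this year's issue
new_listed = 90      # entries counted on the new database list page


def implied_dropped(before, after, added):
    """Return how many entries must have been removed for the totals to balance."""
    return before + added - after


net_growth = total_2012 - total_2011
dropped = implied_dropped(total_2011, total_2012, new_reported)
unlisted = new_reported - new_listed

print(f"net growth: {net_growth}")
print(f"implied dropped: {dropped}")
print(f"new but unlisted: {unlisted}")
```

The only assumption in the calculation itself is that the two totals and the count of new databases all refer to the same collection, so that whatever does not balance must have been removed.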
What are the new databases?
Reviewing the list of the 90 new databases does not reveal any clear patterns or trends regarding changes in biology. Instead, the list reflects increasing complexity and specialization. Some databases tackle new kinds of data, many appear to be refinements of existing databases, and others contain highly specific information. For example, the UMD-BRCA1/BRCA2 database contains information about BRCA1 and BRCA2 mutations detected in France. A few databases caught my attention because of their interesting descriptions: BitterDB, a database of bitter taste molecules and receptors; Newt-omics, data on the red spotted newt (Notophthalmus viridescens); and IDEAL, Intrinsically Disordered proteins with Extensive Annotations and Literature. I especially like Disease Ontology: Ontology for a variety of human diseases. I wonder which diseases are their favorites?
Also of interest is the growth of wikis as databases. This year, 10 of the new databases are wikis, where the community is invited to contribute to the resource in much the same way as on Wikipedia. I should say almost 10 of the databases are wikis. One, SeqAnswers, is actually a forum, which is technically not a wiki. Of course, some might argue that a wiki is not a database either.
So, what does this mean?
The growing list of databases is an interesting way to think about biology, DNA, and proteins. Simply examining the list is instructive, and one can think of ways that mining several resources together could create new insights. However, therein lies the challenge. Many of these resources are not designed to interoperate, and it is not clear how long they will last or continue to be updated. This final point was made at the close of the introduction article. In a section entitled "Sustainability of bioinformatics databases," the authors discussed the past year's controversy surrounding NCBI's SRA database, previous challenges with Swiss-Prot, and the current instability with the KEGG and TAIR databases. They cite a proposal to centralize more resources [2].
But in the end, biological databases really do model biology and follow the principles of evolution. Perhaps it is apropos that new ones emerge through speciation, while others go extinct.
1. Galperin, M., & Fernandez-Suarez, X. (2011). The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Research, 40 (D1). DOI: 10.1093/nar/gkr1196
2. Parkhill, J., Birney, E., & Kersey, P. (2010). Genomic information infrastructure after the deluge. Genome Biology, 11, 402.