Wednesday, January 5, 2011

Databases of databases

To kick off 2011, let’s talk about databases. Our ability to collect ever increasing amounts of data at faster rates is driving a corresponding increase in specialized databases that organize data and information. A recent editorial publication in the journal Database, yes we now have a journal called Database, proposed a draft information specification for biological databases.


Because we derive knowledge by integrating information in novel ways, hence we need ways to make specialized information repositories interoperate.

Why do we need standards?

Because the current and growing collection of specialized databases are poorly characterized with respect their mission, categorization, and practical use. 

The Nucleic Acids Research (NAR) database of databases illustrates the problem well. For the past 18 years, every Jan 1, NAR has published a database issue in which new databases are described along with others that have been updated. The issues typically contain between 120 and 180 articles representing a fraction of the databases listed by NAR. When one counts the number of databases, explores their organization, and reads the accompanying editorial introductions several interesting observations can be made.

First, over the years there has been a healthy growth of databases designed to capture many different kinds of specialized information. This year’s NAR database was updated to include 1330 databases, which are organized into 14 categories and 40 subcategories. Some categories like Metabolic and Signaling Pathways, or Organelle Databases look highly specific whereas others such as Plant Databases, or Protein Sequence Databases are general. Subcategories have similar issues. In some cases categories and subcategories make sense. In other cases it is not clear what the intent of categorization is. For example the category RNA sequence databases lists 73 databases that appear to be mostly rRNA and smallRNA databases. Within the list are a couple of RNA virus databases like the HIV Sequence Database and Subviral RNA Database. These are also listed under subcategory Viral Genome Databases in the Genomics Databases (non-vertebrates) category in an attempt to cross reference the database under different category. OK, I get that. HIV is in RNA sequences because it is an RNA virus. But what about the hepatitis, influenza, and other RNA viruses listed under Viral genome databases, why aren’t they in RNA sequences? While I’m picking on RNA sequences, how come all of the splicing databases are listed in a subcategory in Nucleotide Sequence Databases? Why isn’t RNA Sequences a subcategory under Nucleotide Sequence Databases? Isn’t RNA composed of Nucleotides? It makes one wonder how the databases are categorized.

Categorizing databases to make them easy to find and quickly grock their utility is clearly challenging. This issue becomes more profound when the level of database redundancy, determined from the databases’ names, is considered. This analysis is, of course, limited to names that can be understood. The Hollywood Database for example does not store Julia Roberts' DNA, rather it is an exon annotation database. Fortunately, many databases are not named so cleverly. Going back to our RNA Sequence category we can find many ribosomal sequence databases, several tRNA databases, and general RNA sequence databases. There is even a cluster of eight microRNA databases all starting with an “mi” or “miR” prefix. There are enough rice (18) and Arabidopsis databases (28) that they get their own subcategories. Without too much effort one can see there are many competing resources, yet choosing the ones best suited for your needs would require a substantial investment of time to understand how these databases overlap, where they are unique, and, in many cases, navigate idiosyncratic interfaces and file formats to retrieve and utilize their information. When maintenance and overall information validity is factored in, the challenge compounds.

How do things get this way?

Evolution. Software systems, like biological systems change over time. Databases like organisms evolve from common ancestors. In this way, new species are formed. Selective pressures enhance useful features and increase their redundancy, and cause extinctions. We can see these patterns in play for biological databases by examining the tables of contents and introductory editorials for the past 16 years of the NAR database issue. Interestingly, in the past six or seven years, the issue’s editor has made the point of recording the issue’s anniversary making 2011 the 18th year of this issue. Yet, easily accessible data can be obtained only to 1996, 16 years ago. History is hard and incomplete, just like any evolutionary record.

We cannot discuss database diversity without trying to count the number of species. This is hard too. NAR is an easily assessable site with a list that can be counted. However, one of the databases in the 1999 and 2000 NAR database issues was DBcat, a database of databases. At that time it tracked more than 400 and 500 databases, while NAR tracked 103 and 227 databases, respectively, a fairly large discrepancy. DBcat eventually went extinct due to lack of funding, a very significant selective pressure. Speaking of selective pressure it would be interesting to understand how databases are chosen to be included in NAR’s list. Articles are submitted, and presumably reviewed, so there is self selection and some peer review. The total number of new databases, since 1996, is 1,308, which is close to 1,330, so the current list likely an accumulation of database entries submitted over the years. From the editorial comments, the selection process is getting more selective as statements have changed from databases that might be useful (2007) to "carefully selected" (2010, 2011). Back in 2006, 2007, and 2008 there was even mention of databases being dropped due to obsolescence.

Where do we go from here?

From the current data, one would expect that 2011 will see between 60 and 120 new databases appear. Groups like BioDBcore and other committees and trying to encourage standardization with respect to a database’s meta data (name, URL, contacts, when established, data stored, standards used, and much more). This may be helpful, but when I read the list of proposed standards, I am less optimistic because the standards do not address the hard issues. Why, for example, do we need 18 different rice databases or eight “mi*” databases. For that matter, why do we have a journal called Database? NAR does a better job listing databases than Database and being a journal about databases wouldn't it be a good idea if Database tracked databases? And, critical information, like usage, last updated, citations, and user comments, which would help one more easily evaluate whether to invest time investigating the resource, are missing.

Perhaps we should think about better ways to communicate a database’s value to the community it serves, and in the case where several databases do similar things, standards committees should discuss how to bring the individual databases together to share their common features and accentuate their unique qualities to support research endeavors. After all, no amount of annotation is useful if I still have too many choices to sort through.


giovanni said...

the figure already shows ~100 new databases for 2011. How can it be? Is it a prediction, or there are alredy 100 pre-printed papers about databases as of January 2011?

About the Database journal: consider that it is a really young journal, less than 2 years old. Moreover the main purpose of Nucleic Acid Research, as its name suggest, is not cataloguing databases, even if they are doing a good job. As the two journals are from the same group, let's see if Database will take the place of the NAR issue some day.

Todd Smith said...

Giovanni, you are correct, the database grown is for the previous year. Since the NAR issue comes out in Jan, it reports on DBs that have appeared in previous years.

Good point on the Database journal. I agree, it would make sense if they fulfilled NAR's role.