Friday, January 21, 2011

dbSNP, or is it?

dbSNP is NCBI’s catalog of DNA variation. While the SNP in the name implies a focus on Single Nucleotide Polymorphisms, dbSNP is far more comprehensive and includes length variants, mutations, and a plethora of annotations that characterize over 75 million variants from 89 organisms.

Previously, I discussed how the number of biological information repositories grows each year. Nucleic Acids Research now tracks over 1,300 databases that contain specialized subsets of DNA, RNA, and protein sequences, and other kinds of biochemical data, along with annotations (metadata) that can be mined or searched to aid our scientific research.

Most of these resources have descriptions and publications that give high-level details about their missions: what kinds of data are stored and how the resource can be used. While useful, these descriptions are typically out of date because, like everything else in information science, each repository is growing significantly in both the data it stores and the annotations attached to those data. As we contemplate how to use these resources in integrated analyses, we need methods to summarize their content and extract data in global ways.

As an example, let’s consider dbSNP. According to dbSNP’s build history, the first release was on Dec 1, 1998. Build 2, nine days later, added 11 new SNPs from debnick’s 981209.dat file. Now dbSNP is 12 years old and at build 132. (The term build is a software way of saying version.)

Several methods can be used to access dbSNP data, including multiple web interfaces at NCBI and flat files that hold entire datasets in XML, SQL table, and VCF formats. When first published in 1999 [1], dbSNP contained 4,713 variants and few annotations. As of build 132, the human-specific database contains either 30,443,455 variants (release notes) or 28,826,054 variants (VCF file). Why the discrepancy? Share your ideas in a comment.

What else can we learn about human DNA variation from dbSNP’s VCF file?

The entire collection of human variants - minus 1,617,401 - can be obtained in a single VCF file. VCF stands for Variant Call Format; it is a standard created by the 1000 Genomes Project [2] to list and annotate genotypes. The vCard format also uses the ".vcf" file extension, so genomic VCF files get cute business-card icons on Macs and Windows. Bioinformatics is fun that way.

The dbSNP VCF file is a convenient way to get a global view of the database. Variants are listed by chromosome and position. Each position also lists the NCBI rs ID, the reference sequence, and the variant sequence(s). Chromosomes include the 22 autosomes, X, Y, MT (mitochondria), and PAR (Pseudoautosomal Regions [3]). The last column (INFO) is the most interesting. It can contain up to 49 specific annotations that describe a variant’s type, its origin, biological features, population characteristics, and whether it is linked to other resources. Many of the annotations are tags, but some contain additional information. Between the different tags and the information within tags, there are over 50 ways to describe a variant. One of the most obfuscated annotations is a 12-byte bitfield, which must be decoded to be understood. No worries: much of its information appears to be repeated in the readable annotations. Others have also noted that the bitfield is overly clever.
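
To make the layout concrete, here is a minimal sketch of pulling one record apart in Python. The sample line is made up, and the VC and build-number tags are illustrative; the authoritative tag list lives in the file’s header, when it is listed at all.

    # A minimal sketch of splitting one dbSNP-style VCF record.
    # The sample line is illustrative, not copied from the real file.
    line = "1\t10327\trs123\tT\tC\t.\t.\tVC=SNP;dbSNPBuildNumber=132;SOMEFLAG"

    # The first eight tab-separated columns are fixed by the VCF standard.
    chrom, pos, rsid, ref, alt, qual, flt, info = line.split("\t")

    flags = set()   # bare tags that act as yes/no markers
    values = {}     # key=value annotations
    for field in info.split(";"):
        if "=" in field:
            key, _, val = field.partition("=")
            values[key] = val
        else:
            flags.add(field)

    print(chrom, pos, rsid, ref, alt, values.get("VC"), sorted(flags))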

Digging deeper

We can use the VCF file to learn a great deal about dbSNP and the biology of human variation. But you cannot do this by reading the file; it is over 30 million lines long! You are also not going to analyze these data in Excel™, so you need to put them in some kind of database or binary file format. I used HDF5 and PyTables to load the data into what is essentially a 30-million-row by 50-plus-column table that can be queried efficiently with simple Python scripts.
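
For the curious, here is a stripped-down sketch of that kind of loader. Only a handful of the 50+ columns are shown, the file names are made up, and the build-number tag name echoes the annotation discussed below; check the header of the file you actually download.

    import tables

    # A handful of the 50+ columns, for illustration only; the real
    # table carries one column per INFO annotation.
    class Variant(tables.IsDescription):
        chrom = tables.StringCol(8)
        pos = tables.Int64Col()
        rsid = tables.StringCol(16)
        ref = tables.StringCol(64)
        alt = tables.StringCol(64)
        build = tables.Int32Col()  # dbSNP build that added the variant

    h5 = tables.open_file("dbsnp.h5", mode="w")
    table = h5.create_table("/", "variants", Variant, "dbSNP build 132")

    row = table.row
    with open("dbsnp_b132.vcf") as vcf:  # illustrative file name
        for line in vcf:
            if line.startswith("#"):  # skip the header
                continue
            chrom, pos, rsid, ref, alt, _, _, info = line.rstrip("\n").split("\t")[:8]
            tags = dict(f.split("=", 1) for f in info.split(";") if "=" in f)
            row["chrom"] = chrom
            row["pos"] = int(pos)
            row["rsid"] = rsid
            row["ref"] = ref
            row["alt"] = alt
            row["build"] = int(tags.get("dbSNPBuildNumber", 0))
            row.append()

    table.flush()
    h5.close()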

So, what can we learn? We can use the dbSNPBuildNumber annotation to observe dbSNP’s growth. Cumulative growth shows that the recent builds have contributed the majority of variants. In particular, the 1000 Genomes Project’s contributions account for close to 85% of the variants in dbSNP [2]. Examining the data by build number also shows that a build can consist of relatively few variants. There are even 369 variants from a future build 133.
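
Those per-build counts fall out of a simple tally over the table built above (using the build column defined in the loader sketch):

    from collections import Counter
    import tables

    h5 = tables.open_file("dbsnp.h5", mode="r")
    # Tally variants per dbSNP build, then accumulate to see growth.
    per_build = Counter(row["build"] for row in h5.root.variants.iterrows())
    h5.close()

    running_total = 0
    for build in sorted(per_build):
        running_total += per_build[build]
        print(build, per_build[build], running_total)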

In dbSNP, variants are described in four ways: SNP, INDEL, MIXED, and MULTI-BASE. The vast majority (82%) are SNPs, with INDELs (INsertions and DELetions) forming a large second class. When MIXED and MULTI-BASE variants are examined, they do not look any different from INDELs, so I am not clear on the purpose of these tags. Perhaps they are there for bioinformatics enjoyment, because you have to find the terms in the 30 million lines; they are not listed in the VCF header. It was also interesting to learn that the longest indels are on the order of 1,000 bases.
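
Counting the classes takes one pass over the raw file. A sketch, assuming the class is carried in a VC= tag in the INFO column; since the terms are undocumented in the header, that tag name is something to verify against the file itself:

    from collections import Counter

    classes = Counter()
    longest_allele = 0

    with open("dbsnp_b132.vcf") as vcf:  # illustrative file name
        for line in vcf:
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            ref, alt, info = cols[3], cols[4], cols[7]
            # Assumption: the variant class is carried in a VC= tag.
            for field in info.split(";"):
                if field.startswith("VC="):
                    classes[field[3:]] += 1
                    break
            # The longest REF or ALT allele is a proxy for indel length.
            longest_allele = max(longest_allele, len(ref),
                                 max(len(a) for a in alt.split(",")))

    print(classes.most_common())
    print("longest allele:", longest_allele, "bases")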

The VCF file contains both the reference and variant sequences at each position. The variant column can list multiple bases, or sequences, because each position can have multiple alleles. While nearly all variants have a single allele, about 3% have between two and 11 alleles. I assume the variants with the highest numbers of alleles represent complex indel patterns.
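
Allele counts come from the same kind of scan: the ALT column is comma-separated, so counting alleles (alternate alleles only, as above) is just counting entries:

    from collections import Counter

    allele_counts = Counter()
    with open("dbsnp_b132.vcf") as vcf:  # illustrative file name
        for line in vcf:
            if line.startswith("#"):
                continue
            alt = line.split("\t")[4]
            # ALT lists the alternate alleles, separated by commas.
            allele_counts[len(alt.split(","))] += 1

    for n_alleles in sorted(allele_counts):
        print(n_alleles, allele_counts[n_alleles])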

Given that the haploid human genome contains approximately 3.1 billion bases, the variants in dbSNP represent nearly 1% of the available positions. Their distribution across the 22 autosomes, defined as the number of variant positions divided by the chromosome length, is fairly constant at around 1%. The X and Y chromosomes and their PARs are, however, much lower, with ratios of 0.45%, 0.12%, and 0.08%, respectively. A likely reason for this observation is that sampling of sex chromosomes in current sequencing projects is low relative to autosomes. The number of mitochondrial variants, as a fraction of genome length, is, on the other hand, much higher at 3.95%.
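
The ratios themselves are just per-chromosome counts from the HDF5 table divided by chromosome lengths. A sketch for a few chromosomes, using GRCh37 lengths (the reference build is my assumption):

    from collections import Counter
    import tables

    # GRCh37 chromosome lengths, in bases (assumed reference).
    chrom_length = {b"1": 249250621, b"X": 155270560,
                    b"Y": 59373566, b"MT": 16569}

    h5 = tables.open_file("dbsnp.h5", mode="r")
    # StringCol values read back as bytes, hence the b"" keys above.
    counts = Counter(row["chrom"] for row in h5.root.variants.iterrows())
    h5.close()

    for name, length in sorted(chrom_length.items()):
        pct = 100.0 * counts[name] / length
        print("%s: %d variants, %.2f%% of positions"
              % (name.decode(), counts[name], pct))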

It is important to point out that the 1% level of variation is simply a count of all currently known positions where variation can occur in individuals. These data are compiled from many individuals. On a per-individual basis, the variation frequency is much lower: each individual’s genome differs from other people’s genomes at three to four million positions.

Clearly, from the original publications and current dbSNP reports, the database has grown significantly, is much richer, and is changing in ways that cannot easily be communicated through traditional publication methods. The challenge going forward is developing software that can adapt to these changes to evaluate resources and use their content in effective ways.

Next time I’ll discuss the annotations.

References and notes:

1. Sherry ST, Ward M, & Sirotkin K (1999). dbSNP: database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Research, 9(8), 677-679. PMID: 10447503

2. The 1000 Genomes Project (1KG) is an endeavor to characterize DNA sequence variation in humans. Visit the website at www.1000genomes.org or read the paper: The 1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature, 467, 1061-1073.

3. PAR - Pseudoautosomal Regions - are located at the ends of the X and Y chromosomes. The DNA in these regions can recombine between the sex chromosomes, hence genes in the PARs show autosomal, rather than sex-linked, inheritance patterns. You can read more at http://en.wikipedia.org/wiki/Pseudoautosomal_region

Wednesday, January 5, 2011

Databases of databases

To kick off 2011, let’s talk about databases. Our ability to collect ever-increasing amounts of data at faster rates is driving a corresponding increase in specialized databases that organize data and information. A recent editorial in the journal Database (yes, we now have a journal called Database) proposed a draft information specification for biological databases.

Why?

Because we derive knowledge by integrating information in novel ways, and that requires making specialized information repositories interoperate.

Why do we need standards?

Because the current and growing collection of specialized databases is poorly characterized with respect to mission, categorization, and practical use.

The Nucleic Acids Research (NAR) database of databases illustrates the problem well. For the past 18 years, every Jan 1, NAR has published a database issue in which new databases are described along with others that have been updated. The issues typically contain between 120 and 180 articles, representing a fraction of the databases listed by NAR. When one counts the number of databases, explores their organization, and reads the accompanying editorial introductions, several interesting observations can be made.

First, over the years there has been a healthy growth of databases designed to capture many different kinds of specialized information. This year’s NAR list was updated to include 1,330 databases, organized into 14 categories and 40 subcategories. Some categories, like Metabolic and Signaling Pathways or Organelle Databases, look highly specific, whereas others, such as Plant Databases or Protein Sequence Databases, are general. Subcategories have similar issues. In some cases the categories and subcategories make sense; in other cases it is not clear what the intent of the categorization is. For example, the category RNA Sequence Databases lists 73 databases that appear to be mostly rRNA and small RNA databases. Within the list are a couple of RNA virus databases, like the HIV Sequence Database and the Subviral RNA Database. These are also listed under the subcategory Viral Genome Databases in the Genomics Databases (non-vertebrates) category, in an attempt to cross-reference the databases under different categories. OK, I get that: HIV appears under RNA sequences because it is an RNA virus. But what about the hepatitis, influenza, and other RNA viruses listed under Viral Genome Databases; why aren’t they in RNA Sequence Databases? And while I’m picking on RNA Sequence Databases, how come all of the splicing databases are listed in a subcategory of Nucleotide Sequence Databases? Why isn’t RNA Sequence Databases itself a subcategory under Nucleotide Sequence Databases? Isn’t RNA composed of nucleotides? It makes one wonder how the databases are categorized.

Categorizing databases so they are easy to find and their utility quick to grok is clearly challenging. The issue becomes more profound when the level of database redundancy, estimated from the databases’ names, is considered. This analysis is, of course, limited to names that can be understood. The Hollywood Database, for example, does not store Julia Roberts’ DNA; rather, it is an exon annotation database. Fortunately, many databases are not named so cleverly. Going back to our RNA Sequence Databases category, we can find many ribosomal sequence databases, several tRNA databases, and general RNA sequence databases. There is even a cluster of eight microRNA databases, all starting with an “mi” or “miR” prefix. There are enough rice (18) and Arabidopsis (28) databases that they get their own subcategories. Without too much effort one can see there are many competing resources, yet choosing the ones best suited to your needs would require a substantial investment of time to understand how these databases overlap, where they are unique, and, in many cases, to navigate idiosyncratic interfaces and file formats to retrieve and use their information. When maintenance and overall information validity are factored in, the challenge compounds.

How do things get this way?

Evolution. Software systems, like biological systems, change over time. Databases, like organisms, evolve from common ancestors, and in this way new species are formed. Selective pressures enhance useful features, increase redundancy, and cause extinctions. We can see these patterns in play for biological databases by examining the tables of contents and introductory editorials for the past 16 years of the NAR database issue. Interestingly, in the past six or seven years, the issue’s editor has made a point of recording the issue’s anniversary, making 2011 the 18th year of the issue. Yet easily accessible data go back only to 1996, 16 years of issues. History is hard and incomplete, just like any evolutionary record.

We cannot discuss database diversity without trying to count the number of species. This is hard too. NAR is an easily accessible site with a list that can be counted. However, one of the databases in the 1999 and 2000 NAR database issues was DBcat, a database of databases. At that time it tracked more than 400 and then 500 databases, while NAR tracked 103 and 227 databases, respectively, a fairly large discrepancy. DBcat eventually went extinct due to lack of funding, a very significant selective pressure. Speaking of selective pressure, it would be interesting to understand how databases are chosen for inclusion in NAR’s list. Articles are submitted, and presumably reviewed, so there is self-selection and some peer review. The total number of new databases since 1996 is 1,308, which is close to 1,330, so the current list is likely an accumulation of database entries submitted over the years. From the editorial comments, the selection process is getting more selective: the language has changed from databases that might be useful (2007) to "carefully selected" (2010, 2011). Back in 2006, 2007, and 2008 there was even mention of databases being dropped due to obsolescence.

Where do we go from here?

From the current data, one would expect 2011 to see between 60 and 120 new databases appear. Groups like BioDBcore and other committees are trying to encourage standardization of a database’s metadata (name, URL, contacts, when established, data stored, standards used, and much more). This may be helpful, but when I read the list of proposed standards I am less optimistic, because the standards do not address the hard issues. Why, for example, do we need 18 different rice databases or eight “mi*” databases? For that matter, why do we have a journal called Database? NAR does a better job of listing databases than Database does; for a journal about databases, wouldn’t tracking databases be a good idea? And critical information, like usage, last update, citations, and user comments, which would help one more easily evaluate whether to invest time investigating a resource, is missing.

Perhaps we should think about better ways to communicate a database’s value to the community it serves. And where several databases do similar things, standards committees should discuss how to bring those databases together to share their common features and accentuate their unique qualities in support of research. After all, no amount of annotation is useful if I still have too many choices to sort through.