Previously, I discussed how the number of biological information repositories is growing each year. Nucleic Acids Research now tracks over 1300 databases that contain specialized subsets of DNA, RNA, and protein sequences, and other kinds of biochemical data, along with annotations (metadata) that can be mined or searched to aid our scientific research.
Most of these resources have publications that describe, at a high level, what kinds of data are stored and how the resource can be used. While useful, these descriptions are typically out of date because, like everything else in information science, each repository keeps growing in both the data it stores and the annotations on those data. As we contemplate how to use these resources in integrated analyses, we need methods to summarize their content and extract data in global ways.
As an example, let’s consider dbSNP. According to dbSNP’s build history, the first release was on Dec 1, 1998. Build 2, 9 days later, had 11 new SNPs added from debnick’s 981209.dat file. Now, dbSNP is 12 years old and is at build 132. The term build is a software way of saying version.
Several methods can be used to access dbSNP data. These include multiple web interfaces at NCBI and flat files that hold entire datasets in XML, SQL table, and VCF formats. When first published in 1999 [1], dbSNP contained 4713 variants and few annotations. As of build 132, the human-specific database contains either 30,443,455 variants (release notes) or 28,826,054 variants (VCF file). Why the discrepancy? Share your idea with a comment.
What else can we learn about human DNA variation from dbSNP’s VCF file?
The dbSNP VCF file is a convenient way to get a global view of the database. Variants are listed by chromosome and position. Each position also lists the NCBI rs ID, the reference sequence, and the variant sequence(s). Chromosomes include the 22 autosomes, X, Y, MT (mitochondria) and PAR (Pseudoautosomal Regions [3]). The last column (INFO) is the most interesting. It can contain up to 49 specific annotations that describe a variant’s type, its origin, biological features, population characteristics, and whether it is linked to other resources. Many of the annotations are simple tags, but some carry additional information. Between the different tags and the information within tags, there are over 50 ways to describe a variant. One of the most obfuscated annotations is a 12-byte bitfield, which must be decoded to be understood. Not to worry, much of its information appears to be repeated in the readable annotations. Others have also noted that the bitfield is overly clever.
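To make the column layout concrete, here is a minimal sketch of splitting one VCF data line into its fixed columns and turning the INFO column into a dictionary. It assumes the standard tab-delimited VCF layout (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) with semicolon-separated INFO entries. The function name `parse_vcf_line`, the example record, and its tag names (`TYPE`, `FLAG_A`, `COUNT`) are made up for illustration and are not real dbSNP annotations.

```python
def parse_vcf_line(line):
    """Split one tab-delimited VCF data line into its fixed columns.

    Assumes the standard column order: CHROM, POS, ID, REF, ALT,
    QUAL, FILTER, INFO. Any columns after INFO are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, rs_id, ref, alt, _qual, _filter, info = fields[:8]

    annotations = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue  # empty INFO column or trailing semicolon
        key, sep, value = item.partition("=")
        # A bare tag (no "=") is a flag; store it as True.
        annotations[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": rs_id,
        "ref": ref,
        "alt": alt.split(","),  # one or more variant sequences
        "info": annotations,
    }


# Hypothetical record, invented for this example (not taken from dbSNP).
example = "1\t10583\trs0000001\tG\tA,T\t.\t.\tTYPE=snv;FLAG_A;COUNT=2"
record = parse_vcf_line(example)
```

Values are left as strings on purpose; which tags hold numbers, lists, or the bitfield depends on the header of the actual file.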
Digging deeper
We can use the VCF file to learn a great deal about dbSNP and the biology of human variation, but not by reading it: the file is over 30 million lines long. You’re also not going to be able to analyze these data in Excel™, so you need to put the data into some kind of database or binary file format. I used HDF5 and PyTables to put the data into what is essentially a 30-million-row by 50+ column table that can be efficiently queried using simple Python scripts.
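The same load-then-query idea can be sketched with nothing but Python’s standard library. The example below uses an in-memory SQLite database as a stand-in for the HDF5/PyTables table; it is not the code I used, just an illustration of the approach. The table layout, the function names `load_variants` and `count_by_chromosome`, and the sample records are all assumptions made up for the example.

```python
import sqlite3


def load_variants(conn, lines):
    """Load VCF data lines into a 'variants' table, skipping header lines."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variants ("
        "chrom TEXT, pos INTEGER, rs_id TEXT, ref TEXT, alt TEXT, info TEXT)"
    )
    # Header and comment lines in VCF start with '#'.
    rows = (
        line.rstrip("\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("#")
    )
    # Keep CHROM, POS, ID, REF, ALT and the raw INFO string (column 8).
    conn.executemany(
        "INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)",
        ((f[0], int(f[1]), f[2], f[3], f[4], f[7]) for f in rows),
    )
    conn.commit()


def count_by_chromosome(conn):
    """Return {chromosome: number of variant records}."""
    cur = conn.execute(
        "SELECT chrom, COUNT(*) FROM variants GROUP BY chrom ORDER BY chrom"
    )
    return dict(cur.fetchall())


# Tiny made-up sample; the records are invented for illustration.
sample = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\trs0000001\tA\tG\t.\t.\tFLAG_A\n",
    "1\t250\trs0000002\tC\tT\t.\t.\tTYPE=snv\n",
    "X\t300\trs0000003\tG\tA,C\t.\t.\tFLAG_B\n",
]

conn = sqlite3.connect(":memory:")
load_variants(conn, sample)
counts = count_by_chromosome(conn)
```

Because `load_variants` accepts any iterable of lines, a real file could be streamed in by passing an open file handle (for a compressed file, one opened with `gzip.open(path, "rt")`) instead of the sample list, so the whole file never has to sit in memory.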
It is important to point out that the 1% level of variation is simply a count of all currently known positions where variation can occur. These data are compiled from many individuals. On a per-individual basis, the variation frequency is much lower: each individual has between three and four million differences when their genome is compared to other people’s genomes.
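As a rough check on those two figures, the arithmetic can be sketched as follows. The genome length of about 3.1 billion base pairs is an assumption I am adding here (an approximate haploid human genome size), and the calculation treats every variant as a single position, which ignores multi-base variants.

```python
# Assumption: approximate haploid human genome length in base pairs.
GENOME_SIZE_BP = 3.1e9

# Build 132 VCF record count quoted earlier in this post.
KNOWN_VARIANTS = 28_826_054

# Per-individual range quoted above: three to four million differences.
PER_PERSON_LOW = 3_000_000
PER_PERSON_HIGH = 4_000_000


def percent_of_genome(n_positions):
    """Express a count of variable positions as a percentage of the genome."""
    return 100.0 * n_positions / GENOME_SIZE_BP


database_pct = percent_of_genome(KNOWN_VARIANTS)
person_low_pct = percent_of_genome(PER_PERSON_LOW)
person_high_pct = percent_of_genome(PER_PERSON_HIGH)

print(f"all known variants: {database_pct:.2f}% of positions")
print(f"one individual:     {person_low_pct:.2f}% to {person_high_pct:.2f}%")
```

Under these assumptions the database-wide count comes out just under 1% of positions, while a single person’s differences cover roughly a tenth of that.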
Comparing the original publication with current dbSNP reports makes it clear that the database has grown significantly, is much richer, and keeps changing in ways that cannot be easily communicated through traditional publication methods. The challenge going forward is developing software that can adapt to these changes, evaluate resources, and use their content in effective ways.
Next time I’ll discuss the annotations.
References and notes:
1. Sherry ST, Ward M, & Sirotkin K (1999). dbSNP-database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome research, 9 (8), 677-9 PMID: 10447503
2. 1000 Genomes Project (1KG) is an endeavor to characterize DNA sequence variation in humans. Visit the website at www.1000genomes.org or read the paper: 1000 Genomes Consortium. (2010) A map of human genome variation from population-scale sequencing. Nature, 467, 1061–1073.
3. PAR - PseudoAutosomal Regions - are located at the ends of the X and Y chromosomes. The DNA in these regions can recombine between the sex chromosomes, hence PAR genes demonstrate autosomal inheritance patterns rather than sex-linked inheritance patterns. You can read more at http://en.wikipedia.org/wiki/Pseudoautosomal_region