Previously, I discussed how the number of biological information repositories is growing each year. Nucleic Acids Research now tracks over 1300 databases that contain specialized subsets of DNA, RNA, protein sequences, and other kinds of biochemical data along with annotations (metadata) that can be mined or searched to aid our scientific research.
Most of these resources have descriptions and publications describing high-level details about their mission with respect to what kinds of data are stored and how the resource can be used. While useful, these descriptions are typically out of date, because, like everything else in information science, each repository is undergoing significant growth in terms of data stored and the data’s annotations. As we contemplate how to use these resources in integrated analyses we need methods to summarize their content and extract data in global ways.
As an example, let’s consider dbSNP. According to dbSNP’s build history, the first release was on Dec 1, 1998. Build 2, nine days later, had 11 new SNPs added from the debnick’s 981209.dat file. Now, dbSNP is 12 years old and is at build 132. The term build is a software way of saying version.
Several methods can be used to access dbSNP data. These include multiple web interfaces at NCBI and flat files that hold entire datasets in XML, SQL table, and VCF formats. When first published in 1999 [1], dbSNP contained 4713 variants and few annotations. As of build 132, the human-specific database contains either 30,443,455 (release notes) or 28,826,054 variants (VCF file). Why the discrepancy? Share your idea with a comment.
What else can we learn about human DNA variation from dbSNP’s VCF file?
The entire collection of human variants - minus 1,617,401 - can be obtained in a single VCF file. VCF stands for Variant Call Format; it is a standard created by the 1000 Genomes project [2] to list and annotate genotypes. The vCard format also uses the “vcf” file extension. Thus, genomic vcf files have cute business card icons on Macs and Windows. Bioinformatics is fun that way.
The dbSNP VCF file is a convenient way to get a global view of the database. Variants are listed by chromosome and position. Each position also lists the NCBI rs ID, reference, and variant sequence(s). Chromosomes include the 22 autosomes, X, Y, MT (mitochondria) and PAR (Pseudoautosomal Regions [3]). The last column (INFO) is the most interesting. It can contain up to 49 specific annotations that describe a variant’s type, its origin, biological features, population characteristics, and whether it is linked to other resources. Many of the annotations are tags, but some contain additional information. Between the different tags and information within tags, there are over 50 ways to describe a variant. One of the most obfuscated annotations is a 12-byte bitfield, which must be decoded to be understood. No worry, much of its information appears to be repeated in the readable annotations. Others have also noted that the bitfield is overly clever.
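To make the INFO column concrete, here is a minimal sketch of how one data line might be split into its fields and its INFO entries separated into bare tags and key=value pairs. The function name parse_info, the rs ID, and the sample line are made up for illustration; the annotation names are the ones mentioned in this post and should be checked against the header of the actual file.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict.

    Bare tags map to True; key=value entries map to the value string.
    """
    annotations = {}
    if info in ("", "."):
        return annotations
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        annotations[key] = value if sep else True
    return annotations


# Hypothetical data line in the 8-column layout:
# CHROM POS ID REF ALT QUAL FILTER INFO
sample = "1\t10327\trs0000001\tT\tC\t.\t.\tdbSNPBuildNumber=132;KGPilot1;KGPilot123"
chrom, pos, rsid, ref, alt, qual, filt, info = sample.split("\t")
tags = parse_info(info)
print(rsid, tags)
```

Keeping tags and key=value entries in one dictionary makes it easy to ask both "does this variant carry tag X?" and "what value does annotation Y have?".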
Digging deeper
We can use the VCF file to learn a great deal about dbSNP and the biology of human variation. But you cannot do this by reading the file. It is nearly 30 million lines long! You’re also not going to be able to analyze these data in Excel, so you need to put the data in some kind of database or binary file format. I used HDF5 and PyTables to put the data into essentially a 30MM row by 50+ column table that can be efficiently queried using simple Python scripts.
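The sketch below is not the HDF5/PyTables code used for this analysis. It shows the same idea (stream the file once, store one row per variant, then query the table) with Python's built-in sqlite3 module as a stand-in, so that it runs without extra packages. The table layout, the function name load_vcf, and the small in-memory sample are assumptions made for illustration.

```python
import io
import sqlite3


def load_vcf(lines, conn):
    """Stream VCF lines into a 'variants' table, one row per data line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variants "
        "(chrom TEXT, pos INTEGER, rsid TEXT, ref TEXT, alt TEXT, info TEXT)"
    )
    # Skip header lines (they start with '#') and keep six of the eight columns.
    rows = (
        (f[0], int(f[1]), f[2], f[3], f[4], f[7])
        for f in (line.rstrip("\n").split("\t") for line in lines)
        if len(f) >= 8 and not f[0].startswith("#")
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


# Hypothetical miniature VCF file standing in for the real one.
sample = io.StringIO(
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\trs0000001\tA\tG\t.\t.\tdbSNPBuildNumber=132\n"
    "1\t200\trs0000002\tC\tT,G\t.\t.\tdbSNPBuildNumber=129\n"
    "X\t300\trs0000003\tG\tGA\t.\t.\tdbSNPBuildNumber=132\n"
)
conn = sqlite3.connect(":memory:")
load_vcf(sample, conn)
per_chrom = conn.execute(
    "SELECT chrom, COUNT(*) FROM variants GROUP BY chrom ORDER BY chrom"
).fetchall()
print(per_chrom)
```

For a real compressed file, the StringIO object could be replaced with a handle from gzip.open(path, "rt"), which yields text lines in the same way.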
So, what can we learn? We can use the dbSNPBuildNumber annotation to observe dbSNP’s growth. Cumulative growth shows that the recent builds have contributed the majority of variants. In particular, the 1000 Genomes project’s contributions account for close to 85% of the variants in dbSNP[2]. Examining the data by build number also shows that a build can consist of relatively few variants. There are even 369 variants from a future 133 build.
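The per-build and cumulative counts described above can be sketched as follows. This is an illustration under assumptions, not the original analysis code: the helper build_of and the four INFO strings are hypothetical, and the key name is taken from the text, so check the file header for its exact spelling.

```python
from collections import Counter
from itertools import accumulate


def build_of(info, key="dbSNPBuildNumber"):
    """Return the build number stored under `key` in an INFO string, or None."""
    for entry in info.split(";"):
        name, sep, value = entry.partition("=")
        if sep and name == key:
            return int(value)
    return None


# Hypothetical INFO strings standing in for the last column of the VCF file.
infos = [
    "dbSNPBuildNumber=129;KGPilot1",
    "dbSNPBuildNumber=132;KGPilot123",
    "dbSNPBuildNumber=132",
    "dbSNPBuildNumber=86",
]
per_build = Counter(build_of(info) for info in infos)
builds = sorted(per_build)
cumulative = list(accumulate(per_build[b] for b in builds))
for build, total in zip(builds, cumulative):
    print(build, per_build[build], total)
```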
In dbSNP, variants are described in four ways: SNP, INDEL, MIXED, and MULTI-BASE. The vast majority (82%) are SNPs, with INDELs (INsertions and DELetions) forming a large second class. When MIXED and MULTI-BASE variants are examined, they do not look any different from INDELs, so I am not clear on the purpose of this tag. Perhaps it is there for bioinformatics enjoyment, because you have to find these terms in the 30MM lines; they are not listed in the VCF header. It was also interesting to learn that the longest indels are on the order of 1000 bases.
The VCF file contains both reference and variant sequences at each position. The variant sequence can list multiple bases, or sequences, as each position can have multiple alleles. While nearly all variants are a single allele, about 3% have between two and 11 alleles. I assume the variants with the highest numbers of alleles represent complex indel patterns.
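Because multiple variant sequences at a position are separated by commas, the number of alleles can be counted directly from the variant (ALT) column. The following is a minimal sketch; the function name allele_count and the three ALT values are made up for illustration.

```python
from collections import Counter


def allele_count(alt):
    """Number of variant alleles in a comma-separated VCF ALT field."""
    return 0 if alt in ("", ".") else alt.count(",") + 1


# Hypothetical ALT values: a SNP, a two-allele site, and a three-allele indel site.
alts = ["G", "T,G", "GA,GAA,GAAA"]
distribution = Counter(allele_count(alt) for alt in alts)
print(sorted(distribution.items()))
```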
Given that the haploid human genome contains approximately 3.1 billion bases, the variants in dbSNP represent nearly 1% of the available positions. Their distribution across the 22 autosomes, defined as the number of variant positions divided by the chromosome length, is fairly constant at around 1%. The X and Y chromosomes and their PARs are, however, much lower, with ratios of 0.45%, 0.12%, and 0.08%, respectively. A likely reason for this observation is that sampling of sex chromosomes in current sequencing projects is low relative to autosomes. The number of mitochondrial variants, as a fraction of genome length, is, on the other hand, much higher at 3.95%.
It is important to point out that the 1% level of variation is simply a count of all currently known positions where variation can occur in individuals. These data are compiled from many individuals. On a per-individual basis, the variation frequency is much lower, with each individual having between three and four million differences when their genome is compared to other people's genomes.
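The arithmetic behind these percentages is easy to check. The short sketch below uses only the counts quoted in this post and the approximate genome size of 3.1 billion bases; the variable names are made up for illustration.

```python
release_notes_count = 30_443_455  # build 132, release notes
vcf_count = 28_826_054            # build 132, VCF file
genome_length = 3.1e9             # approximate haploid genome size used in the text

# Variants in the release notes but absent from the VCF file.
missing_from_vcf = release_notes_count - vcf_count

# Fraction of genome positions with a known variant in the VCF file.
database_density = vcf_count / genome_length

# Fraction of positions that differ in one individual (three to four million).
individual_low = 3e6 / genome_length
individual_high = 4e6 / genome_length

print(missing_from_vcf)
print(f"{database_density:.2%}")
print(f"{individual_low:.2%} to {individual_high:.2%}")
```

The database-wide figure comes out just under 1%, while a single genome differs at roughly a tenth of that.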
Clearly, from the original publications and current dbSNP reports, the database has grown significantly and is much richer and undergoing changes in ways that cannot be easily communicated through traditional publication methods. The challenge going forward is developing software that can adapt to these changes to evaluate resources and use their content in effective ways.
Next time I’ll discuss the annotations.
References and notes:
1. Sherry ST, Ward M, & Sirotkin K (1999). dbSNP-database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome research, 9 (8), 677-9 PMID: 10447503
2. 1000 Genomes Project (1KG) is an endeavor to characterize DNA sequence variation in humans. Visit the website at www.1000genomes.org or read the paper: 1000 Genomes Consortium. (2010) A map of human genome variation from population-scale sequencing. Nature, 467, 1061–1073.
3. PAR - PseudoAutosomal Regions - are located at the ends of the X and Y chromosome. The DNA in these regions can recombine between the sex chromosomes, hence PAR genes demonstrate autosomal inheritance patterns, rather than sex inheritance patterns. You can read more at http://en.wikipedia.org/wiki/Pseudoautosomal_region
3 comments:
You know if you bgzip the vcf file and index it using tabix (see samtools at sourceforge for both) you get a compact binary indexed version of the file which you can query very easily without having to change formats
Thanks for the tip, bagpuss. I'm honored to be visited by the famous British cat. While tabix and bgzip are nice tools for compressing and accessing specific rows (based on genomic positions) of data from vcf, gff, bed and a few other tab-delimited bioinformatics files, they do not address the larger issues that come with mining data from these files.
For example, the NCBI VCF file only follows the VCF standard to a small degree. The annotation data are under the INFO column, but these data have their own structure with ";" delimiters. To count the number of variants with annotations, or the number of variants having a given annotation, these tags must be parsed and represented as a table from which columns can be queried. Similarly, getting a distribution of the number of alleles at each position requires that a program count the commas that separate the sequences and, if you want to persist this information, it needs to be added to a table in a column that does not exist in the VCF file.
I used HDF5 and pytables, because I'm familiar with the interface, and the system is high performing and powerful. The entire analysis was done with less than 500 lines of really simple python code like:
# Count the variants that carry each 1000 Genomes pilot tag
print("KGPilot1\t", annottable.col('KGPilot1').sum())
print("KGPilot123\t", annottable.col('KGPilot123').sum())
The code operated on a data file that was only 20% larger (non-optimized) than the same file compressed and indexed with bgzip and tabix. So, I got similar efficiency while increasing my options for querying the data.
Why the discrepancy in SNP count between the VCF file and the DB? I guess it is because of the constraints on the VCF file, as explained in the 000README file:
The following are omitted from 00-All.vcf.gz
SNPs listed as microsatellites or named variations
SNPs with multibyte alleles and unknown (N) adjacent base pairs
SNPs that are not mapped on the reference genome (GRCh37)