Thursday, March 3, 2011

The Flavors of SNPs

In Blink, Malcolm Gladwell discusses how experts do expert things. Essentially, they develop granular languages to describe the characteristics of items or experiences. Food tasters, for example, use a large and rich vocabulary with scores to describe a food’s aroma, texture, color, taste, and other attributes.

We characterize DNA variation in a similar way

In a previous post, I presented a high-level analysis of dbSNP, NCBI’s catalog of human variation. The analysis used the VCF (variant call format) file that holds the entire collection of variants and their many annotations. From these data we learned about dbSNP’s growth history, the distribution of variants by chromosome, and additional details such as the number of alleles recorded for a variant or its structure. We can further classify dbSNP’s variants, like flavors of coffee, by examining the annotation tags that accompany each one.

So, how do they taste?

Each variant in dbSNP can be accompanied by one or more annotations (tags) that define particular attributes, or things we know about a variant. Tags are listed at the top of the VCF file in lines that begin with “##INFO.” There are 49 such lines. The information about each tag (its name, type, and a description) is included between angle (<>) brackets. Tag names are short alphanumeric codes. Most are simple flags, but some include numeric (integer or float) values.
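
As a rough illustration of how these header lines could be read, here is a minimal Python sketch. It assumes each line follows the usual VCF layout of ##INFO=<ID=...,Number=...,Type=...,Description="...">. The function name parse_info_header and the two sample lines are made up for illustration and are not copied from the dbSNP file.

```python
import re

# Assumed layout of one header line (typical for VCF files):
# ##INFO=<ID=TAG,Number=N,Type=Flag|Integer|Float|String,Description="text">
INFO_RE = re.compile(
    r'^##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<description>.*)"'
)


def parse_info_header(lines):
    """Return {tag name: {"number", "type", "description"}} for ##INFO lines."""
    tags = {}
    for line in lines:
        if not line.startswith("##INFO"):
            continue  # skip other header lines and variant records
        match = INFO_RE.match(line.strip())
        if match:
            fields = match.groupdict()
            tags[fields.pop("id")] = fields
    return tags


# Hypothetical sample header, for illustration only.
sample_header = [
    '##fileformat=VCFv4.0',
    '##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID">',
    '##INFO=<ID=INT,Number=0,Type=Flag,Description="In intron">',
]
tags = parse_info_header(sample_header)
flags = sorted(name for name, info in tags.items() if info["type"] == "Flag")
```

Counting the entries in the returned dictionary gives the number of tags defined in the file, and filtering on the type field separates simple flags from tags that carry values.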

Tags can also be grouped (arbitrarily) into categories to further understand the current state of knowledge about dbSNP. In this analysis I organized 42 of the 49 tags into six categories: clinical value, link outs, gene structure, bioinformatics issues, population biology (pop. bio.), and 1000 genomes. The seven excluded tags describe housekeeping issues (dbSNPBuildID), structural features (RS, VC), bitfields that are not human readable (VP), or fields that do not seem to be used or have the same value for every variant (NS [not used], AF [always 0], WGT [always 1]).

By exploring the remaining 42 tags we can assess our current understanding of human variation. For example, approximately 10% of the variants in the database lack tags. For these, we only know that they have been found in some experiment somewhere. The most common number of tags for a variant is two, and a small number of variants (148,992) have more than ten tags. 88% of variants in the database have between one and ten tags. Put another way, one could say that 40% of the variants are poorly characterized, having between zero and two tags; 42% are moderately characterized, having between three and six tags; and 16% are well characterized, having seven or more tags.
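
A tally like this can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: it treats the eighth tab-separated column of each record as the INFO field (the standard VCF layout), skips the seven excluded tags named above, and uses hypothetical helper names (tag_names, tags_per_variant, share) and invented sample records.

```python
from collections import Counter

# The seven tags left out of the analysis, as listed in the post.
EXCLUDED = {"dbSNPBuildID", "RS", "VC", "VP", "NS", "AF", "WGT"}


def tag_names(info_field):
    """Return the set of analyzed tag names in one INFO column value."""
    if info_field in ("", "."):
        return set()
    names = {entry.split("=", 1)[0] for entry in info_field.split(";") if entry}
    return names - EXCLUDED


def tags_per_variant(vcf_lines):
    """Histogram mapping a tag count to the number of variants with that count."""
    histogram = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 8:
            continue  # not a complete variant record
        histogram[len(tag_names(columns[7]))] += 1
    return histogram


def share(histogram, low, high):
    """Fraction of variants whose tag count lies in [low, high]."""
    total = sum(histogram.values())
    hits = sum(n for count, n in histogram.items() if low <= count <= high)
    return hits / total if total else 0.0


# Invented records, for illustration only.
sample = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t.\t.\tRS=1;dbSNPBuildID=100;VC=SNP;WGT=1",
    "1\t200\trs2\tC\tT\t.\t.\tRS=2;VC=SNP;INT;GNO",
    "1\t300\trs3\tG\tA\t.\t.\tRS=3;VC=SNP;INT;VLD",
]
histogram = tags_per_variant(sample)
```

Calling share(histogram, 0, 2) on the full file would give the kind of "poorly characterized" fraction quoted above. Reading the file line by line keeps memory use flat, which matters for a file with tens of millions of records.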

We can add more flavor

We can also count the numbers of variants that fall into categories of tags. A very small number, 135,473 (0.5%), have tags that describe possible clinical value. Clinical value tags are applied to variants that are known diagnostic markers (CLN, CDA), have clinical PubMed citations (PM), exist in the PharmGKB database (TPA), are cited as mutations from a reputable source (MUT), exist in locus specific databases (LSD), have an attribution from a genome wide association study (MTP), or exist in the OMIM/OMIA database (OM). Interestingly, 20,544 variants are considered clinical (CLN), but none are tagged as being “interrogated in a clinical diagnostic assay” (CDA). Also, while I’ve grouped 0.5% of the variants in the clinical value category, the number with distinct clinical evidence is likely lower, because the tag definitions contain obvious overlaps with other tags in this category.

2.3 million variants are used as markers on high-density genotyping kits (HD); from the VCF file we don’t know whose kits, just that they're used. The above graph places them in this category, but they are not counted with the clinical value tags, because we only know they are a marker on some kit somewhere. Many variants (37%) contain links to other resources such as 3D structures (S3D), a PubMed Central article (PMC), or somewhere else that is not documented in the VCF file (SLO). Additional details about the kits and links can likely be found in individual dbSNP records.

Variants for which something is known about how they map within or near genes account for 42% of the database. Gene structure tags identify variants that result in biochemical consequences such as frameshift (NSF), missense (amino acid change, NSM), nonsense (conversion to stop codon, NSN), or splice donor and acceptor sequence (DSS, ASS) changes. Coding region variations that do not result in amino acid changes or truncated proteins are also identified (SYN, REF). In addition to coding regions, variants can also be found in 5’ and 3’ untranslated regions (U5, U3), proximal and distal to genes (R5, R3), and within introns (INT). Not surprisingly, nearly 37% of the variants map to introns.

Bioinformatics issues form an interesting category of tags. These include tags that describe a variant’s mapping to different assemblies of the human genome (OTH, CFL, ASP), whether a variant is a unique contig allele (NOC), whether variants have more than two alleles from different submissions (NOV), or if a variant has a genotype conflict (GCF); 22% of the variants are tagged with one or more of these issues. Bioinformatics issues speak to the fact that the reference sequence is much more complicated than we like to think.

Because my tag grouping is arbitrary, I’ve also included validated variants (VLD) as a bioinformatics issue. In this case it's a positive issue, because 30% of dbSNP’s variants have been validated in some way. Of course, a more pessimistic view would note that 70% are not validated. Indeed, from various communications, dbSNP may have a false positive rate that is on the order of 10%.

The last two categories organize tags by population biology and variants contributed by the 1000 Genomes Project. Population biology tags G5A and G5 describe 5% minor allele frequencies computed by different methods. GNO variants are those that have a genotype available. Over 24 million (85%) variants have been delivered through the 1000 Genomes Project's two pilot phases (KGPilot1, KGPilot123). Of these, approximately 4 million have been genotyped (PH1). The 1000 genomes category also includes the greatest number of unused tags as this, like everything else we are doing in genomics today, is a work in progress.
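
The category counts discussed in this section could be reproduced with something like the following Python sketch. The mapping below contains only the tags named in the text, so it covers fewer than the 42 tags used in the analysis, and it leaves out HD because its placement is not clear from the text. The function names and sample INFO strings are hypothetical.

```python
from collections import Counter

# Partial tag-to-category mapping, limited to tags named in the post.
CATEGORIES = {
    "clinical value": {"CLN", "CDA", "PM", "TPA", "MUT", "LSD", "MTP", "OM"},
    "link outs": {"S3D", "PMC", "SLO"},
    "gene structure": {"NSF", "NSM", "NSN", "DSS", "ASS", "SYN", "REF",
                       "U5", "U3", "R5", "R3", "INT"},
    "bioinformatics issues": {"OTH", "CFL", "ASP", "NOC", "NOV", "GCF", "VLD"},
    "pop. bio.": {"G5A", "G5", "GNO"},
    "1000 genomes": {"KGPilot1", "KGPilot123", "PH1"},
}


def categories_for(info_field):
    """Return the categories that have at least one tag in this INFO value."""
    names = {entry.split("=", 1)[0] for entry in info_field.split(";") if entry}
    return {category for category, tags in CATEGORIES.items() if names & tags}


def count_by_category(info_fields):
    """Count variants per category; a variant counts once per category."""
    counts = Counter()
    for info in info_fields:
        counts.update(categories_for(info))
    return counts


# Invented INFO values, for illustration only.
sample_infos = [
    "RS=1;VC=SNP;CLN;OM;INT",
    "RS=2;VC=SNP;INT;KGPilot1",
    "RS=3;VC=SNP",
]
counts = count_by_category(sample_infos)
```

Because each variant is counted once per category no matter how many of that category's tags it carries, overlapping tags such as CLN and OM on the same variant do not inflate the totals.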

Why is this important?

When next-generation sequencing is discussed, it is often accompanied by a conversation about the bioinformatics bottleneck, which deals with the computation phase of data analysis where raw sequences are reduced to quantitative values or lists of variation. Some also discuss a growing information bottleneck that refers to the interpretation of the data. That is, what do those patterns of variation or lists of differentially expressed genes mean? Developing gene models of health and disease and other biological insights requires that assay data be integrated with other forms of existing information. However, this information is held in a growing number of specialized databases and literature that are scattered and changing rapidly. Integrating such information into systems, beyond simple data links, will require deep characterization of each resource to put assay data into a biological context.

The work described in this and the previous post provides a short example of the kind of characterization that will need to be done with many different databases to make them more useful for developing biological models that link genotype to phenotype.

Further Reading

Sherry ST, Ward M, & Sirotkin K (1999). dbSNP-database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome research, 9 (8), 677-9 PMID: 10447503

1000 Genomes Project Consortium, Durbin RM, Abecasis GR, Altshuler DL, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, & McVean GA (2010). A map of human genome variation from population-scale sequencing. Nature, 467 (7319), 1061-73 PMID: 20981092
