Wednesday, March 23, 2011

Translational Bioinformatics

During the week of March 7, I had the pleasure of attending AMIA's (American Medical Informatics Association) Summit on Translational Bioinformatics (TBI) at the Parc 55 Hotel in San Francisco.

What is Translational Bioinformatics? 
Translational Bioinformatics can be simply defined as computer-related activities designed to extract clinically actionable information from very large datasets. The field has grown from the need to develop computational methods that can keep pace with continually increasing amounts of data and an ever-expanding universe of databases.

As we celebrate the 10th anniversary of completing the draft sequence of the human genome [1,2], we are often reminded that this achievement would transform medicine. The genome would be used to develop a comprehensive catalog of all genes and, through this catalog, we would be able to identify all disease genes and develop new therapies to combat disease. However, since the initial sequence, we have also witnessed an annual decrease in new drugs entering the marketplace. While progress is being made, it's just not moving at a speed consistent with the excitement produced by the first and “final” drafts [3,4] of the human genome.

What happened?
Biology is complex. Through follow-on endeavors, especially with the advent of massively parallel low-cost sequencing, we’ve begun to expose the enormous complexity and diversity of the nearly seven billion genomes that comprise the human species. Additionally, we’ve begun to examine the hundred trillion, or so, microbial genomes that make up a human “ecosystem.” A theme that has become starkly evident is that our health and wellness are controlled by our genes and by how those genes are modified and expressed in response to environmental factors. Or, as described in one slide,

Conferences, like the Joint Summits on Translational Bioinformatics and Clinical Research Informatics, create a forum for individuals working on clinically related computation, bioinformatics and medical informatics problems to come together and share ideas. This year’s meeting was the fourth annual. The TBI meeting had four tracks: 1) Concepts, Tools and Techniques to Enable Integrative Translational Bioinformatics, 2) Integrative Analysis of Molecular and Clinical Measurements, 3) Representing and Relating Phenotypes and Disease for Translational Bioinformatics, and 4) Bridging Basic Science Discoveries and Clinical Practice. To simplify these descriptions, I’d characterize the attendees as participating in five kinds of activities:
  • Creating new statistical models to analyze data
  • Integrating diverse data to create new knowledge 
  • Promoting ontologies and standards 
  • Developing social media infrastructures and applications 
  • Using data to perform clinical research and practice
As translational bioinformatics is about analyzing and mining data to develop clinical knowledge and applications, conference attendees represent a broad collection of informatics domains. Statisticians reduce large amounts of raw data into results that need to be compared and integrated with diverse information to functionally characterize biology. This activity benefits from standardized data descriptions and clinical phenotypes that are organized into ontologies. Because this endeavor requires large teams of clinical researchers, statisticians, and bioinformaticians, social media plays a significant role in finding collaborators and facilitating communication and data sharing between members of large de-centralized projects. Finally, it’s not translational unless there is clinical utility.

What did we learn?
The importance of translational bioinformatics is growing. This year’s summit had about 470 attendees, with nearly 300 attending TBI, a 34% increase over 2010 attendance. In addition to many talks and posters on databases, ontologies, statistical methods, and clinical associations of genes and genotypes, we were entertained by Carl Zimmer’s keynote on the human microbiome.

In his presentation, Zimmer made the case that we need to study human beings as ecosystems. After all, our bodies contain between 10 and 100 times more microbial cells than human cells. Zimmer showed how our microbiome becomes populated, from an initially sterile state, through the same niche and succession paradigms that have been demonstrated in other ecosystems. While microbial associations with disease are clear, it is important to note that our microbiome protects us from disease and performs a large number of essential biochemical reactions. In reality, our microbiome serves as an additional organ, or two. Thus, to really understand human health, we need to understand the microbiome. In terms of completing the human ecosystem, we have only peeked at the tip of a very large iceberg, one that only gets bigger when we consider bacteriophages and viruses.

Russ Altman closed the TBI portion of the joint summit with his, now annual, year in review. The goals of this presentation were to highlight major trends and advances of the past year, focus on what seems to be important now, and predict what might be accomplished in the coming year. From ~63 papers, 25 were selected for presentation. These were organized into Personal Genomics, Drugs and Genes, Infrastructure for TB, Sequencing and Science, and Warnings and Hope. You can check out the slides to read the full story. 

My takeaway is that we’ve clearly initiated personal genomics, with both clinical and do-it-yourself perspectives. Semantic standards will improve computational capabilities, but we should not hesitate to mine and use data from medical records, participant-driven studies, and previously collected datasets in our association studies. Pharmacogenetics/genomics will change drug treatments from one-size-fits-all-benefits-few approaches to specific drugs for stratified populations, and multi-drug therapies will become the norm. Deep sequencing continues to reveal deep complexity in our genome, cancer genomes, and the microbiome.

Altman closed with his 2011 predictions. He predicted that consumer sequencing (vs. genotyping) will emerge, cloud computing will contribute to a major biomedical discovery, informatics application to stem cell science will emerge, important discoveries will be made from text mining along with population-based data mining, systems modeling will suggest useful polypharmacy, and immune genomics will emerge as powerful data.

I expect many of these predictions will come true, as our [Geospiza's] research and development and customer engagements focus on many of the above targets.


1. Nature, Vol. 409, No. 6822, pp. 745-964
2. Science, Vol. 291, No. 5507, pp. 1145-1434
3. Nature, Vol. 422, No. 6934, pp. 835-847
4. Science, Vol. 300, No. 5617, pp. 286-290


Thursday, March 10, 2011

Sneak Peek: The Next Generation Challenge: Developing Clinical Insights Through Data Integration

Next week (March 14-18, 2011) is CHI's X-Gen Congress & Expo. I'll be there presenting a poster on the next challenge in bioinformatics, also known as the information bottleneck.

You can follow the tweet by tweet action via @finchtalk or #XGenCongress.

In the meantime, enjoy the poster abstract.

The next generation challenge: developing clinical insights through data integration

Next generation DNA sequencing (NGS) technologies hold great promise as tools for building a new understanding of health and disease. In the case of understanding cancer, deep sequencing provides more sensitive ways to detect the germline and somatic mutations that cause different types of cancer, as well as to identify new mutations within small subpopulations of tumor cells that can be prognostic indicators of tumor growth or drug resistance. Intense vendor competition amongst NGS platform and service providers is commoditizing data collection costs, making data more accessible. However, the single greatest impediment to developing relevant clinical information from these data is the lack of systems that create easy access to the immense bioinformatics and IT infrastructures needed for researchers to work with the data.

In the case of variant analysis, such systems will need to process very large datasets, and accurately predict common, rare, and de novo levels of variation. Genetic variation must be presented in an annotation-rich, biological context to determine the clinical utility, frequency, and putative biological impact. Software systems used for this work must integrate data from many samples together with resources ranging from core analysis algorithms to application specific datasets to annotations, all woven into computational systems with interactive user interfaces (UIs). Such end-to-end systems currently do not exist, but the parts are emerging. 

Geospiza is improving how researchers understand their data in terms of its biological context, function, and potential clinical utility by developing methods that combine assay results from many samples with existing data and information resources from dbSNP, 1000 Genomes, cancer genome databases, GEO, SRA, and others. Through this work, and follow-on product development, we will produce integrated, sensitive assay systems that harness NGS for identifying very low (1:1000) levels of changes between DNA sequences to detect cancerous mutations, emerging drug resistance, and early-stage signaling cascades.

Authors: Todd M. Smith(1), Christopher Mason(2) 
(1). Geospiza Inc. Seattle WA 98119, USA. 
(2). Weill Cornell Medical College, NY NY 10021, USA

Thursday, March 3, 2011

The Flavors of SNPs

In Blink, Malcolm Gladwell discusses how experts do expert things. Essentially they develop granular languages to describe the characteristics of items, or experiences. Food tasters, for example, use a large and rich vocabulary with scores to describe a food’s aroma, texture, color, taste, and other attributes.

We characterize DNA variation in a similar way

In a previous post, I presented a high level analysis of dbSNP, NCBI’s catalog of human variation. The analysis utilized the VCF (variant call format) file that holds the entire collection of variants and their many annotations. From these data we learned about dbSNP’s growth history, the distribution of variants by chromosome, and additional details such as the numbers of alleles that are recorded for a variant or its structure. We can further classify dbSNP’s variants, like flavors of coffee, by examining the annotation tags that accompany each one.

So, how do they taste?

Each variant in dbSNP can be accompanied by one or more annotations (tags) that define particular attributes, or things we know about a variant. Tags are listed at the top of the VCF file in lines that begin with “##INFO.” There are 49 such lines. The information about each tag, its name, type, and a description, is included between angle (<>) brackets. Tag names are short alphanumeric codes. Most tags are simple flags, but some include numeric (integer or float) values.
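To make this concrete, here is a minimal sketch of how those ##INFO lines can be parsed to inventory the tags. The two header lines below are a tiny made-up sample standing in for the real dbSNP VCF file, and the regular expression assumes the standard VCF meta-line layout (ID, Number, Type, Description).

```python
import io
import re

# Sample meta-information lines; in practice, read the real dbSNP VCF
# file here instead of this in-memory buffer.
sample_header = io.StringIO(
    '##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant is Precious (Clinical, PubMed cited)">\n'
    '##INFO=<ID=GMAF,Number=1,Type=Float,Description="Global Minor Allele Frequency">\n'
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
)

# Capture the tag's name, type, and free-text description.
info_re = re.compile(r'##INFO=<ID=([^,]+),Number=[^,]*,Type=([^,]+),Description="([^"]*)">')

tags = {}
for line in sample_header:
    if not line.startswith("##"):
        break  # past the meta-information section
    m = info_re.match(line)
    if m:
        name, vcf_type, description = m.groups()
        tags[name] = (vcf_type, description)

for name, (vcf_type, desc) in sorted(tags.items()):
    print(f"{name}\t{vcf_type}\t{desc}")
```

Running this over the full file yields the 49 tag definitions described above.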

Tags can also be grouped (arbitrarily) into categories to further understand the current state of knowledge about dbSNP. In this analysis I organized 42 of the 49 tags into six categories: clinical value, link outs, gene structure, bioinformatics issues, population biology (pop. bio.), and 1000 genomes. The seven excluded tags either described housekeeping details (dbSNPBuildID), structural features (RS, VC), non-human-readable bitfields (VP), or fields that do not seem to be used or that have the same value for every variant (NS [not used], AF [always 0], WGT [always 1]).

By exploring the remaining 42 tags we can assess our current understanding of human variation. For example, approximately 10% of the variants in the database lack tags. For these, we only know that they have been found in some experiment somewhere. The most common number of tags for a variant is two, and a small number of variants (148,992) have more than ten tags; 88% of variants in the database have between one and ten tags. Put another way, one could say that 40% of the variants are poorly characterized (zero to two tags), 42% are moderately characterized (three to six tags), and 16% are well characterized (seven or more tags).
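The per-variant tag counts above come down to splitting each data line's semicolon-separated INFO column. A simplified sketch, using three toy records rather than real dbSNP entries (and, for brevity, counting every INFO key rather than excluding the seven housekeeping tags as the actual analysis did):

```python
import collections
import io

# Toy VCF records; the INFO column (field 8) holds semicolon-separated tags.
sample_vcf = io.StringIO(
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t10327\trs0000001\tT\tC\t.\t.\tRSPOS=10327;VC=SNV;GNO\n"
    "1\t10433\trs0000002\tA\tAC\t.\t.\tVC=DIV\n"
    "1\t10439\trs0000003\tAC\tA\t.\t.\tVC=DIV;SLO;GNO;VLD\n"
)

# Histogram: number of tags -> number of variants with that many tags.
tag_count_histogram = collections.Counter()
for line in sample_vcf:
    if line.startswith("#"):
        continue  # skip meta-information and column-header lines
    info = line.rstrip("\n").split("\t")[7]
    n_tags = 0 if info == "." else len(info.split(";"))
    tag_count_histogram[n_tags] += 1

print(dict(tag_count_histogram))  # e.g. {3: 1, 1: 1, 4: 1} for the toy records
```

Applied to the full file, this histogram is what underlies the 40%/42%/16% characterization breakdown.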

We can add more flavor

We can also count the numbers of variants that fall into categories of tags. A very small number, 135,473 (0.5%), have tags that describe possible clinical value. Clinical value tags are applied to variants that are known diagnostic markers (CLN, CDA), have clinical PubMed citations (PM), exist in the PharmGKB database (TPA), are cited as mutations from a reputable source (MUT), exist in locus-specific databases (LSD), have an attribution from a genome wide association study (MTP), or exist in the OMIM/OMIA database (OM). Interestingly, 20,544 variants are considered clinical (CLN), but none are tagged as being “interrogated in a clinical diagnostic assay” (CDA). Also, while I’ve grouped 0.5% of the variants into the clinical value category, the actual number is likely lower, because the tag definitions contain obvious overlaps with other tags in this category.

2.3 million variants are used as markers on high-density genotyping kits (HD); from the VCF file we don’t know whose kits, just that they're used. The above graph places them in this category, but they are not counted with the clinical value tags, because we only know they are a marker on some kit somewhere. Many variants (37%) contain links to other resources, such as 3D structures (S3D), a PubMed Central article (PMC), or somewhere else that is not documented in the VCF file (SLO). Additional details about the kits and links can likely be found in individual dbSNP records.

Variants for which something is known about how they map within or near genes account for 42% of the database. Gene structure tags identify variants that result in biochemical consequences such as frameshift (NSF), missense (amino acid change, NSM), nonsense (conversion to a stop codon, NSN), or splice donor and acceptor sequence (DSS, ASS) changes. Coding-region variations that do not result in amino acid changes or truncated proteins are also identified (SYN, REF). In addition to coding regions, variants can also be found in 5’ and 3’ untranslated regions (U5, U3), proximal and distal to genes (R5, R3), and within introns (INT). Not surprisingly, nearly 37% of the variants map to introns.

Bioinformatics issues form an interesting category of tags. These include tags that describe a variant’s mapping to different assemblies of the human genome (OTH, CFL, ASP), whether a variant is a unique contig allele (NOC), whether variants have more than two alleles from different submissions (NOV), or whether a variant has a genotype conflict (GCF); 22% of the variants are tagged with one or more of these issues. Bioinformatics issues speak to the fact that the reference sequence is much more complicated than we like to think.

Because my tag grouping is arbitrary, I’ve also included validated variants (VLD) as a bioinformatics issue. In this case it's a positive issue, because 30% of dbSNP’s variants have been validated in some way. Of course, a more pessimistic view would note that 70% are not validated. Indeed, from various communications, dbSNP may have a false positive rate on the order of 10%.

The last two categories organize tags by population biology and variants contributed by the 1000 Genomes Project. Population biology tags G5A and G5 describe 5% minor allele frequencies computed by different methods. GNO variants are those that have a genotype available. Over 24 million (85%) variants have been delivered through the 1000 Genomes Project's two pilot phases (KGPilot1, KGPilot123). Of these, approximately 4 million have been genotyped (PH1). The 1000 genomes category also includes the greatest number of unused tags as this, like everything else we are doing in genomics today, is a work in progress.
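The category tallies above follow the same pattern: map each tag to a category and count each variant once per category it touches. A sketch, assuming a small hand-picked subset of the tag-to-category assignments described in the text and two toy records:

```python
import io

# Partial tag-to-category map echoing the six groupings in the text
# (only a few representative tags per category are shown here).
categories = {
    "CLN": "clinical value", "PM": "clinical value", "OM": "clinical value",
    "SLO": "link outs", "S3D": "link outs", "PMC": "link outs",
    "INT": "gene structure", "NSM": "gene structure", "SYN": "gene structure",
    "ASP": "bioinformatics issues", "GCF": "bioinformatics issues", "VLD": "bioinformatics issues",
    "G5": "pop. bio.", "GNO": "pop. bio.",
    "KGPilot1": "1000 genomes", "PH1": "1000 genomes",
}

# Toy VCF records, not real dbSNP entries.
sample_vcf = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t10327\trs0000001\tT\tC\t.\t.\tVC=SNV;INT;GNO;VLD\n"
    "7\t140100\trs0000002\tA\tT\t.\t.\tVC=SNV;PM;CLN;NSM\n"
)

counts = {}
for line in sample_vcf:
    if line.startswith("#"):
        continue
    # Keep only the tag name from key=value entries.
    tags = {field.split("=")[0] for field in line.rstrip().split("\t")[7].split(";")}
    # Count a variant once per category, however many of its tags match.
    for category in {categories[t] for t in tags if t in categories}:
        counts[category] = counts.get(category, 0) + 1

print(counts)
```

The second toy record, for instance, lands in both the clinical value and gene structure categories, which is why the category percentages in the text can sum to more than 100%.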

Why is this important?

When next generation sequencing is discussed, it is often accompanied by a conversation about the bioinformatics bottleneck, which deals with the computational phase of data analysis, where raw sequences are reduced to quantitative values or lists of variation. Some also discuss a growing information bottleneck that refers to the interpretation of the data. That is, what do those patterns of variation or lists of differentially expressed genes mean? Developing gene models of health and disease, and other biological insights, requires that assay data can be integrated with other forms of existing information. However, this information is held in a growing number of specialized databases and literature sources that are scattered and changing rapidly. Integrating such information into systems, beyond simple data links, will require deep characterization of each resource to put assay data into a biological context.

The work described in this and the previous post provides a short example of the kinds of things that will need to be done with many different databases to increase their utility in developing biological models that link genotype to phenotype.

Further Reading

Sherry ST, Ward M, & Sirotkin K (1999). dbSNP-database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome research, 9 (8), 677-9 PMID: 10447503

1000 Genomes Project Consortium, Durbin RM, Abecasis GR, Altshuler DL, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, & McVean GA (2010). A map of human genome variation from population-scale sequencing. Nature, 467 (7319), 1061-73 PMID: 20981092

Tuesday, March 1, 2011

Data Analysis for Next Generation Sequencing: Challenges and Solutions

Join us next Tuesday, March 8 at 10:00 am Pacific Time for a webinar on Next Gen sequence data analysis.

Register Today!

Please note the previous post listed Wed. as the day. The correct day is Tue.