Tuesday, December 4, 2012

Commonly Rare

Rare is the new common. The final month of the year is always a good time to review progress and think about what's next. In genetics, massively parallel next generation sequencing (NGS) technologies have been a dominant theme, and for good reason.

Unlike the previous high-throughput genetic analysis technologies (Sanger sequencing and microarrays), NGS allows us to explore genomes far more deeply and measure functional elements and gene expression globally.

What have we learned?

Distribution of rare and common variants. From [1] 
The ENCODE project has produced a picture in which a much greater fraction of the genome may be involved in some functional role than previously understood [1]. However, a larger theme has been observing rare variation and trying to understand its impact on human health and disease. Because the enzymes that replicate DNA and correct errors are not perfect, each time a genome is copied a small number of mutations are introduced, on average between 35 and 80. Since sperm are produced continuously, fathers contribute more mutations than mothers, and the number of new mutations increases with the father's age [2]. While the number per child is tiny relative to the three billion bases of the father's contributed genome, rare diseases and intellectual disorders can result.

A consequence is that the exponentially growing human population has accumulated a very large number of rare genetic variants [3]. Many of these variants can be predicted to affect phenotype, and many more may modify phenotypes in as yet unknown ways [4,5]. We are also learning that variants generally fall into two categories: they are either common to all populations or confined to specific populations (figure). More importantly, for a given gene the number of rare variants can vastly outnumber the number of previously known common variants.

Another consequence of the high abundance of rare variation is how it impacts the resources that are used to measure variation and map disease to genotypes. For example, microarrays, which have been the primary tool of genome-wide association studies, utilize probes developed from a human reference genome sequence. When rare variants are factored in, many probes have several issues, ranging from "hidden" variation within a probe to a probe simply not being able to measure a variant that is present. Linkage block size is also affected [6]. What this means is that the best arrays going forward will be tuned to specific populations. It also means we need to devote more energy to developing refined reference resources, because the current tools do not adequately account for human diversity [6,7].

What's next?

Rare genetic variation has been understood for some time. What's new is understanding just how extensive these variants are in the human population, a consequence of the population's recent rapid expansion under very little selective pressure. Hence, linking variation to health and disease is the next big challenge and the cornerstone of personalized medicine, or, as some prefer, precision medicine. Conquering this challenge will require detailed descriptions of phenotypes, in many cases at the molecular level. As the vast majority of variants, benign or pathogenic, lie outside of coding regions, we will need to deeply understand how those functional elements, as initially defined by ENCODE, are affected by rare variation. We will also need to layer in epigenetic modifications.

For the next several years the picture will be complex.


[1] 1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422), 56-65. PMID: 23128226

[2] Kong, A., et al. (2012). Rate of de novo mutations and the importance of father's age to disease risk. Nature, 488(7412), 471-475. DOI: 10.1038/nature11396

[3] Keinan, A., and Clark, A. (2012). Recent Explosive Human Population Growth Has Resulted in an Excess of Rare Genetic Variants. Science, 336(6082), 740-743. DOI: 10.1126/science.1217283

[4] Tennessen, J., et al. (2012). Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes. Science, 337(6090), 64-69. DOI: 10.1126/science.1219240

[5] Nelson, M., et al. (2012). An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People. Science, 337(6090), 100-104. DOI: 10.1126/science.1217876

[6] Rosenfeld JA, Mason CE, and Smith TM (2012). Limitations of the human reference genome for personalized genomics. PLoS ONE, 7(7). PMID: 22811759

[7] Smith TM and Porter SG (2012). Genomic Inequality. The Scientist.

Sunday, August 5, 2012

Remembering Chris Abajian

Chris Abajian was a change catalyst. Using a biochemical analogy, passion, creativity, and intellect were his catalytic triad. Together with Joe Slagel, Chris and I started Geospiza in 1997. Sadly, Chris recently died in a hiking accident (7/30/12). In remembrance, I'll share a few stories from our times together.

I met Chris during my postdoc in Leroy Hood's laboratory in 1994. Those were the early days of the human genome project, and we hired Chris because, in Lee's view, we were going to build the best software if we had professional software engineers on the team. Lee was right.

Chris accepted our offer, and from his first day he made it clear this was not just a job where he could apply his software development talents; it was an opportunity to have an impact. And he did.


Chris used his passion, creativity, and intellect to identify problems that needed to be solved, and then advocate creative solutions, passionately. One of his first programs was Sputnik, a tool that identifies microsatellite sequences. Sputnik was inspired by a co-worker of ours, Lee Rowen. One day Chris observed Lee hunched over a ream of paper with printed DNA sequences in one hand and a highlighter in the other. When he inquired as to what she was doing, she responded, "identifying microsatellites."

New to biology, Chris asked what those were. Lee explained that they are small repeating patterns of di-, tri-, tetra-, or slightly longer nucleotide sequences, and that we are interested in them because they can be involved in disease and change gene regulation. Chris quickly went to work. He talked to everyone around, learned that the repeated patterns were not always perfect, and used this information to develop an algorithm and scoring table that could identify microsatellite patterns with good accuracy.
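To make the idea concrete, here is a minimal sketch of that kind of match/mismatch scoring scheme in Python. This is an illustration of the general approach, not Sputnik's actual code; the scoring values and thresholds are invented for the example.

```python
def find_microsatellites(seq, motif_lengths=(2, 3, 4, 5),
                         match=1, mismatch=-6, min_score=8):
    """Scan for short tandem repeats, tolerating imperfect copies.

    At each position, try each candidate motif length and extend the
    repeat while a running match/mismatch score stays above the
    single-mismatch penalty; report the best-scoring extension.
    """
    hits = []
    for i in range(len(seq)):
        for m in motif_lengths:
            motif = seq[i:i + m]
            if len(motif) < m:
                continue
            score, best_score, end = 0, 0, i + m
            j = i + m
            while j < len(seq) and score > mismatch:
                # Compare each base to the corresponding motif position.
                score += match if seq[j] == motif[(j - i) % m] else mismatch
                if score > best_score:
                    best_score, end = score, j + 1
                j += 1
            if best_score >= min_score:
                hits.append((i, end, motif))
    return hits

# An (AC)n repeat embedded in flanking sequence is found despite the
# surrounding non-repeat bases:
print(find_microsatellites("AAACACACACACACAGGG"))
```

Because scoring resumes after a mismatch (until the score falls too far), slightly imperfect repeats still accumulate a high score, which is the key insight Chris built into Sputnik.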

Today a Google search on "sputnik microsatellite" or "sputnik-microsatellite" yields ~191,000 or ~65,000 hits, respectively. What's even more interesting is the number of papers, 12 and 15 years later, that compare different microsatellite algorithms to Sputnik [1,2]. Not bad for a music major without any formal biology training!


After Sputnik, Chris turned his attention to the next problem. This was in the early days of Phred and Phrap (P. Green, still unpublished), and we had no way to work with DNA sequence assemblies in a graphical user interface (GUI). Not everyone was convinced we needed to build a whole new application; it would be a significant undertaking, and other tools could be hacked to view Phrap assemblies. This is when we learned that when Chris set out to do something, he was going to get it done, and do so convincingly. To Chris, and some others, it was clear Phrap needed its own GUI, so he set to work, debated the points, and got buy-in. In collaboration with David Gordon, Chris proceeded to build Consed. Chris worked on the project for only a short time, but the work was a success. Seventeen years later, David has developed a large, loyal user base and continues to develop new features for Consed [3].

The hunt for BRCA1

Chris and I worked closely on many software development projects, starting with data delivery for BRCA1. In 1994, we were asked to help "hunt" for the gene. In collaboration with Mary-Claire King, Francis Collins, Maynard Olson, and Lee Hood, we set out to find the BRCA1 gene. It had been previously localized to a large region of chromosome 17 by the King and Collins groups, and they had created a cosmid library of the region. With the high-throughput sequencing technology of 1994, we could include DNA sequencing in our strategy, one cosmid at a time. So, our job in the lab was to get cosmid DNA clones from the King and Collins libraries, sequence them, and make the data available to everyone simultaneously. How were we going to do that?

With web technology.

In 1994 the Mosaic web browser was new. Chris suggested that we could post the sequences to a website and send emails to the respective parties when the data were posted. Problem solved!

During this time, I was learning to program and developing automation systems. It was a no-brainer, and we set to work: Chris created a framework that I could use to create automation scripts. This became a theme that would result in several more successful projects and lead to the next adventure.


One day in April of 1997, Chris, Joe, and I squeezed into the cab of Chris's small Toyota truck and headed to the airport to interview with a new bioinformatics company called Pangea. We were hired, but it was soon clear that we needed to do something different. After a few rounds of passionate conversation we knew we were going to form a company, and we did.

Geospiza started in October of 1997, and while Chris was with us for only a short period of time, he made contributions that would last. Geospiza continues, now within PerkinElmer, and there are probably still a few lines of his original code working within our LIMS.

It's amazing to think about his accomplishments over the four years we spent together.  We enjoyed many good times discussing science, literature, and music. Chris will be missed.

1. Kofler, R. (2007). SciRoKo: a new tool for whole-genome microsatellite search and investigation. Bioinformatics, 23(13), 1683-1685. DOI: 10.1093/bioinformatics/btm157

2. Leclercq, S. (2007). Detecting microsatellites within genomes: significant variation among algorithms. BMC Bioinformatics, 8(1), 125. DOI: 10.1186/1471-2105-8-125

3. Gordon D, Abajian C, and Green P (1998). Consed: a graphical tool for sequence finishing. Genome Res., 8, 195-202.

Thursday, July 12, 2012

Resources for Personalized Medicine Need Work

Yesterday (July 11, 2012), PLoS ONE published an article prepared by my colleagues and me entitled "Limitations of the Human Reference Genome for Personalized Genomics."

This work, supported by Geospiza's SBIR targeting ways to improve mutation detection and annotation, explored some of the resources and assumptions that are used to measure and understand sequence variation. As we know, a key deliverable of the human genome project was a high-quality reference sequence that could be used to annotate genes, develop research tools like genotyping and microarray assays, and provide insights to guide software development. Projects like HapMap used these resources to provide additional understanding of genetic linkage in populations.

Decreasing sequencing costs
Since those early projects, DNA sequencing costs have plummeted. As a result, endeavors such as the 1000 Genomes Project (1KGP) and public contributions from Complete Genomics (CG) have dramatically increased the number of known sequence variants. A question worth asking is how these new data contribute to an understanding of the utility of the current resources and assumptions that have guided genomics and genetics for the past six or seven years.

Number of variants by dbSNP build
To address the above question, we evaluated several assay and software tools that were based on the human genome reference sequence in the context of new data contributed by 1KGP and CG. We found a high frequency of confounding issues with microarrays, and many cases where invalid assumptions, encoded in bioinformatics programs, underestimate variability or possibly misidentify the functional effects of mutations. For example, 34% of published array-based GWAS studies for a variety of diseases utilize probes that contain undocumented variation or map to regions of previously unknown structural variation. Similarly, estimates of linkage disequilibrium block size decrease as the number of known variants increases.

The significance of this work is that it documents what many are anecdotally experiencing. As we continue to learn about the contributing role of rare variation in human disease we need to fully understand how current resources can be used and work to resolve discrepancies in order to create an era of personalized medicine.

Rosenfeld JA, Mason CE, and Smith TM (2012). Limitations of the human reference genome for personalized genomics. PLoS ONE, 7(7). DOI: 10.1371/journal.pone.0040294

Tuesday, May 8, 2012

FinchTV on Lion

Okay, it took a while, but FinchTV, Geospiza's popular Sanger sequencing trace viewer, is now available on Mac OS X Lion.

The same great features are all still there, but the underlying code has been updated to run the application as a native Intel binary, so it will keep working well into the future.

Features that make FinchTV cool

In addition to the basic things you'd expect a trace viewer to do (open AB1 or SCF files; view bases, electropherogram peaks, and quality values; reverse complement sequences; and dynamically scale data), FinchTV also lets you edit bases, print traces, and view detailed information about your trace file. And you can open files with a simple drag-and-drop action, as you'd expect from a modern desktop application.

That's not all, there's more

What really makes FinchTV stand out is the ability to view your trace in a single-pane or multi-pane view. The latter is ideal for visualizing the full data contained in the trace. In multi-pane view you can even change the horizontal or vertical scales.

FinchTV also integrates with NCBI's BLAST services. In the application you can highlight a region of sequence and either use the edit menu or right-click with your mouse to get the BLAST menu and choose between nucleotide (BLASTn), translated nucleotide (BLASTx), translated nucleotide vs. translated database (TBLASTx), or mega BLAST options.

Less obvious features include the ability to search for subsequences in your DNA sequence and to observe the raw data for a trace. Subsequence searching uses Perl-style regular expressions and a greedy algorithm, so you can enter a search term and, each time you hit return, the next best match is identified, including subsequences within subsequences. For example, the regular expression ATG((?!TAG|TAA|TGA)...)+(TAG|TAA|TGA) will find open reading frames. If one reading frame is contained within another, the longer reading frame is found first.
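The same open reading frame pattern works in any Perl-compatible regex engine, so you can try it outside FinchTV. Here is a quick check in Python; the example sequence is made up for illustration:

```python
import re

# The ORF pattern from the post: a start codon, one or more codons
# that the negative lookahead confirms are not stop codons, then a
# stop codon. The greedy + explains why the longer of two nested
# reading frames is matched first.
orf_pattern = re.compile(r"ATG((?!TAG|TAA|TGA)...)+(TAG|TAA|TGA)")

seq = "GGATGAAACCCGGGTAGTT"
match = orf_pattern.search(seq)
print(match.group(0))  # ATGAAACCCGGGTAG
```

Each repetition of the lookahead group consumes exactly one codon, so the match always stays in frame with the initial ATG.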

Another less known feature is the raw data view. The raw data view is essential for those who work with sequencing instruments. Many do not realize the standard electropherogram trace image is processed data. That is, a mathematical matrix computation is applied to normalize signals, subtract background fluorescence, and correct natural mobility shifts in the data. The result is an easy-to-interpret view of the data. However, if you need to troubleshoot your sequencer and see the real details, you'll want to view the rawest data possible. FinchTV is the only available trace viewer, outside of ABI's software, that provides this capability.
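To make the raw-versus-processed distinction concrete, here is a toy sketch in Python with NumPy of the kind of matrix computation described above. The signal values and crosstalk coefficients are invented for illustration; they are not ABI's actual calibration numbers, and real basecallers do considerably more (including mobility-shift correction).

```python
import numpy as np

# Raw trace data: rows are time points, columns are the 4 dye channels.
raw = np.array([
    [100.0,  20.0,   5.0,   2.0],
    [ 15.0, 120.0,  10.0,   4.0],
])

# Crude per-channel background estimate (minimum signal over time).
background = raw.min(axis=0)

# A made-up crosstalk matrix modeling spectral overlap between dyes:
# each dye's emission bleeds slightly into neighboring channels.
crosstalk = np.array([
    [1.00, 0.10, 0.02, 0.00],
    [0.10, 1.00, 0.10, 0.02],
    [0.02, 0.10, 1.00, 0.10],
    [0.00, 0.02, 0.10, 1.00],
])

# Undo the mixing: subtract background, then apply the inverse matrix
# to recover per-dye signals for the familiar electropherogram view.
corrected = (raw - background) @ np.linalg.inv(crosstalk)
print(corrected.shape)  # (2, 4)
```

When you troubleshoot an instrument, you want to see the `raw` matrix before any of these transforms, which is exactly what FinchTV's raw data view exposes.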

To obtain FinchTV please visit http://www.geospiza.com/Products/finchtv.shtml

Sunday, April 22, 2012

Sneak Peek: A Practical Approach to Detecting Nucleotide Variants in NGS Data

Join us Thursday, May 3, 2012 9:00 am (Pacific Time) for a webinar on analyzing DNA sequencing data with hundreds of thousands to millions of nucleotide variants.

This webinar discusses DNA variant detection using Next Generation Sequencing for targeted and exome resequencing applications, as well as whole-transcriptome sequencing. The presentation includes an overview of each application and its specific data analysis needs and challenges, with a particular emphasis on variant detection methods and approaches for individual samples as well as multi-sample comparisons. For in-depth comparisons of variant detection methods, Geospiza’s cloud-based GeneSifter® Analysis Edition software will be used to assess sample data from NCBI’s GEO and SRA.

For more information, please visit the registration page.

Friday, March 2, 2012

You Have 500 Million Reads – Now What?

Come find out at our two Seattle seminars on March 6 and March 13. In these seminars we will show you how you can get answers to your RNA-Seq and Exome Analysis questions.

From the brochure:

Learn how GeneSifter® Analysis Edition helps you manage the bioinformatics workload, so you can focus on scientific discovery. GeneSifter combines comprehensive data analysis with meaningful visualization tools to help you get the most out of your Next Generation Sequencing and microarray data.

Please visit http://www.perkinelmer.com/GeospizaSeminar to register.

Tuesday, February 14, 2012

Sneak Peek: Poster Presentations at AGBT

The annual Advances in Genome Biology and Technology (AGBT) meeting begins tomorrow and would not be complete without a couple of contributions from @finchtalk.

Follow the tweets at #AGBT and if you are at the conference visit posters 334 and 335 (abstracts below). Also, visit Lanai 189 to see the latest advances in genome technology and software from the Caliper and Geospiza organizations within PerkinElmer. 

Poster Abstracts

Poster 335: Why is the $1000 Genome so Expensive? 

Rapid advances in sequencing technology are enabling leading institutions to establish programs for genomics-based medicine. Some estimate that 5000 genomes were sequenced during 2011, and an additional 30,000 will be sequenced by the end of 2012. Despite this terrific progress, the infrastructure required to make genomics-based medicine a norm, rather than a specialized application, is lacking. Although DNA sequencing costs are decreasing, sample preparation bottlenecks and data handling costs are increasing. In many instances, the resources (e.g., time, capital investment, experience) required to effectively conduct medical-based sequencing are prohibitive.

We describe a model system that uses a variety of PerkinElmer products to address three problems that continue to impact the widescale adoption of genomics-based medicine: organizing and tracking sample information, sample preparation, and whole genome data analysis. Specifically, PerkinElmer’s GeneSifter® LIMS and analysis software, Caliper instrumentation, and DNA sequencing services can provide independent or integrated solutions for generating and processing data from whole-genome sequencing.

Poster 334: Limitations of the Human Reference Genome Sequence

The human genome reference sequence is well characterized and highly annotated, and its development represents a considerable investment of time and money. This sequence is the foundation for genotyping microarrays and DNA sequencing analysis. Yet, in several critical respects the reference sequence remains incomplete, as are the many research tools that are based on it. We have found that, when new variation data from the 1000 Genomes Project (1KG) and Complete Genomics (CG) are used to measure the effectiveness of existing tools and concepts, approximately 50% of probes on commonly used genotyping arrays contain confounding variation, impacting the results of 37% of GWAS studies to date. The sources of confounding variation include unknown variants in close proximity to the probed variant and alleles previously assumed to be di-allelic that are in fact poly-allelic. When mean linkage disequilibrium (LD) lengths from HapMap are compared to 1KG data, LD decreases from 16.4 Kb to 7.0 Kb within common samples and further decreases to 5.4 Kb when random samples are compared.

While many of the observations have been anecdotally understood, quantitative assessments of resources based on the reference sequence have been lacking. These findings have implications for the study of human variation and medical genetics, and ameliorating these discrepancies will be essential for ushering in the era of personalized medicine.

Thursday, January 5, 2012

Bio Databases 2012

Let's get 2012 started with an update on the growth of biological databases. About this time last year, I summarized Nucleic Acids Research's (NAR) annual database issue, where authors submit papers describing updates to existing databases and present new databases. In that post, I predicted that we would see between 60 and 120 new databases in 2011. This year's update included 92 new databases [1].

How have things changed?
Overall, the number of databases tracked by NAR has grown from 1330 in 2011 to 1380 in 2012 (left-hand figure). As 92 new databases were added, 42 must have been dropped. Interestingly, when one views the new database list page, only 90 databases are listed. I not only counted this using my fingers and toes, more than twice, but also copied the list into Excel to verify my original count. I wonder what the two unaccounted-for new databases are? Never mind the 42 that disappeared; those are not listed.

What are the new databases?

Reviewing the list of the 90 new databases does not reveal any clear patterns or trends regarding changes in biology. Instead, the list reflects increasing complexity and specialization. Some databases tackle new kinds of data, many appear to be refinements of existing databases, and others contain highly specific information. For example, the UMD-BRCA1/BRCA2 database contains information about BRCA1 and BRCA2 mutations detected in France. A few databases, such as BitterDB (a database of bitter taste molecules and receptors), Newt-omics (data on the red spotted newt, Notophthalmus viridescens), and IDEAL (Intrinsically Disordered proteins with Extensive Annotations and Literature), caught my attention because of their interesting descriptions. I especially like Disease Ontology: Ontology for a variety of human diseases. I wonder which diseases are their favorites?

Also of interest is the growth of wikis as databases. This year, 10 of the new databases are wikis, where the community is invited to contribute to the resource in a fashion similar to Wikipedia. I should say almost 10 are wikis: one, SeqAnswers, is actually a forum, which is technically not a wiki; of course, some might argue that a wiki is not really a database either.

So, what does this mean?

The growing list of databases is an interesting way to think about biology, DNA, and proteins. Simply examining the list is instructive, and one can think of ways that mining several resources together could create new insights. However, therein lies the challenge. Many of these resources are not designed to interoperate, and it is not clear how long they will last or be updated. This final point was made at the close of the introductory article. In a section entitled "Sustainability of bioinformatics databases," the authors discussed the past year's controversy surrounding NCBI's SRA database, previous challenges with Swiss-Prot, and the current instability of the KEGG and TAIR databases. They cite a proposal to centralize more resources [2].

But in the end, biological databases really do model biology and follow the principles of evolution. Perhaps it is apropos that new ones emerge through speciation, while others go extinct.

1. Galperin, M., & Fernandez-Suarez, X. (2011). The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Research, 40(D1). DOI: 10.1093/nar/gkr1196

2. Parkhill J, Birney E, and Kersey P (2010). Genomic information infrastructure after the deluge. Genome Biol., 11, 402.