
Sunday, May 12, 2013

Sneak Peek: Elucidating the Effects of the Deepwater Horizon Oil Spill on the Atlantic Oyster Using RNA-Sequencing Data Analysis Methods

Join us on Tuesday, May 21st, at 10:00 AM Pacific Time / 1:00 PM Eastern Time for a webinar on the effects of the Deepwater Horizon oil spill.

Speakers:
Natalia G. Reyero, PhD – Mississippi State University
N. Eric Olson, PhD – Sr. Leader, Product Development, PerkinElmer

The Deepwater Horizon oil spill exposed the commercially important Atlantic oyster to over 200 million gallons of spill-related contaminants. To study toxicity effects, we sequenced the RNA of oyster samples collected before and after the spill. In this webinar, we will compare and contrast the different data analysis methodologies used to address the challenge of working with an organism that lacks a well-annotated genome assembly. Furthermore, we will discuss how the newly generated information provided insight into the underlying biological effects of oil and dispersants on Atlantic oysters during the Deepwater Horizon oil spill.

REGISTER HERE to attend.

Thursday, January 17, 2013

Bio Databases 2013

I seem to have committed to an annual ritual of summarizing the Nucleic Acids Research (NAR) Database Issue [1]. I do this because it is important to understand and emphasize the increasing role of data analysis in modern biology, and to remind us of the challenges that persist in turning data into knowledge.

Sometimes I hear individuals say they are building a database of all knowledge. To them I say good luck! The reality is that new knowledge is developed from unique insights, which are derived from specialized aggregations of information. Hence, as more data become available through decreasing data collection costs, the number of resources and tools used to organize, analyze, and annotate data and information increases. Interestingly, data costs decrease because technical improvements increase production exponentially, whereas database growth is linear. Collecting data is the easy part.

How many are there?

Databases live in the wild and thus are hard to count. Reading the introduction to the database issue, one would think 88 new databases were added (cited), but comparing the number tracked by NAR in 2012 (1380) with 2013 (1512) gives 132. Moreover, the databases tracked by NAR are contributed by their authors, and some don't bother. For example, SeattleSNPs, home of the important SeattleSeq Annotation and Genome Variant Servers*, is not listed in NAR. Nevertheless, the NAR registry continues to grow by about 100 databases per year.


What's new?

Last year, I noted that the new databases did not reflect any discernible pattern in terms of how the field of biology was changing; rather, they reflected increasing specialization and complexity. That trend continues, but this year Fernández-Suárez and Galperin note the emergence of new databases for studying human disease. Altogether, eight such databases were cited in the introduction, and several others are listed in a table highlighting the new databases. While databases specializing in human genetics are not new, the past year saw an increased emphasis on understanding the relationship between genotype and phenotype as we advance our understanding of rare variation and population genetics.

As noted, many databases support human genomics research. If you visit the NAR Database Summary Category List and expand the Human Genes and Diseases list, you find four subcategories (General Human Genetics, General Polymorphism, Cancer Gene, and Gene-, System-, or Disease-specific Databases) listing approximately 174 databases. I say approximately because, as noted above, databases are hard to count. Curiously, just above Human Genes and Diseases is a category called Human and Vertebrate Genomes. Databases are hard to classify too.

What's useful?

It is clear that the growing number of databases reflects an increasing level of specialization. A high degree of redundancy is also likely. Ten microRNA databases (found by virtue of starting with "miR") cover general and specific topics, including miRNAs that are predicted from sequence or literature, verified by experiment as existing or having a target, possibly pathogenic, or present in different organisms. It would be interesting to see which of these databases hold the same data, but that is hard because some sites make all of their data available for download while others make their data searchable only. Even in the former case, comparisons require that the data first be converted into a common format. Hence, access and interoperability issues persist.

Databases also persist. Fernández-Suárez and Galperin commented on efforts to curate the NAR collection. The annual attrition rate is less than 5%, and greater than 90% of the databases are functional as determined by their response to webbots. Some have merged into other projects. What is not known is the quality of the information. In other words, how are databases verified for accuracy or maintained to reflect our changing state of knowledge? As databases become increasingly used in medical sequencing, caveat emptor changes to caveat venditor, and validation will be a critical component of design and maintenance. Perhaps future issues of the NAR database update will comment on these challenges.

Reference:
[1] Fernández-Suárez XM, and Galperin MY (2013). The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Research, 41 (D1). PMID: 23203983

Footnote:
* The SeattleSeq and Genome Variant Server links will break at the next update because the URLs contain the respective database version numbers.

Tuesday, December 4, 2012

Commonly Rare

Rare is the new common. The final month of the year is always a good time to review progress and think about what's next.  In genetics, massively parallel next generation sequencing (NGS) technologies have been a dominating theme, and for good reason.

Unlike the previous high-throughput genetic analysis technologies (Sanger sequencing and microarrays), NGS allows us to explore genomes far more deeply and to measure functional elements and gene expression globally.

What have we learned?

Distribution of rare and common variants. From [1] 
The ENCODE project has produced a picture in which a much greater fraction of the genome may be involved in some functional role than previously understood [1]. However, a larger theme has been observing rare variation and trying to understand its impact on human health and disease. Because the enzymes that replicate DNA and correct errors are not perfect, each time a genome is copied a small number of mutations are introduced, on average between 35 and 80. Since sperm are continuously produced, fathers contribute more mutations than mothers, and the number of new mutations increases with the father's age [2]. While the number per child, relative to the father's contributed three-billion-base genome, is tiny, rare diseases and intellectual disorders can result.

A consequence is that the exponentially growing human population has accumulated a very large number of rare genetic variants [3]. Many of these variants can be predicted to affect phenotype, and many more may modify phenotypes in yet unknown ways [4,5]. We are also learning that variants generally fall into two categories: they are either common to all populations or confined to specific populations (figure). More importantly, for a given gene the rare variants can vastly outnumber the previously known common variants.

Another consequence of the high abundance of rare variation is its impact on the resources used to measure variation and map disease to genotypes. For example, microarrays, which have been the primary tool of genome-wide association studies, use probes developed from a human reference genome sequence. When rare variants are factored in, many probes have issues ranging from "hidden" variation within a probe to a probe simply being unable to measure a variant that is present. Linkage block size is also affected [6]. What this means is that the best arrays going forward will be tuned to specific populations. It also means we need to devote more energy to developing refined reference resources, because the current tools do not adequately account for human diversity [6,7].

What's next?

Rare genetic variation has been understood for some time. What's new is understanding just how extensive these variants are in the human population, a result of the population's recent, rapid expansion under very little selective pressure. Hence, linking variation to health and disease is the next big challenge and the cornerstone of personalized medicine, or, as some prefer, precision medicine. Conquering this challenge will require detailed descriptions of phenotypes, in many cases at the molecular level. Because the vast majority of variants, benign or pathogenic, lie outside of coding regions, we will need to deeply understand how the functional elements initially defined by ENCODE are affected by rare variation. We will also need to layer in epigenetic modifications.

For the next several years the picture will be complex.

References:

[1] 1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491 (7422), 56-65. PMID: 23128226

[2] Kong, A., et al. (2012). Rate of de novo mutations and the importance of father's age to disease risk. Nature, 488 (7412), 471-475. DOI: 10.1038/nature11396

[3] Keinan, A., and Clark, A. (2012). Recent Explosive Human Population Growth Has Resulted in an Excess of Rare Genetic Variants. Science, 336 (6082), 740-743. DOI: 10.1126/science.1217283

[4] Tennessen, J., et al. (2012). Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes. Science, 337 (6090), 64-69. DOI: 10.1126/science.1219240

[5] Nelson, M., et al. (2012). An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People. Science, 337 (6090), 100-104. DOI: 10.1126/science.1217876

[6] Rosenfeld JA, Mason CE, and Smith TM (2012). Limitations of the human reference genome for personalized genomics. PLoS ONE, 7 (7). PMID: 22811759

[7] Smith TM, and Porter SG (2012). Genomic Inequality. The Scientist.



Sunday, August 5, 2012

Remembering Chris Abajian

Chris Abajian was a change catalyst. Using a biochemical analogy, passion, creativity, and intellect were his catalytic triad. Together with Joe Slagel, Chris and I started Geospiza in 1997. Sadly, Chris died recently in a hiking accident (July 30, 2012). In remembrance, I'll share a few stories from our times together.

I met Chris during my postdoc in Leroy Hood's laboratory in 1994. Those were the early days of the Human Genome Project, and we hired Chris because, in Lee's view, we were going to build the best software if we had professional software engineers on the team. Lee was right.

Chris accepted our offer, and from his first day he made it clear this was not just a job where he could apply his software development talents; it was an opportunity to have an impact. And he did.

Sputnik

Chris used his passion, creativity, and intellect to identify problems that needed to be solved, and then advocated creative solutions, passionately. One of his first programs was Sputnik, a tool that identifies microsatellite sequences. Sputnik was inspired by a co-worker of ours, Lee Rowen. One day Chris observed Lee hunched over a ream of paper with printed DNA sequences in one hand and a highlighter in the other. When he inquired as to what she was doing, she responded, "identifying microsatellites."

New to biology, Chris asked what those were. Lee explained that they are small repeating patterns of di-, tri-, tetra-, or slightly longer nucleotide sequences, and that we are interested in them because they can be involved in disease and change gene regulation. Chris quickly went to work. He talked to everyone around, learned that the repeated patterns were not always perfect, and used this information to develop an algorithm and scoring table that could identify microsatellite patterns with pretty good accuracy.
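For the curious, the gist of such a scan can be sketched in a few lines of Python: slide along the sequence, try each short repeat unit, and let matches build a score that occasional mismatches reduce but do not immediately kill. Everything below (unit sizes, scores, thresholds, the toy sequence) is an illustrative guess, not Sputnik's actual parameters.

```python
# A minimal sketch of a Sputnik-style scan; scores and thresholds are
# illustrative guesses, not the parameters Chris actually used.

def find_microsatellites(seq, unit_sizes=(2, 3, 4, 5), match=1, mismatch=-6,
                         min_score=8, max_drop=12):
    """Return (start, end, unit, score) for imperfect tandem repeats."""
    seq = seq.upper()
    hits = []
    for unit_len in unit_sizes:
        i = 0
        while i + 2 * unit_len <= len(seq):
            unit = seq[i:i + unit_len]
            score = best_score = 0
            j = best_j = i + unit_len
            # Extend past the first copy, tolerating occasional mismatches,
            # and stop once the score drops well below its peak.
            while j < len(seq) and score > best_score - max_drop:
                score += match if seq[j] == unit[(j - i) % unit_len] else mismatch
                j += 1
                if score > best_score:
                    best_score, best_j = score, j
            if best_score >= min_score:
                hits.append((i, best_j, unit, best_score))
                i = best_j          # skip past the reported repeat
            else:
                i += 1
    return hits

if __name__ == "__main__":
    # A CA repeat should be reported in this toy sequence.
    print(find_microsatellites("GGCACACACACATACACAGGTTTAGC"))
```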

Today a Google search on "sputnik microsatellite" or "sputnik-microsatelite" yields ~191,000 or ~65,000 hits, respectively. What's even more interesting is the number of papers, 12 and 15 years later, that compare different microsatellite-finding algorithms to Sputnik [1,2]. Not bad for a music major without any formal biology training!

Consed

After Sputnik, Chris turned his attention to the next problem. These were the early days of Phred and Phrap (P. Green, still unpublished), and we had no way to work with DNA sequence assemblies in a graphical user interface (GUI). Not everyone was convinced we needed to build a whole new application; it would be a significant undertaking, and other tools could be hacked to view Phrap assemblies. This is when we learned that when Chris set out to do something, he was going to get it done, and do so convincingly. To Chris, and some others, it was clear that Phrap needed its own GUI, so he set to work, debated the points, and got buy-in. In collaboration with David Gordon, Chris proceeded to build Consed. Chris worked on the project for only a short time, but the work was a success. Seventeen years later, David has developed a large, loyal user base and continues to add new features to Consed [3].

The hunt for BRCA1

Chris and I worked closely on many software development projects, starting with data delivery for BRCA1. In 1994, we were asked to help "hunt" for the gene. In collaboration with Mary-Claire King, Francis Collins, Maynard Olson, and Lee Hood, we set out to find the BRCA1 gene. It had previously been localized to a large region of chromosome 17 by the King and Collins groups, and they had created a cosmid library of the region. With the high-throughput sequencing technology of 1994, we could include DNA sequencing in our strategy, one cosmid at a time. So, our job in the lab was to get cosmid DNA clones from the King and Collins libraries, sequence them, and make the data available to everyone simultaneously. How were we going to do that?

With web-technology

In 1994 the Mosaic web browser was new. Chris suggested that we could post the sequences to a website and send emails to the respective parties when the data were posted. Problem solved!

During this time, I was learning to program and developing automation systems. It was a no-brainer, and we set to work: Chris created a framework that I could use to create automation scripts. This would become a theme that resulted in several more successful projects and led to the next adventure.

Geospiza

One day in April of 1997, Chris, Joe, and I squeezed into the cab of Chris's small Toyota truck and headed to the airport to interview with a new bioinformatics company called Pangea. We were hired, but it was soon clear that we needed to do something different. After a few rounds of passionate conversation we knew we were going to form a company, and we did.

Geospiza started in October of 1997, and while Chris was with us for only a short period of time, he made contributions that would last. Geospiza continues, now within PerkinElmer, and there are probably still a few lines of his original code working within our LIMS system.

It's amazing to think about his accomplishments over the four years we spent together.  We enjoyed many good times discussing science, literature, and music. Chris will be missed.

References 
1. Kofler, R. (2007). SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics, 23(13), 1683-1685. DOI: 10.1093/bioinformatics/btm157

2. Leclercq, S. (2007) Detecting microsatellites within genomes: significant variation among algorithms. BMC Bioinformatics, 8(1), 125. DOI: 10.1186/1471-2105-8-125

3. Gordon, D., Abajian, C., and Green, P. (1998). Consed: a graphical tool for sequence finishing. Genome Research, 8, 195-202.

Sunday, April 22, 2012

Sneak Peek: A Practical Approach to Detecting Nucleotide Variants in NGS Data


Join us Thursday, May 3, 2012 9:00 am (Pacific Time) for a webinar on analyzing DNA sequencing data with hundreds of thousands to millions of nucleotide variants.

Description:
This webinar discusses DNA variant detection using Next Generation Sequencing for targeted and exome resequencing applications, as well as whole transcriptome sequencing. The presentation includes an overview of each application and its specific data analysis needs and challenges, with a particular emphasis on variant detection methods and approaches for individual samples as well as multi-sample comparisons. For in-depth comparisons of variant detection methods, Geospiza's cloud-based GeneSifter® Analysis Edition software will be used to assess sample data from NCBI's GEO and SRA.

For more information, please visit the registration page.


Tuesday, February 14, 2012

Sneak Peek: Poster Presentations at AGBT

The annual Advances in Genome Biology and Technology (AGBT) meeting begins tomorrow and would not be complete without a couple of contributions from @finchtalk.

Follow the tweets at #AGBT and, if you are at the conference, visit posters 334 and 335 (abstracts below). Also, visit Lanai 189 to see the latest advances in genome technology and software from the Caliper and Geospiza organizations within PerkinElmer.

Poster Abstracts

Poster 335: Why is the $1000 Genome so Expensive? 

Rapid advances in sequencing technology are enabling leading institutions to establish programs for genomics-based medicine. Some estimate that 5,000 genomes were sequenced during 2011 and that an additional 30,000 will be sequenced by the end of 2012. Despite this terrific progress, the infrastructure required to make genomics-based medicine a norm, rather than a specialized application, is lacking. Although DNA sequencing costs are decreasing, sample preparation bottlenecks and data handling costs are increasing. In many instances, the resources (e.g. time, capital investment, experience) required to effectively conduct medical-based sequencing are prohibitive.

We describe a model system that uses a variety of PerkinElmer products to address three problems that continue to impact the widescale adoption of genomics-based medicine: organizing and tracking sample information, sample preparation, and whole genome data analysis. Specifically, PerkinElmer’s GeneSifter® LIMS and analysis software, Caliper instrumentation, and DNA sequencing services can provide independent or integrated solutions for generating and processing data from whole-genome sequencing.


Poster 334: Limitations of the Human Reference Genome Sequence

The human genome reference sequence is well characterized, highly annotated, and its development represents a considerable investment of time and money. This sequence is the foundation for genotyping microarrays and DNA sequencing analysis. Yet, in several critical respects the reference sequence remains incomplete, as are the many research tools that are based on it. We have found that, when new variation data from the 1000 Genomes Project (1Kg) and Complete Genomics (CG) are used to measure the effectiveness of existing tools and concepts, approximately 50% of probes on commonly used genotyping arrays contain confounding variation, impacting the results of 37% of GWAS studies to date. The sources of confounding variation include unknown variants in close proximity to the probed variant and alleles previously assumed to be di-allelic that are in fact poly-allelic. When mean linkage disequilibrium (LD) lengths from HapMap are compared to 1Kg data, LD decreases from 16.4 kb to 7.0 kb within common samples and further decreases to 5.4 kb when random samples are compared.

While many of the observations have been anecdotally understood, quantitative assessments of resources based on the reference sequence have been lacking. These findings have implications for the study of human variation and medical genetics, and ameliorating these discrepancies will be essential for ushering in the era of personalized medicine.

Thursday, January 5, 2012

Bio Databases 2012


Let's get 2012 started with an update on the growth of biological databases. About this time last year, I summarized Nucleic Acids Research's (NAR) annual database issue, where authors submit papers describing updates to existing databases and present new databases. In that post, I predicted that we would see between 60 and 120 new databases in 2011. This year's update included 92 new databases [1].

How have things changed?
Overall, the number of databases being tracked by NAR has grown from 1330 in 2011 to 1380 in 2012. As 92 new databases were added, 42 must have been dropped. Interestingly, when one views the new database list page, only 90 databases are listed. I not only counted this with my fingers and toes (more than twice), but also copied the list into Excel to verify my original count. I wonder what the two new, but unaccounted for, databases are. Never mind the 42 that disappeared; those are not listed.

What are the new databases?

Reviewing the list of the 90 new databases does not reveal any clear patterns or trends regarding changes in biology. Instead, the list reflects increasing complexity and specialization. Some databases tackle new kinds of data, many appear to be refinements of existing databases, and others contain highly specific information. For example, the UMD-BRCA1/BRCA2 database contains information about BRCA1 and BRCA2 mutations detected in France. A few databases caught my attention because of their interesting descriptions, such as BitterDB (a database of bitter taste molecules and receptors), Newt-omics (data on the red spotted newt, Notophthalmus viridescens), and IDEAL (Intrinsically Disordered proteins with Extensive Annotations and Literature). I especially like the Disease Ontology, an ontology for a variety of human diseases. I wonder which diseases are their favorites?

Also of interest is the growth of wikis as databases. This year, 10 of the new databases are wikis, where the community is invited to contribute to the resource in a fashion similar to Wikipedia. I should say almost 10 are wikis: one, SeqAnswers, is actually a forum, which is technically not a wiki, and of course some might argue that a wiki is not a database.

So, what does this mean?

The growing list of databases is an interesting way to think about biology, DNA, and proteins. Simply examining the list is instructional, and one can think of ways that mining several resources together could create new insights. However, therein lies the challenge. Many of these resources are not designed to interoperate, and it is not clear how long they will last or be updated. This final point was made at the close of the introductory article. In a section entitled "Sustainability of bioinformatics databases," the authors discussed the past year's controversy surrounding NCBI's SRA database, previous challenges with Swiss-Prot, and the current instability of the KEGG and TAIR databases. They cite a proposal to centralize more resources [2].

But in the end, biological databases really do model biology and follow the principles of evolution. Perhaps it is apropos that new ones emerge through speciation, while others go extinct.

1. Galperin, M., and Fernandez-Suarez, X. (2011). The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Research, 40 (D1). DOI: 10.1093/nar/gkr1196

2. Parkhill, J., Birney, E., and Kersey, P. (2010). Genomic information infrastructure after the deluge. Genome Biology, 11:402.


Friday, June 10, 2011

Sneak Peek: NGS Resequencing Applications: Part I – Detecting DNA Variants

Join us next Wednesday, June 15, for a webinar on resequencing applications.

Description
This webinar will focus on DNA variant detection using Next Generation Sequencing for targeted and exome resequencing, as well as whole transcriptome sequencing. The presentation will include an overview of each application and its specific data analysis needs and challenges. Topics will include secondary analysis (alignments, reference choices, variant detection), with a particular emphasis on DNA variant detection as well as multi-sample comparisons. For in-depth comparisons of variant detection methods, Geospiza's cloud-based GeneSifter Analysis Edition software will be used to assess sample data from NCBI's GEO and SRA. The webinar will also include a short presentation on how these tools can be deployed both by individual researchers and through Geospiza's Partner Program for NGS sequencing service providers.

Details:
Date and time: Wednesday, June 15, 2011 10:00 am
Pacific Daylight Time (San Francisco, GMT-07:00)
Wednesday, June 15, 2011 1:00 pm 
Eastern Daylight Time (New York, GMT-04:00)
Wednesday, June 15, 2011 6:00 pm
GMT Summer Time (London, GMT+01:00)
Duration: 1 hour

Wednesday, April 6, 2011

Sneak Peek: RNA-Sequencing Applications in Cancer Research: From fastq to differential gene expression, splicing and mutational analysis

Join us next Tuesday, April 12, at 10:00 am Pacific Time for a webinar focused on RNA-Seq applications in breast cancer research.

The field of cancer genomics is advancing quickly. News reports from the annual American Association for Cancer Research meeting indicate that whole genome sequencing studies, such as the 50 breast cancer genomes (WashU), are providing more clues about the genes that may be affected in cancer. Meanwhile, the ACLU/Myriad Genetics legal action over genetic testing for breast cancer mutations and disease predisposition continues to move toward the Supreme Court.

Breast cancer, like many other cancers, is complex.  Sequencing genomes is one way to interrogate cancer biology. However, the genome sequence data in isolation does not tell the complete story. The RNA, representing expressed genes, their isoforms, and non-coding RNA molecules, needs to be measured too. In this webinar, Eric Olson, Geospiza's VP of product development and principal designer of GeneSifter Analysis Edition, will explore the RNA world of breast cancer and present how you can explore existing data to develop new insights.

Abstract
Next Generation Sequencing applications allow biomedical researchers to examine the expression of tens of thousands of genes at once, giving researchers the opportunity to examine expression across entire genomes. RNA sequencing applications such as Tag Profiling, Small RNA, and Whole Transcriptome Analysis can identify and characterize both known and novel transcripts, splice junctions, and non-coding RNAs. These sequencing-based applications also allow for the examination of nucleotide variants. Next Generation Sequencing and these RNA applications allow researchers to examine the cancer transcriptome at an unprecedented level. This presentation will provide an overview of the gene expression data analysis process for these applications, with an emphasis on the identification of differentially expressed genes, identification of novel transcripts, and characterization of alternative splicing, as well as variant analysis and small RNA expression. Using data drawn from the GEO data repository and the Short Read Archive, NGS Tag Profiling, Small RNA, and NGS Whole Transcriptome Analysis data will be examined in breast cancer.

You can register at the webex site, or view the slides after the presentation.

Thursday, March 10, 2011

Sneak Peek: The Next Generation Challenge: Developing Clinical Insights Through Data Integration

Next week (March 14-18, 2011) is CHI's X-Gen Congress & Expo. I'll be there presenting a poster on the next challenge in bioinformatics, also known as the information bottleneck.

You can follow the tweet by tweet action via @finchtalk or #XGenCongress.

In the meantime, enjoy the poster abstract.

The next generation challenge: developing clinical insights through data integration

Next generation DNA sequencing (NGS) technologies hold great promise as tools for building a new understanding of health and disease. In the case of understanding cancer, deep sequencing provides more sensitive ways to detect the germline and somatic mutations that cause different types of cancer, as well as to identify new mutations within small subpopulations of tumor cells that can be prognostic indicators of tumor growth or drug resistance. Intense vendor competition amongst NGS platform and service providers is commoditizing data collection costs, making data more accessible. However, the single greatest impediment to developing relevant clinical information from these data is the lack of systems that create easy access to the immense bioinformatics and IT infrastructures needed for researchers to work with the data.

In the case of variant analysis, such systems will need to process very large datasets and accurately predict common, rare, and de novo levels of variation. Genetic variation must be presented in an annotation-rich, biological context to determine its clinical utility, frequency, and putative biological impact. Software systems used for this work must integrate data from many samples together with resources ranging from core analysis algorithms to application-specific datasets to annotations, all woven into computational systems with interactive user interfaces (UIs). Such end-to-end systems currently do not exist, but the parts are emerging.

Geospiza is improving how researchers understand their data in terms of its biological context, function, and potential clinical utility by developing methods that combine assay results from many samples with existing data and information resources from dbSNP, 1000 Genomes, cancer genome databases, GEO, SRA, and others. Through this work, and follow-on product development, we will produce integrated, sensitive assay systems that harness NGS to identify very low (1:1000) levels of changes between DNA sequences to detect cancerous mutations, emerging drug resistance, and early-stage signaling cascades.

Authors: Todd M. Smith (1), Christopher Mason (2)
(1) Geospiza Inc., Seattle, WA 98119, USA
(2) Weill Cornell Medical College, New York, NY 10021, USA

Friday, February 11, 2011

Variant Analysis and Sequencing Labs

Yesterday and two weeks ago, Geospiza released two important news items. The first announced PerkinElmer's selection of the GeneSifter® Lab and Analysis systems to support their new DNA sequencing service. The second announced our new SBIR award to improve variant detection software.

Why are these important?

The PerkinElmer news is another validation of the fact that software systems need to integrate laboratory operations and scientific data analysis in ways that go deeper than sending computing jobs to a server (links below). To remain competitive, service labs can no longer satisfy customer needs by simply delivering data. They must deliver a unit of information that is consistent with the experiment, or assay, being conducted by their clients. PerkinElmer recognizes this fact, along with our many other customers who participate in our partner program and work with our LIMS (GSLE) and Analysis (GSAE) systems to support their clients doing Sanger sequencing, Next Gen sequencing, or microarray analysis.

The SBIR news communicates our path toward making the units of information that go with whole genome sequencing, targeted sequencing, exome sequencing, allele analysis, and transcriptome sequencing as rich as possible. Through the project, we will add new applications to our analysis capabilities that address three fundamental problems: data quality analysis, information integration, and the user experience.

  • Through advanced data quality analysis, we can address the question: if we see a variant, can we determine whether the difference is random noise, a systematic error, or biological signal?
  • Information integration will help scientists and clinical researchers quickly add biological context to their data.
  • New visualization and annotation interfaces will help individuals explore the high-dimensional datasets resulting from the hundreds and thousands of samples needed to develop scientific and clinical insights.


Further Reading

Geospiza wins $1.2 M grant to add new DNA variant application to GeneSifter software.

Tuesday, February 8, 2011

AGBT 2011

More.

That's how I describe this year's conference.
  • More attendees
  • More data
  • More genomes
  • More instruments
  • More tweeters
  • More tweeting controversy
  • More software
  • More ...

Feel free to add more comments.

Friday, January 21, 2011

dbSNP, or is it?

dbSNP is NCBI’s catalog of DNA variation. While the SNP in the name implies a focus on Single Nucleotide Polymorphisms, dbSNP is far more comprehensive and includes length variants, mutations, and a plethora of annotations that characterize over 75 million variants from 89 organisms.

Previously, I discussed how the numbers of biological information repositories are growing each year. Nucleic Acids Research now tracks over 1300 databases that contain specialized subsets of DNA, RNA, protein sequences, and other kinds of biochemical data along with annotations (metadata) that can be mined or searched to aid our scientific research.

Most of these resources have descriptions and publications that provide high-level details about their mission, the kinds of data they store, and how the resource can be used. While useful, these descriptions are typically out of date because, like everything else in information science, each repository undergoes significant growth in both the data stored and the data's annotations. As we contemplate how to use these resources in integrated analyses, we need methods to summarize their content and extract data in global ways.

As an example, let's consider dbSNP. According to dbSNP's build history, the first release was on Dec 1, 1998. Build 2, nine days later, added 11 new SNPs from debnick's 981209.dat file. Now, dbSNP is 12 years old and is at build 132. The term build is a software way of saying version.

Several methods can be used to access dbSNP data. These include multiple web interfaces at NCBI and flat files that hold the entire dataset in XML, SQL table, and VCF formats. When first published in 1999 [1], dbSNP contained 4,713 variants and few annotations. As of build 132, the human-specific database contains either 30,443,455 (release notes) or 28,826,054 variants (VCF file). Why the discrepancy? Share your idea with a comment.

What else can we learn about human DNA variation from dbSNP’s VCF file?

The entire collection of human variants (minus 1,617,401) can be obtained in a single VCF file. VCF stands for Variant Call Format; it is a standard created by the 1000 Genomes Project [2] to list and annotate genotypes. The vCard format also uses the "vcf" file extension. Thus, genomic vcf files have cute business card icons on Macs and Windows. Bioinformatics is fun that way.

The dbSNP VCF file is a convenient way to get a global view of the database. Variants are listed by chromosome and position. Each position also lists the NCBI rs ID, the reference sequence, and the variant sequence(s). Chromosomes include the 22 autosomes, X, Y, MT (mitochondria), and PAR (pseudoautosomal regions [3]). The last column (INFO) is the most interesting. It can contain up to 49 specific annotations that describe a variant's type, its origin, biological features, population characteristics, and whether it is linked to other resources. Many of the annotations are simple tags, but some contain additional information. Between the different tags and the information within tags, there are over 50 ways to describe a variant. One of the most obfuscated annotations is a 12-byte bitfield, which must be decoded to be understood. No worry, much of its information appears to be repeated in the readable annotations. Others have also noted that the bitfield is overly clever.
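For readers who have never peeked inside a VCF file, a few lines of Python show the structure just described: tab-separated fixed columns followed by an INFO column of semicolon-separated flags and key=value annotations. The example record and its tags are illustrative stand-ins modeled on the dbSNP layout, not rows copied from the real file.

```python
# Minimal sketch: split a VCF data line into its fixed columns and break the
# INFO column into flag-style tags and key=value annotations.

def parse_vcf_line(line):
    chrom, pos, rsid, ref, alt, _qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    flags, values = set(), {}
    for field in info.split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            values[key] = value
        else:
            flags.add(field)
    return {
        "chrom": chrom, "pos": int(pos), "rsid": rsid,
        "ref": ref, "alts": alt.split(","),   # multiple alleles are comma-separated
        "flags": flags, "values": values,
    }

# Illustrative line modeled on the dbSNP VCF layout (not a real record).
example = "1\t12345\trs0000001\tA\tG,T\t.\t.\tdbSNPBuildNumber=132;VC=SNP"
record = parse_vcf_line(example)
print(record["rsid"], record["alts"], record["values"]["dbSNPBuildNumber"])
```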

Digging deeper

We can use the VCF file to learn a great deal about dbSNP and the biology of human variation. But you cannot do this by reading the file; it is over 30 million lines long! You're also not going to be able to analyze these data in Excel™, so you need to put them into some kind of database or binary file format. I used HDF5 and PyTables to build what is essentially a 30-million-row by 50+-column table that can be efficiently queried using simple Python scripts.
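The loading step might look something like the sketch below. It is not the script I used, just a minimal PyTables example that keeps a handful of columns (the real table carries 50+); the column set and the file names are assumptions.

```python
# Minimal sketch of loading a few dbSNP VCF columns into an HDF5 table with
# PyTables. The table described above carries 50+ columns; this keeps a handful.
import tables

class Variant(tables.IsDescription):
    chrom = tables.StringCol(8)     # chromosome name
    pos   = tables.UInt32Col()      # 1-based position
    rsid  = tables.StringCol(16)    # NCBI rs ID
    ref   = tables.StringCol(32)    # reference allele
    alt   = tables.StringCol(64)    # alternate allele(s), comma-separated
    build = tables.UInt16Col()      # dbSNPBuildNumber from the INFO column
    vc    = tables.StringCol(12)    # variant class (SNP, INDEL, ...)

def load_vcf(vcf_path, h5_path):
    with tables.open_file(h5_path, mode="w") as h5, open(vcf_path) as vcf:
        table = h5.create_table("/", "variants", Variant, "dbSNP variants")
        row = table.row
        for line in vcf:
            if line.startswith("#"):                    # skip header lines
                continue
            chrom, pos, rsid, ref, alt, _, _, info = line.rstrip("\n").split("\t")[:8]
            tags = dict(f.split("=", 1) for f in info.split(";") if "=" in f)
            row["chrom"], row["pos"] = chrom.encode(), int(pos)
            row["rsid"], row["ref"], row["alt"] = rsid.encode(), ref.encode(), alt.encode()
            row["build"] = int(tags.get("dbSNPBuildNumber", 0))
            row["vc"] = tags.get("VC", "").encode()
            row.append()
        table.flush()

# load_vcf("dbsnp_human.vcf", "dbsnp.h5")   # file names are placeholders
# Example in-kernel query: how many variants were added at build 132?
# with tables.open_file("dbsnp.h5") as h5:
#     print(sum(1 for _ in h5.root.variants.where("build == 132")))
```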

So, what can we learn? We can use the dbSNPBuildNumber annotation to observe dbSNP's growth. Cumulative growth shows that the recent builds have contributed the majority of variants. In particular, the 1000 Genomes Project's contributions account for close to 85% of the variants in dbSNP [2]. Examining the data by build number also shows that a build can consist of relatively few variants. There are even 369 variants assigned to a future build, 133.

In dbSNP, variants are described in four ways: SNP, INDEL, MIXED, and MULTI-BASE. The vast majority (82%) are SNPs, with INDELs (insertions and deletions) forming a large second class. When MIXED and MULTI-BASE variants are examined, they do not appear to look any different than INDELs, so I am not clear on the purpose of this tag. Perhaps it is there for bioinformatics enjoyment, because you have to find these terms in the 30 million lines; they are not listed in the VCF header. It was also interesting to learn that the longest indels are on the order of 1,000 bases.

The VCF file contains both reference and variant sequences at each position. The variant column can list multiple bases, or sequences, because each position can have multiple alleles. While nearly all variants have a single alternate allele, about 3% have between two and 11 alleles. I assume the variants with the highest numbers of alleles represent complex indel patterns.

Given that the haploid human genome contains approximately 3.1 billion bases, the variants in dbSNP represent nearly 1% of the available positions. Their distribution across the 22 autosomes, defined as the number of variant positions divided by the chromosome length, is fairly constant at around 1%. The X and Y chromosomes and their PARs are, however, much lower, with ratios of 0.45%, 0.12%, and 0.08%, respectively. A likely reason for this observation is that sampling of sex chromosomes in current sequencing projects is low relative to autosomes. The number of mitochondrial variants, as a fraction of genome length, is, on the other hand, much higher at 3.95%.
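With the data in a table, the summaries above (growth by build, density by chromosome) reduce to simple passes over the rows. The sketch below assumes the HDF5 file from the previous example; the chromosome lengths are approximate GRCh37 values and only a few are listed.

```python
# Sketch of the summaries described above, run against the HDF5 table built in
# the previous example. Chromosome lengths are approximate GRCh37 values.
from collections import Counter
import tables

CHROM_LENGTHS = {b"1": 249_250_621, b"21": 48_129_895, b"Y": 59_373_566}  # extend as needed

with tables.open_file("dbsnp.h5") as h5:
    per_build, per_chrom = Counter(), Counter()
    for r in h5.root.variants.iterrows():       # one pass over ~30 million rows
        per_build[int(r["build"])] += 1
        per_chrom[bytes(r["chrom"])] += 1

    # Cumulative growth: how many variants each dbSNP build contributed.
    running = 0
    for build in sorted(per_build):
        running += per_build[build]
        print(f"build {build:>3}: {per_build[build]:>11,} new  {running:>11,} total")

    # Variant density: positions with a known variant divided by chromosome length.
    for chrom, length in CHROM_LENGTHS.items():
        density = 100.0 * per_chrom.get(chrom, 0) / length
        print(f"chr{chrom.decode()}: {density:.2f}% of positions have a known variant")
```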

It is important to point out that the 1% level of variation is simply a count of all currently known positions where variation can occur in individuals. These data are compiled from many individuals. On a per-individual basis, the variation frequency is much lower, with each individual having between three and four million differences when their genome is compared to other people's genomes.

Clearly, from the original publications and current dbSNP reports, the database has grown significantly, is much richer, and is undergoing changes in ways that cannot be easily communicated through traditional publication methods. The challenge going forward is developing software that can adapt to these changes to evaluate resources and use their content in effective ways.

Next time I’ll discuss the annotations.

References and notes:

1. Sherry ST, Ward M, & Sirotkin K (1999). dbSNP-database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome research, 9 (8), 677-9 PMID: 10447503

2. 1000 Genomes Project (1KG) is an endeavor to characterize DNA sequence variation in humans. Visit the website at www.1000genomes.org or read the paper: 1000 Genomes Consortium. (2010) A map of human genome variation from population-scale sequencing. Nature, 467, 1061–1073.

3. PAR (PseudoAutosomal Regions) are located at the ends of the X and Y chromosomes. The DNA in these regions can recombine between the sex chromosomes; hence, PAR genes demonstrate autosomal inheritance patterns rather than sex-linked inheritance patterns. You can read more at http://en.wikipedia.org/wiki/Pseudoautosomal_region

Wednesday, January 5, 2011

Databases of databases

To kick off 2011, let's talk about databases. Our ability to collect ever-increasing amounts of data at faster rates is driving a corresponding increase in specialized databases that organize data and information. A recent editorial in the journal Database (yes, we now have a journal called Database) proposed a draft information specification for biological databases.

Why?

Because we derive knowledge by integrating information in novel ways, and we therefore need ways to make specialized information repositories interoperate.

Why do we need standards?

Because the current and growing collection of specialized databases is poorly characterized with respect to mission, categorization, and practical use.

The Nucleic Acids Research (NAR) database of databases illustrates the problem well. For the past 18 years, every January 1, NAR has published a database issue in which new databases are described along with others that have been updated. The issues typically contain between 120 and 180 articles, representing a fraction of the databases listed by NAR. When one counts the number of databases, explores their organization, and reads the accompanying editorial introductions, several interesting observations can be made.

First, over the years there has been a healthy growth of databases designed to capture many different kinds of specialized information. This year’s NAR database was updated to include 1330 databases, which are organized into 14 categories and 40 subcategories. Some categories like Metabolic and Signaling Pathways, or Organelle Databases look highly specific whereas others such as Plant Databases, or Protein Sequence Databases are general. Subcategories have similar issues. In some cases categories and subcategories make sense. In other cases it is not clear what the intent of categorization is. For example the category RNA sequence databases lists 73 databases that appear to be mostly rRNA and smallRNA databases. Within the list are a couple of RNA virus databases like the HIV Sequence Database and Subviral RNA Database. These are also listed under subcategory Viral Genome Databases in the Genomics Databases (non-vertebrates) category in an attempt to cross reference the database under different category. OK, I get that. HIV is in RNA sequences because it is an RNA virus. But what about the hepatitis, influenza, and other RNA viruses listed under Viral genome databases, why aren’t they in RNA sequences? While I’m picking on RNA sequences, how come all of the splicing databases are listed in a subcategory in Nucleotide Sequence Databases? Why isn’t RNA Sequences a subcategory under Nucleotide Sequence Databases? Isn’t RNA composed of Nucleotides? It makes one wonder how the databases are categorized.

Categorizing databases to make them easy to find, and to let users quickly grok their utility, is clearly challenging. The issue becomes more profound when the level of database redundancy, determined from the databases' names, is considered. This analysis is, of course, limited to names that can be understood. The Hollywood Database, for example, does not store Julia Roberts' DNA; rather, it is an exon annotation database. Fortunately, many databases are not named so cleverly. Going back to our RNA Sequence Databases category, we can find many ribosomal sequence databases, several tRNA databases, and general RNA sequence databases. There is even a cluster of eight microRNA databases, all starting with an "mi" or "miR" prefix. There are enough rice (18) and Arabidopsis (28) databases that they get their own subcategories. Without too much effort, one can see there are many competing resources, yet choosing the ones best suited to your needs would require a substantial investment of time to understand how these databases overlap, where they are unique, and, in many cases, how to navigate idiosyncratic interfaces and file formats to retrieve and utilize their information. When maintenance and overall information validity are factored in, the challenge compounds.

How do things get this way?

Evolution. Software systems, like biological systems, change over time. Databases, like organisms, evolve from common ancestors. In this way, new species are formed. Selective pressures enhance useful features, increase their redundancy, and cause extinctions. We can see these patterns in play for biological databases by examining the tables of contents and introductory editorials for the past 16 years of the NAR database issue. Interestingly, in the past six or seven years, the issue's editor has made a point of recording the issue's anniversary, making 2011 the 18th year of this issue. Yet easily accessible data can be obtained only back to 1996, 16 years ago. History is hard and incomplete, just like any evolutionary record.

We cannot discuss database diversity without trying to count the number of species. This is hard too. NAR is an easily accessible site with a list that can be counted. However, one of the databases in the 1999 and 2000 NAR database issues was DBcat, a database of databases. At that time it tracked more than 400 and 500 databases, while NAR tracked 103 and 227 databases, respectively, a fairly large discrepancy. DBcat eventually went extinct due to lack of funding, a very significant selective pressure. Speaking of selective pressure, it would be interesting to understand how databases are chosen for inclusion in NAR's list. Articles are submitted, and presumably reviewed, so there is self-selection and some peer review. The total number of new databases since 1996 is 1,308, which is close to 1,330, so the current list is likely an accumulation of database entries submitted over the years. From the editorial comments, the selection process is getting more selective, as statements have changed from databases that might be useful (2007) to "carefully selected" (2010, 2011). Back in 2006, 2007, and 2008 there was even mention of databases being dropped due to obsolescence.

Where do we go from here?

From the current data, one would expect 2011 to see between 60 and 120 new databases appear. Groups like BioDBcore and other committees are trying to encourage standardization with respect to a database's metadata (name, URL, contacts, when established, data stored, standards used, and much more). This may be helpful, but when I read the list of proposed standards, I am less optimistic, because the standards do not address the hard issues. Why, for example, do we need 18 different rice databases or eight "mi*" databases? For that matter, why do we have a journal called Database? NAR does a better job of listing databases than Database does, and for a journal about databases, wouldn't it be a good idea if Database tracked databases? Critical information, like usage, last updated, citations, and user comments, which would help one more easily evaluate whether to invest time investigating a resource, is also missing.

Perhaps we should think about better ways to communicate a database's value to the community it serves, and in cases where several databases do similar things, standards committees should discuss how to bring the individual databases together to share their common features and accentuate their unique qualities in support of research endeavors. After all, no amount of annotation is useful if I still have too many choices to sort through.

Wednesday, November 3, 2010

Samples to Knowledge

Today Geospiza and Ingenuity announced a collaboration to integrate our respective GeneSifter Analysis Edition (GSAE) and Ingenuity Pathway Analysis (IPA) software systems. 

Why is this important?

Geospiza has always been committed to providing our customers with the most complete software systems for genetic analysis. Our LIMS, GeneSifter Laboratory Edition (GSLE), and GSAE work together to form a comprehensive samples-to-results platform. From core labs, to individual research groups, to large-scale sequencing centers, GSLE is used for collecting sample information, tracking sample processing, and organizing the resulting DNA sequences, microarray files, and other data. Advanced quality reports keep projects on track and within budget.

For many years, GSAE has provided a robust and scalable way to scientifically analyze the data collected for many samples. Complex datasets are reduced and normalized to produce quantitative values that can be compared between samples and within groups of samples. Additionally, GSAE has integrated open-source tools like Gene Ontologies and KEGG pathways to explore the biology associated with lists of differentially expressed genes. In the case of Next Generation Sequencing, GSAE has had the most comprehensive and integrated support for the entire data analysis workflow from basic quality assessment to sequence alignment and comparative analysis. 

With Ingenuity we will be able to take data-driven biology exploration to a whole new level.  The IPA system is a leading platform for discovering pathways and finding the relevant literature associated with genes and lists of genes that show differential expression in microarray analysis. Ingenuity's approach focuses on combining software curation with expert review to create a state-of-the-art system that gets scientists to actionable information more quickly than conventional methods.  

Through this collaboration, two leading companies will be working together to extend their support for NGS applications. GeneSifter's pathway analysis capabilities will increase, and IPA's support will extend to NGS. Our customers will benefit by having access to the most advanced tools for turning vast amounts of data into biologically meaningful results to derive new knowledge.

Samples to Results™ becomes Samples to Knowledge™

Thursday, October 28, 2010

Bloginar: Making cancer transcriptome sequencing assays practical for the research and clinical scientist

A few weeks back, we (Geospiza and Mayo Clinic) presented a research poster at BioMed Central's Beyond the Genome conference. The objective was to present GeneSifter's analysis capabilities and discuss the practical issues scientists face when using Next Generation DNA Sequencing (NGS) technologies to conduct clinically oriented research related to human health and disease.

Abstract
NGS technologies are increasingly appealing for studying cancer. Fully characterizing the more than 10,000 types and subtypes of cancer to develop biomarkers that can be used to clinically define tumors and target specific treatments requires large studies that examine specific tumors in thousands of patients. This goal will fail without significantly reducing both data production and analysis costs so that the vast majority of cancer biologists and clinicians can conduct NGS assays and analyze their data in routine ways.

While sequencing is now inexpensive enough for small groups and individuals beyond genome centers to conduct the needed studies, the current data analysis methods need to move from large bioinformatics-team approaches to automated methods that employ established tools in scalable and adaptable systems, provide standard reports, and make results available for interactive exploration by biologists and clinicians. Mature software systems and cloud computing strategies can achieve this goal.

Poster Layout
Excluding the title, the poster has five major sections. The first section includes the abstract (above) and the study parameters. In this work, we examined the RNA from 24 head and neck cancer biopsies, tumor and normal cells from 12 individuals.

The remaining sections (2-5) provide a background on NGS challenges and applications, high-level data analysis workflows, the analysis pipeline used in the work, the comparative analyses that need to be conducted, and practical considerations for groups seeking to do similar work. Much of section 2 has been covered in previous blogs and research papers.

Section 3: Secondary Analysis Explores Single Samples
NGS challenges are best known for the amount of data produced by the instruments. While this challenge should not be undervalued, it is over-discussed. A far greater challenge lies in the complexity of data analysis. Once the first step (primary analysis, or basecalling) is complete, the resulting millions of reads must be aligned to several collections of reference sequences. For human RNA samples, these include the human genome, splice junction databases, and others used to measure biological processes and filter out reads arising from sample preparation artifacts. Aligned data are further processed to create tables that annotate individual reads and compute quantitative values describing how the sample's reads align to (or cover) regions of the genome or span exon boundaries. If the assay measures sequence variation, alignments must be further processed to create variant tables.
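As a rough illustration of this single-sample bookkeeping (a production system like GeneSifter automates it at much larger scale), the sketch below tallies where one sample's reads land using pysam; the library choice and the BAM file name are assumptions for the example, not details from the poster.

```python
# Rough illustration of single-sample secondary-analysis bookkeeping: tally
# where one sample's reads land. "sample.bam" is a placeholder path, and pysam
# is simply one convenient way to read alignments.
from collections import Counter
import pysam

def summarize_alignments(bam_path):
    counts = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_secondary or read.is_supplementary:
                continue                      # count each read once
            if read.is_unmapped:
                counts["unaligned"] += 1
            else:
                counts[read.reference_name] += 1
    return counts

if __name__ == "__main__":
    for reference, n in summarize_alignments("sample.bam").most_common(20):
        print(f"{reference}\t{n}")
```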

Secondary analysis produces a collection of data in forms that can be immediately examined to understand overall sample quality and characteristics. High-level summaries indicate how many reads align to things we are interested in and things we are not. In GeneSifter, these summaries are linked to reports that show additional detail. Gene List reports, for example, show how the sample reads align within a gene's boundaries. Pictures in these reports are linked to GeneSifter's Gene Viewer reports, which provide even greater detail about each read's alignment orientation and observed variation.

An important point about secondary analysis, however, is that it focuses on single samples. As more samples are added to a project, the data from each sample must be processed through an assay-specific pipeline. This point is often missed in the NGS data analysis discussion. Moreover, systems supporting this work must not only automate hundreds of secondary analysis steps, they must also provide tools to organize the input and output data in project-based ways for comparative analysis.

Section 4: Tertiary Analysis in GeneSifter Compares Data Between Samples
The science in NGS happens when data are compared between samples in statistically rigorous ways. RNA sequencing makes it possible to compare gene expression, exon expression, and sequence variation between samples to identify differentially expressed genes and their isoforms, and to determine whether certain alleles are differentially expressed. Additional insights are gained when gene lists can be examined in pathways and by ontologies. GeneSifter performs these activities in a user-friendly web environment.

The poster's examples show how gene expression can be globally analyzed for all 24 samples, how a splicing index can distinguish gene isoforms occurring in tumor but not normal cells, and how sequence variation can be viewed across all samples. Principal component analysis shows that genes in tumor cells are differentially expressed relative to normal cells. Genes highly expressed in tumor cells include those related to the cell cycle and other pathways associated with unregulated cell growth. While these observations are not novel, they do confirm our expectations about the samples, and being able to make such observations with just a few clicks prevents working on costly, misleading observations. For genes showing differential exon expression, GeneSifter provides ways to identify those genes and navigate to the alignment details. Similarly, reports that show differential variation between samples can be filtered by multiple criteria and link to additional annotation details and read alignments.
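The poster does not spell out GeneSifter's exact splicing calculation, but one common formulation of a splicing index normalizes each exon's expression to its gene's overall expression in each condition and compares the two ratios; the sketch below, with made-up counts, shows how a skipped exon stands out.

```python
# One common formulation of a splicing index (an assumption; the poster does
# not specify GeneSifter's exact calculation): normalize exon counts to gene
# counts in each condition, then take the log2 ratio of the normalized values.
import math

def splicing_index(exon_tumor, gene_tumor, exon_normal, gene_normal, pseudocount=1.0):
    """log2((exon/gene in tumor) / (exon/gene in normal)); near 0 means no isoform shift."""
    tumor_ratio = (exon_tumor + pseudocount) / (gene_tumor + pseudocount)
    normal_ratio = (exon_normal + pseudocount) / (gene_normal + pseudocount)
    return math.log2(tumor_ratio / normal_ratio)

# Made-up counts for a skipped exon: the gene's expression doubles in tumor
# cells but reads covering this exon do not keep pace.
print(round(splicing_index(exon_tumor=50, gene_tumor=2000,
                           exon_normal=50, gene_normal=1000), 2))   # about -1.0
```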

Section 5: Practical Considerations
Complete NGS data analysis systems seamlessly integrate secondary and tertiary analysis. Presently, no other systems are as complete as GeneSifter. There are several reasons why this is the case. First, a significant amount of software must be produced and tested to create such a system. From complex data processing automation, to advanced data queries, to user interfaces that provide interactive visualizations and easy data access, to security, these systems must employ advanced technologies and take years to develop with experienced teams. Second, meeting NGS data processing requirements demands that computer systems be designed with distributable architectures that can support cloud environments in local and hosted configurations. Finally, scientific data systems must support both predefined and ad hoc query capabilities. The scale of NGS applications means that non-traditional approaches must be used to develop data persistence layers that support a variety of data access methods, and, for bioinformatics, this is a new problem.

Because Geospiza has been doing this kind of work for over a decade and could see the coming challenges, we've focused our research and development in the right ways to deliver a feature-rich product that truly enables researchers to do high-quality science with NGS.

Enjoy the poster.

Wednesday, September 29, 2010

A Genomics Genealogy

Deep sequencing technologies have radically changed how we study biology. Deciding what technology and software to use can be daunting. Choices become easier when the relationships between different DNA sequencing applications are understood.


A brief history 

DNA sequencing grew from our desire to understand how the instructions for the biochemistry of life are encoded in an organism’s DNA. If we know the precise ordering and organization of an organism’s DNA sequence, we can presumably unlock a code that reveals these instructions. Accomplishing this goal required the creation of a new field, molecular biology, and new technologies to sequence genes.

The first sequencing methods were arduous. They combined nuclease digestion with thin-layer chromatography to measure di- and trinucleotides that could be puzzled together. Later, Maxam and Gilbert replaced enzymatic DNA degradation with a chemical fragmentation method that enabled the reading of ordered bases from 32P-labeled fragments separated by electrophoresis.

The Sanger method, which used dideoxynucleotide triphosphates to create ensembles of DNA molecules terminated at each base, soon replaced Maxam-Gilbert sequencing. The next innovation was to color-code DNA with fluorescent dyes so that molecules could be interrogated with a laser and camera coupled to a computer. This innovation automated "high-throughput" DNA sequencing systems, initially with polyacrylamide gels and later with capillary electrophoresis, and made it possible to sequence the human and other genomes. It also created the first transcriptome analysis method, expressed sequence tag (EST) sequencing.

Despite 20 years of advances, however, the high-throughput sequencing methods were not high-enough-throughput to realistically interrogate DNA and RNA molecules in creative ways. Big questions (genomes, ESTs, meta-genomes) required large, factory-like approaches to automate sample preparation and collect sequences because a fundamental problem had yet to be solved: each sequence was obtained from an individual, purified DNA clone or PCR product.

Real high-throughput is massively parallel throughput 

The next-generation DNA sequencing (NGS) technologies free researchers from the need to clone or purify every molecule. They all share a common innovation: DNA sequencing is performed in a massively parallel format. That is, a library, or ensemble of millions of DNA molecules, is sequenced simultaneously. Data collection costs are dramatically decreased through miniaturization and by eliminating the need for warehouses of colony pickers, prep robots, sequencing instruments, and large teams of people.

The new problems are dealing with the data that are produced and the increasing cost of computation. As NGS opens new possibilities to measure DNA and RNA in novel ways, each application requires a specific laboratory procedure that must be coupled to a specific analysis methodology.

Sequencing genealogy is defined by the questions 

In an evolutionary model, the history of cloning, restriction site mapping, and Sanger sequencing forms the trunk of the genomics application tree (top figure), from which branches develop as new applications emerge.

NGS has driven the evolution of three main sequencing branches: De Novo, Functional Genomics, and Variation Assays. The De Novo, or exploratory, branch contains three subbranches: new genomes (projects that seek to determine the complete genome sequence of an organism), meta-genomes (projects in which DNA fragments are sequenced from environmental samples), and meta-transcriptomes (projects in which cDNA fragments are sequenced from environmental samples).


The Functional Genomics branch is growing fast. In these experiments, different collections of RNA or DNA molecules from an organism, tissue, or cells are isolated and sequenced to measure gene expression and how it is regulated. Three subbranches describe the different kinds of functional genomics: Expression, Regulation, and EpiGenomics. Each of these subbranches can be further divided into specific assay groups (DGE, RNA-Seq, small RNA, etc.) that can be further subdivided into specialized procedures (e.g., RNA-Seq with strandedness preserved) defined by laboratory protocols, kits, and instruments. When the experiments are refined and made reproducible, they become assays.

Variation Assays form the third main branch of the tree. Genomic sequences are compared within and between populations to link genotype and phenotype. In special cases like cancer and immunology research, variation assays are used to observe changes within an organism's somatic genomes over time. Today, variation, or resequencing, assays measure single-nucleotide changes and small insertions and deletions in whole genomes and exomes. If linked sequencing strategies (mate-pairs, paired-ends) are used, larger structural changes, including copy number variations, can also be measured.

Why is this important?

As a software provider with both deep lab and analysis experience, we [Geospiza] are often asked which instrument platform is best or how our software stacks up against other available options. The answer, of course, depends on what you want to do. De Novo applications benefit from the long reads offered by platforms like 454. Many of the assay-based applications demand ultra-deep sequencing with very high numbers of sequences (reads), as provided by the short-read platforms (Illumina, SOLiD). New single-molecule sequencing platforms like PacBio's target a wide range of applications but have so far been best demonstrated for long-read uses and novel methylation assays.

From an informatics perspective, the exploratory and assay-based branches have distinct software requirements. Exploratory applications require that reads be assembled into contigs, which must be further ordered into scaffolds to reach a complete sequence. In meta-genomics or meta-transcriptomics applications, data are assembled to obtain gene sequences. These projects are further complicated by orthologous and paralogous sequences and by highly expressed genes that over-represent certain sequences. In these situations, specialized hardware or complex data reduction strategies are needed to make assembly practical. Once data are assembled, they are functionally annotated in a second computational phase using tools like BLAST.

Assay-based data analysis also has two distinct phases, but they differ significantly from those of De Novo sequencing. The first phase involves aligning (or mapping) reads to reference data sources and then reducing the aligned data into quantitative values, as sketched below. At least one reference is required, and the better it is annotated, the more informative the initial results will be. Alignment differs from assembly in that reads are compared individually to a reference rather than amongst themselves. Alignment processing capacity can be easily scaled with multiple inexpensive computers, whereas assembly processing cannot.
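
A minimal sketch of this first phase, reducing one sample's alignments to per-gene read counts with pysam, follows; the BAM file, the four-column BED annotation, and the sample name are hypothetical. Because each sample (and each gene interval) is processed independently, this step parallelizes easily across inexpensive machines.

```python
# Reduce one sample's alignments to per-gene read counts.
# "tumor_01.bam" and the 4-column "genes.bed" are hypothetical placeholders.
import csv
import pysam

def count_per_gene(bam_path, gene_bed_path):
    counts = {}
    bam = pysam.AlignmentFile(bam_path, "rb")   # sorted, indexed BAM assumed
    with open(gene_bed_path) as bed:
        for row in csv.reader(bed, delimiter="\t"):
            chrom, start, end, name = row[0], int(row[1]), int(row[2]), row[3]
            counts[name] = bam.count(chrom, start, end)
    return counts

counts = count_per_gene("tumor_01.bam", "genes.bed")
for gene, n in sorted(counts.items())[:10]:
    print(gene, n)
```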

The second phase of assay-based sequencing is to produce a discrete output as defined by a diagnostic application, or to compare the quantitative values computed from the alignments of several samples, obtained from different individuals and/or treatments, relative to controls. This phase requires statistical tools to normalize data, filter false positives and negatives, and measure differences. Assay-based applications become more informative when large numbers of samples and replicates are included in a study.
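
The sketch below illustrates this second phase in its simplest form: counts-per-million normalization, per-gene log2 fold changes, and a t-test between two groups. The matrix layout (genes by 24 samples, tumor columns first) is assumed for the example, and a real analysis would use count-aware models and multiple-testing correction.

```python
# Simple comparative sketch: normalize counts, then test each gene between
# two groups. "gene_counts.tsv" and the column layout are assumptions.
import numpy as np
from scipy import stats

counts = np.loadtxt("gene_counts.tsv", delimiter="\t")   # genes x 24 samples
cpm = counts / counts.sum(axis=0) * 1e6                  # library-size normalization
log_cpm = np.log2(cpm + 1.0)

tumor, normal = log_cpm[:, :12], log_cpm[:, 12:]
t_stat, p_val = stats.ttest_ind(tumor, normal, axis=1)
log2_fc = tumor.mean(axis=1) - normal.mean(axis=1)

# Report the genes with the strongest evidence of differential expression.
for idx in np.argsort(p_val)[:10]:
    print(f"gene {idx}: log2FC={log2_fc[idx]:+.2f}  p={p_val[idx]:.2e}")
```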

Connecting the dots 

While the sequencing applications can be grouped and summarized in different ways, they are also interrelated. For example, De Novo projects are open-ended and exploratory, but their end product, a well-annotated reference sequence, is the foundation for Functional Genomics and Variation applications. Variation analysis is only useful if we can assign function to specific genotypes. Functional assignments come, in part, from previous experiments and genomic annotations, but are increasingly being produced by sequencing assays, so the new challenge is integrating data obtained from different assays into coherent datasets that can link many attributes to a set of genotypes.

NGS clearly opens new possibilities for studying and characterizing biological systems. Different applications require different sequencing platforms, laboratory procedures, and software systems that can organize analysis tools and automate data processing. On this last point, as researchers evaluate their projects and their options for success, they need to identify informatics groups with deep experience, available solutions, and strong capabilities to meet the next challenges. Geospiza is one such group.

Further Reading

DNA Sequencing History

Gilbert W, Maxam A (1973) The nucleotide sequence of the lac operator. Proc Natl Acad Sci U S A 70:3581

Maxam AM, Gilbert W (1977) A new method for sequencing DNA. Proc Natl Acad Sci U S A 74:560

Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74:5463-7

Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SB, Hood LE (1986) Fluorescence detection in automated DNA sequence analysis. Nature 321:674-9

Adams MD, Soares MB, Kerlavage AR, Fields C, Venter JC (1993) Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4:373-80

International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860-921

Venter JC, Adams MD, Myers EW, et al. (2001) The sequence of the human genome. Science 291:1304-1351

FinchTalks