
Saturday, February 9, 2013

Genomics Genealogy Evolves

The ways massively parallel DNA sequencing can be used to measure biological systems are limited only by imagination. In science, imagination is an abundant resource.

The November 2012 edition of Nature Biotechnology (NBT) focused on advances in DNA sequencing. It included a review by Jay Shendure and Erez Lieberman Aiden entitled “The Expanding Scope of DNA Sequencing” [1], in which the authors provided a great overview of current and future sequencing-based assay methods with an interesting technical twist. It also made for an opportunity to update a previous FinchTalk.

As DNA sequencing moved from determining the order of nucleotide bases in single genes to the factory-style efforts of the first genomes, it was limited to measuring ensembles of molecules derived from single clones or PCR amplicons as composite sequences. Massively parallel sequencing changed the game because each molecule in a sample is sequenced independently. This discontinuous advance resulted in a massive increase in throughput that created a brief, yet significant, deviation from the price-performance curve predicted by Moore's law. It also created a level of resolution that makes it possible to collect data from populations of sequences and quantify how they vary, turning DNA sequencing into a powerful assay platform. While this was quickly recognized [2], reducing ideas to practice would take a few more years.

Sequencing applications fall into three main branches: De Novo, Functional Genomics, and Genetics (figure below). The De Novo, or Exploratory, branch contains three subbranches: new genomes, meta-genomes, and meta-transcriptomes. Genetics, or variation, assays form another main branch of the tree. Genomic sequences are compared within and between populations, individuals, or tissues and cells with the goal of predicting phenotype from differences between sequences. Genetic assays can focus on single nucleotide variations, copy number changes, or structural differences. Determining inherited epigenetic modifications is another form of genetic assay.

Understanding the relationship between genotype and phenotype, however, requires that we understand phenotype in sufficient detail. For this to happen, traditional analog measurements such as height, weight, blood pressure, and disease descriptions need to be replaced with quantitative measurements at the DNA, RNA, protein, metabolic, and other levels. Within each set of “omes” we need to understand molecular interactions and how environmental factors such as diet, chemicals, and microorganisms affect these interactions, positively or negatively, including through modification of the epigenome. Hence, the Functional Genomics branch is the fastest growing.

New assays since 2010 are highlighted in color and underlined text.  See [1] for descriptions.
Functional Genomics experiments can be classified into five groups: Regulation, Epi-genomics, Expression, Deep Protein Mutagenesis, and Gene Disruption. Each group can be further divided into specific assay groups (DGE, RNA-Seq, small RNA, etc.) that can be even further subdivided into specialized procedures (RNA-Seq with strandedness preserved). When experiments are refined and made reproducible, they become assays with sequence-based readouts.

In the paper, Shendure and Aiden describe 24 different assays. Citing an analogy to language, where "Wilhelm von Humboldt described language as a system that makes ‘infinite use of finite means’: despite a relatively small number of words and combinatorial rules, it is possible to express an infinite range of ideas," the authors presented assay evolution as an assemblage of a small number of experimental designs. This model is not limited to language. In biochemistry a small number of protein domains and effector molecules are combined, and slightly modified, in different ways to create a diverse array of enzymes, receptors, transcription factors, and signaling cascades.

Subway map from [1]*. 
Shendure and Aiden go on to show how the technical domains can be combined to form new kinds of assays using a subway framework, where one enters via a general approach (comparison, perturbation, or variation) and reaches the final sequencing destination. Stations along the way are specific techniques that are organized by experimental motifs including cell extraction, nucleic acid extraction, indirect targeting, exploiting proximity, biochemical transformation, and direct DNA or RNA targeting.

The review focused on the bench and made only brief reference to the informatics issues as part of the "rate limiters" of next-generation sequencing experiments.  It is important to note that each assay will have its own data analysis methodology. That may seem daunting. However, like the assays, the specialized informatics pipelines and other analyses can also be developed from a common set of building blocks. At Geospiza we are very familiar with these building blocks and how they can be assembled to analyze the data from many kinds of assays. As a result, the GeneSifter system is the most comprehensive in terms of its capabilities to support a large matrix of assays, analytical procedures, and species.  If you are considering adding next-generation sequencing to your research or your current informatics is limiting your ability to publish, check out GeneSifter.

1. Shendure, J., and Aiden, E. (2012). The expanding scope of DNA sequencing. Nature Biotechnology, 30(11), 1084-1094. DOI: 10.1038/nbt.2421

2. Kahvejian, A., Quackenbush, J., and Thompson, J.F. (2008). What would you do if you could sequence everything? Nature Biotechnology, 26(10), 1125-1133. PMID: 18846086

* Rights obtained from Rightslink number 3084971224414

Tuesday, December 4, 2012

Commonly Rare

Rare is the new common. The final month of the year is always a good time to review progress and think about what's next.  In genetics, massively parallel next generation sequencing (NGS) technologies have been a dominating theme, and for good reason.

Unlike the previous high-throughput genetic analysis technologies (Sanger sequencing and microarrays), NGS allows us to explore genomes far more deeply and to measure functional elements and gene expression globally.

What have we learned?

Distribution of rare and common variants. From [1] 
The ENCODE project has produced a picture in which a much greater fraction of the genome may be involved in some functional role than previously understood [1]. However, a larger theme has been related to observing rare variation and trying to understand its impact on human health and disease. Because the enzymes that replicate DNA and correct errors are not perfect, each time a genome is copied a small number of mutations are introduced, on average between 35 and 80. Since sperm are continuously produced, fathers contribute more mutations than mothers, and the number of new mutations increases with the father's age [2]. While the number per child, relative to the father's contributed three-billion-base genome, is tiny, rare diseases and intellectual disorders can result.

A consequence is that the exponentially growing human population has accumulated a very large number of rare genetic variants [3]. Many of these variants can be predicted to affect phenotype, and many more may modify phenotypes in yet unknown ways [4,5]. We are also learning that variants generally fall into two categories: they are either common to all populations or confined to specific populations (figure). More importantly, for a given gene the number of rare variants can vastly outnumber the previously known common variants.

Another consequence of the high abundance of rare variation is how it impacts the resources that are used to measure variation and map disease to genotypes. For example, microarrays, which have been the primary tool of genome-wide association studies, utilize probes developed from a human reference genome sequence. When rare variants are factored in, many probes have several issues, ranging from "hidden" variation within a probe to a probe simply not being able to measure a variant that is present. Linkage block size is also affected [6]. What this means is that the best arrays going forward will be tuned to specific populations. It also means we need to devote more energy to developing refined reference resources, because the current tools do not adequately account for human diversity [6,7].
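To make the probe problem concrete, here is a minimal Python sketch of the kind of check involved: given probe coordinates and a set of known variant positions, flag probes whose footprint contains additional, "hidden" variation. All coordinates, probe names, and variant positions are invented for illustration.

```python
# Illustrative only: flag array probes whose genomic footprint contains a known
# variant other than the one they target. All coordinates and names are invented.

probes = {
    "rs0001_probe": ("chr1", 1_000_000, 1_000_025),   # (chromosome, start, end)
    "rs0002_probe": ("chr1", 2_500_010, 2_500_035),
}

# Variant positions reported by a resequencing project (hypothetical values).
known_variants = {("chr1", 1_000_012), ("chr2", 3_400_200)}

def confounded_probes(probes, variants):
    """Return IDs of probes that overlap a reported variant position."""
    flagged = []
    for probe_id, (chrom, start, end) in probes.items():
        if any(c == chrom and start <= pos <= end for c, pos in variants):
            flagged.append(probe_id)
    return flagged

print(confounded_probes(probes, known_variants))   # ['rs0001_probe']
```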

What's next?

Rare genetic variation has been understood for some time. What's new is understanding just how extensive these variants are in the human population, a result of the population's recent rapid expansion under very little selective pressure. Hence, linking variation to health and disease is the next big challenge and the cornerstone of personalized medicine, or, as some prefer, precision medicine. Conquering this challenge will require detailed descriptions of phenotypes, in many cases at the molecular level. As the vast majority of variants, benign or pathogenic, lie outside of coding regions, we will need to deeply understand how those functional elements, as initially defined by ENCODE, are affected by rare variation. We will also need to layer in epigenetic modifications.

For the next several years the picture will be complex.

References:

[1] 1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422), 56-65. PMID: 23128226

[2] Kong, A., et al. (2012). Rate of de novo mutations and the importance of father's age to disease risk. Nature, 488(7412), 471-475. DOI: 10.1038/nature11396

[3] Keinan, A., and Clark, A. (2012). Recent Explosive Human Population Growth Has Resulted in an Excess of Rare Genetic Variants. Science, 336(6082), 740-743. DOI: 10.1126/science.1217283

[4] Tennessen, J., et al. (2012). Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes. Science, 337(6090), 64-69. DOI: 10.1126/science.1219240

[5] Nelson, M., et al. (2012). An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People. Science, 337(6090), 100-104. DOI: 10.1126/science.1217876

[6] Rosenfeld, J.A., Mason, C.E., and Smith, T.M. (2012). Limitations of the human reference genome for personalized genomics. PLoS ONE, 7(7). PMID: 22811759

[7] Smith, T.M., and Porter, S.G. (2012). Genomic Inequality. The Scientist.



Thursday, July 12, 2012

Resources for Personalized Medicine Need Work

Yesterday (July 11, 2012), PLoS ONE published an article prepared by my colleagues and me entitled "Limitations of the Human Reference Genome for Personalized Genomics."

This work, supported by Geospiza's SBIR targeting ways to improve mutation detection and annotation, explored some of the resources and assumptions that are used to measure and understand sequence variation. As we know, a key deliverable of the human genome project was to produce a high-quality reference sequence that could be used to annotate genes, develop research tools like genotyping and microarray assays, and provide insights to guide software development. Projects like HapMap used these resources to provide additional understanding of genetic linkage in populations.

Decreasing sequencing costs
Since those early projects, DNA sequencing costs have plummeted. As a result, endeavors such as the 1000 Genomes Project (1KGP) and public contributions from Complete Genomics (CG) have dramatically increased the number of known sequence variants. A question worth asking is how these new data affect the utility of the resources and assumptions that have guided genomics and genetics for the past six or seven years.

Number of variants by dbSNP build
To address the above question, we evaluated several assay and software tools that were based on the human genome reference sequence in the context of new data contributed by 1KGP and CG. We found a high frequency of confounding issues with microarrays, and many cases where invalid assumptions, encoded in bioinformatics programs, underestimate variability or possibly misidentify the functional effects of mutations. For example, 34% of published array-based GWAS studies for a variety of diseases utilize probes that contain undocumented variation or map to regions of previously unknown structural variation. Similarly, estimates of linkage disequilibrium block size decrease as the number of known variants increases.

The significance of this work is that it documents what many are anecdotally experiencing. As we continue to learn about the contributing role of rare variation in human disease we need to fully understand how current resources can be used and work to resolve discrepancies in order to create an era of personalized medicine.

Rosenfeld, J.A., Mason, C.E., and Smith, T.M. (2012). Limitations of the Human Reference Genome for Personalized Genomics. PLoS ONE, 7(7). DOI: 10.1371/journal.pone.0040294

Tuesday, February 14, 2012

Sneak Peek: Poster Presentations at AGBT

The annual Advances in Genome Biology and Technology (AGBT) meeting begins tomorrow and would not be complete without a couple of contributions by @finchtalk.

Follow the tweets at #AGBT and if you are at the conference visit posters 334 and 335 (abstracts below). Also, visit Lanai 189 to see the latest advances in genome technology and software from the Caliper and Geospiza organizations within PerkinElmer. 

Poster Abstracts

Poster 335: Why is the $1000 Genome so Expensive? 

Rapid advances in sequencing technology are enabling leading institutions to establish programs for genomics-based medicine. Some estimate that 5000 genomes were sequenced during 2011, and an additional 30,000 will be sequenced by the end of 2012. Despite this terrific progress, the infrastructure required to make genomics-based medicine a norm, rather than a specialized application, is lacking. Although DNA sequencing costs are decreasing, sample preparation bottlenecks and data handling costs are increasing. In many instances, the resources (e.g., time, capital investment, experience) required to conduct medically oriented sequencing effectively are prohibitive.

We describe a model system that uses a variety of PerkinElmer products to address three problems that continue to impact the widescale adoption of genomics-based medicine: organizing and tracking sample information, sample preparation, and whole genome data analysis. Specifically, PerkinElmer’s GeneSifter® LIMS and analysis software, Caliper instrumentation, and DNA sequencing services can provide independent or integrated solutions for generating and processing data from whole-genome sequencing.


Poster 334: Limitations of the Human Reference Genome Sequence

The human genome reference sequence is well characterized, highly annotated, and its development represents a considerable investment of time and money. This sequence is the foundation for genotyping microarrays and DNA sequencing analysis. Yet, in several critical aspects the reference sequence remains incomplete, as are the many research tools that are based on it. We have found that, when new variation data from the 1000 Genomes Project (1Kg) and Complete Genomics (CG) are used to measure the effectiveness of existing tools and concepts, approximately 50% of probes on commonly used genotyping arrays contain confounding variation, impacting the results of 37% of GWAS studies to date. The sources of confounding variation include unknown variants in close proximity to the probed variant and alleles previously assumed to be di-allelic that are poly-allelic. When mean linkage disequilibrium (LD) lengths from HapMap are compared to 1Kg data, LD decreases from 16.4 Kb to 7.0 Kb within common samples and further decreases to 5.4 Kb when random samples are compared.

While many of the observations have been anecdotally understood, quantitative assessments of resources based on the reference sequence have been lacking. These findings have implications for the study of human variation and medical genetics, and ameliorating these discrepancies will be essential for ushering in the era of personalized medicine.

Wednesday, February 16, 2011

Sneak Peek: ABRF and Software Systems for Clinical Research

The Association of Biomolecular Resource Facilities (ABRF) conference begins this weekend (2/19) with workshops on Saturday and sessions Sunday through Tuesday. This year's theme is Technologies to Enable Personalized Medicine, and, appropriately, a team from Geospiza will be there at our booth and participating in scientific sessions.

I will be presenting a poster entitled "Clinical Systems for Cancer Research" (abstract below). In addition to great science and technology, ABRF has a large number of tweeting participants, including @finchtalk. You can follow along using the #ABRF and/or #ABRF2011 hashtags.

Abstract
By the end of 2011 we will likely know the DNA sequences of 30,000 human genomes. However, to truly understand how the variation between these genomes affects phenotype at a molecular level, future research projects need to analyze these genomes in conjunction with data from multiple ultra-high-throughput assays obtained from large sample populations. In cancer research, for example, studies that examine thousands of specific tumors in thousands of patients are needed to fully characterize the more than 10,000 types and subtypes of cancer and develop diagnostic biomarkers. These studies will use high-throughput DNA sequencing to characterize tumor genomes and their transcriptomes. Sequencing results will be validated with non-sequencing technologies, and putative biomarkers will be examined in large populations using rapid targeted assay approaches.

Geospiza is transforming the above scenario from vision into reality in several ways. The Company’s GeneSifter platform utilizes scalable data management technologies based on open-source HDF5 and BioHDF technologies to capture, integrate, and mine raw data and analysis results from DNA, RNA, and other high-throughput assays. Analysis results are integrated and linked to multiple repositories of information that include variation, expression, pathway, and ontology databases to enable the discovery process and support verification assays. Using this platform with RNA sequencing and genomic DNA sequencing from matched tumor/normal samples, we were able to characterize differential gene expression, differential splicing, allele-specific expression, RNA editing, somatic mutations, and genomic rearrangements, as well as validate these observations in a set of patients with oral and other cancers.

By: Todd Smith (1), N. Eric Olson (1), Rebecca Laborde (3), Christopher E. Mason (2), David Smith (3): (1) Geospiza, Inc., Seattle, WA. (2) Weill Cornell Medical College, New York, NY. (3) Mayo Clinic, Rochester, MN.

Tuesday, February 8, 2011

AGBT 2011

More.

That's how I describe this year's conference.
  • More attendees
  • More data
  • More genomes
  • More instruments
  • More tweeters
  • More tweeting controversy
  • More software
  • More ...

Feel free to add more comments.

Wednesday, December 15, 2010

Genomics and Public Health: Epidemiology in Haiti

While there is debate and discussion about how genomics will be used in public health at a personalized medicine level, it is clear that rapid high-throughput DNA sequencing has immediate utility in epidemiological applications that seek to understand the origins of disease outbreaks.

In the most recent application, published this month, researchers at Pacific Biosciences (PacBio) used their PacBio RS sequencing system to identify the origins of the cholera outbreak in Haiti. According to the article, cholera has been present in Latin America since 1991, but it had not been epidemic in Haiti for at least 100 years. When the recent outbreak began in October this year, it was important to determine the origins of the disease, especially since it had been concluded that Haiti had a low cholera risk following the earthquake. Understanding the origins of a disease can help define virulence and resistance mechanisms to guide therapeutic approaches.

Sequencing organisms to discover their origins in outbreaks is not new. What is new is the speed at which this can be done. For example, it took two months for the SARS virus to be sequenced after the epidemic started. In the recent work, the sequencing was completed in four days. And it was not just one isolate that was sequenced, but two, with genomes 40 times larger.

When the Haiti sequences were compared to the sequences of 23 other V. cholerae strains, the data indicated that the Haiti strain matched strains from South Asia more closely than the endemic strains from Latin America. This finding tells us that the strain was likely introduced, perhaps by aid workers. Additional sequence analysis of the cholera toxin genes also confirmed that the strain causing the epidemic produces more severe disease. From a public health perspective this is important because the less virulent, easier-to-treat endemic strains can be displaced by more aggressive strains. The good news is that the new strain is sensitive to tetracycline, a first-line antibiotic.

The above work clearly demonstrates how powerful DNA sequencing is for strain identification. The authors favored single-molecule sequencing on PacBio because its cycle time is shorter than second-generation technologies like Illumina, SOLiD, and 454, and its long read lengths better handle repeats. While these points may be argued by the respective instrument vendors, it is clear that we are entering an era where we can go very quickly from isolating infectious agents to classifying them at very high resolution in unprecedented ways. DNA sequencing will have a significant role in diagnosing infectious agents.

Further reading:

Scientists Trace Origin of Recent Cholera Epidemic in Haiti, HHMI News
The Origin of the Haitian Cholera Outbreak Strain, NEJM 2010

Wednesday, September 29, 2010

A Genomics Genealogy

Deep sequencing technologies have radically changed how we study biology. Deciding what technology and software to use can be daunting. Choices become easier when the relationships between different DNA sequencing applications are understood.


A brief history 

DNA sequencing grew from our desire to understand how the instructions for the biochemistry of life are encoded in an organism’s DNA. If we know the precise ordering and organization of an organism’s DNA sequence, we can presumably unlock a code that reveals these instructions. Accomplishing this goal required the creation of a new field, molecular biology, and new technologies to sequence genes.

The first sequencing methods were arduous. They combined nuclease digestion with thin layer chromatography to measure di- and trinucleotides that could be puzzled together. Later, Maxam and Gilbert replaced enzymatic DNA degradation with a chemical fragmentation method that enabled the reading of ordered bases from 32P-labeled fragments separated by electrophoresis.

The Sanger method, which used dideoxynucleotide triphosphates to create ensembles of DNA molecules terminated at each base, soon replaced Maxam-Gilbert sequencing. The next innovation was to color code DNA with fluorescent dyes so that molecules could be interrogated with a laser and camera coupled to a computer. This innovation led to automated “high-throughput” DNA sequencing systems, initially with polyacrylamide gels and later with capillary electrophoresis, and made it possible to sequence the human and other genomes. It also created the first transcriptome analysis method, expressed sequence tag (EST) sequencing.

Despite 20 years of advances, however, the high-throughput sequencing methods were not high-enough-throughput to realistically interrogate DNA and RNA molecules in creative ways. Big questions (genomes, ESTs, meta-genomes) required large factory-like approaches to automate sample preparation and collect sequences because a fundamental problem had yet to be solved. Specifically, each sequence was obtained from an individual purified DNA clone or PCR product.

Real high-throughput is massively parallel throughput 

The next-generation DNA sequencing (NGS) technologies free researchers from the need to clone or purify every molecule. They all share the common innovation that DNA sequencing is performed in a massively parallel format. That is, a library, or ensemble of millions of DNA molecules, is sequenced simultaneously. Data collection costs are dramatically decreased through miniaturization and by eliminating the need for warehouses of colony pickers, prep robots, sequencing instruments, and large teams of people.

The new problems are dealing with the data that are produced and the increasing cost of computation. As NGS opens new possibilities to measure DNA and RNA in novel ways, each application requires a specific laboratory procedure that must be coupled to a specific analysis methodology.

Sequencing genealogy is defined by the questions 

In an evolutionary model, the history of cloning, restriction site mapping, and Sanger sequencing form the trunk of the genomics application tree (top figure) from which branches develop as new applications emerge.

NGS has driven the evolution of three main sequencing branches: De Novo, Functional Genomics, and Variation Assays. The De Novo, or Exploratory, branch contains three subbranches: new genomes (projects that seek to determine a complete genome sequence of an organism), meta-genomes (projects in which DNA fragments are sequenced from environmental samples), and meta-transcriptomes (projects in which cDNA fragments are sequenced from environmental samples).


The Functional Genomics branch is growing fast. In these experiments, different collections of RNA or DNA molecules from an organism, tissue, or cells are isolated and sequenced to measure gene expression and how it is regulated. Three subbranches describe the different kinds of functional genomics: Expression, Regulation, and EpiGenomics. Each of these subbranches can be further divided into specific assay groups (DGE, RNA-Seq, small RNA, etc.) that can be even further subdivided into specialized procedures (RNA-Seq with strandedness preserved) that are defined by laboratory protocols, kits, and instruments. When the experiments are refined and made reproducible, they become assays.

Variation Assays form the third main branch of the tree. Genomic sequences are compared within and between populations to link genotype and phenotype. In special cases like cancer and immunology research, variation assays are used to observe changes within an organism’s somatic genomes over time. Today, variation, or resequencing, assays measure single nucleotide variants and small insertions and deletions in whole genomes and exomes. If linked sequence strategies (mate-pairs, paired-ends) are used, larger structural changes, including copy number variations, can also be measured.

Why is this important?

As a software provider with both deep lab and analysis experience, we [Geospiza] are often asked which instrument platform is the best or how our software stacks up against other available options. The answer, of course, depends on what you want to do. De Novo applications benefit from the long reads offered by platforms like 454. Many of the assay-based applications demand ultra-deep sequencing with very high numbers of sequences (reads), as provided by the short-read platforms (Illumina, SOLiD). New single-molecule sequencing platforms like PacBio's are targeting a wide range of applications but have, thus far, best been demonstrated for long-read uses and novel methylation assays.

From an informatics perspective, the exploratory and assay-based branches have distinct software requirements. Exploratory applications require that reads be assembled into contigs that must be further ordered into scaffolds to get to the complete sequence. In meta-genomics or meta-transcriptomics applications, data are assembled to obtain gene sequences. These projects are further complicated by orthologous and paralogous sequences and by highly expressed genes that overrepresent certain sequences. In these situations, specialized hardware or complex data reduction strategies are needed to make assembly practical. Once data are assembled, they are functionally annotated in a second computational phase using tools like BLAST.

Assay-based data analysis also has two distinct phases, but they are significantly different from De Novo sequencing. The first phase involves aligning (or mapping) reads to reference data sources and then reducing the aligned data into quantitative values. At least one reference is required and the better it is annotated the more informative the initial results will be. Alignment differs from assembly in that reads are separately compared to a reference rather than amongst themselves. Alignment processing capacity can be easily scaled with multiple inexpensive computers whereas assembly processing cannot.
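As a concrete illustration of the first phase's "reduce alignments to quantitative values" step, the toy Python sketch below counts aligned reads per annotated gene. The gene coordinates, read positions, and names are invented; real pipelines work from aligner output such as SAM/BAM files and use indexed interval lookups.

```python
# A toy version of "reduce alignments to quantitative values": count aligned reads
# per annotated gene. Gene coordinates, read positions, and names are invented.

genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

# Each aligned read reduced to (chromosome, leftmost aligned position).
aligned_reads = [("chr1", 150), ("chr1", 450), ("chr1", 900), ("chr2", 50)]

def count_reads_per_gene(genes, reads):
    counts = {name: 0 for name in genes}
    for chrom, pos in reads:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return counts

print(count_reads_per_gene(genes, aligned_reads))   # {'geneA': 2, 'geneB': 1}
```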

The second phase of Assay-based sequencing is to produce a discrete output as defined by a diagnostic application, or compare the quantitative values computed from the alignments from several samples, obtained from different individuals and (or) treatments relative to controls. This phase requires statistical tools to normalize data, filter false positives and negatives, and measure differences. Assay-based applications become more informative when large numbers of samples and replicates are included in a study.
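For the second phase, here is a deliberately simplified sketch of comparing two samples: raw counts are scaled to counts per million and a per-gene log2 ratio is reported. The counts are invented, and a real analysis would use replicates and proper statistical tests rather than a bare ratio.

```python
import math

# Simplified two-sample comparison: scale raw counts to counts per million (CPM),
# then report a per-gene log2 ratio. Counts are invented for illustration.

sample_counts = {
    "control":   {"geneA": 250, "geneB": 900, "geneC": 40},
    "treatment": {"geneA": 500, "geneB": 850, "geneC": 10},
}

def counts_per_million(counts):
    total = sum(counts.values())
    return {gene: 1e6 * c / total for gene, c in counts.items()}

cpm = {sample: counts_per_million(counts) for sample, counts in sample_counts.items()}

for gene in sample_counts["control"]:
    ratio = (cpm["treatment"][gene] + 1) / (cpm["control"][gene] + 1)   # +1 avoids log(0)
    print(f"{gene}: log2 fold change = {math.log2(ratio):.2f}")
```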

Connecting the dots 

While the sequencing applications can be grouped and summarized in different ways, they are also interrelated. For example, De Novo projects are open-ended and exploratory, but their end product, a well-annotated reference sequence, is the foundation for Functional Genomics and Variation applications. Variation analysis is only useful if we can assign function to specific genotypes. Functional assignments come, in part, from previous experiments and genomic annotations, but are increasingly being produced by sequencing assays, so the new challenge is integrating the data obtained from different assays into coherent datasets that can link many attributes to a set of genotypes.

NGS clearly opens new possibilities for studying and characterizing biological systems. Different applications require different sequencing platforms, laboratory procedures, and software systems that can organize analysis tools and automate data processing. On this last point, as researchers evaluate their projects and their options for success, they need to identify informatics groups that have deep experience, available solutions, and strong capabilities to meet the next challenges. Geospiza is one such group.

Further Reading

DNA Sequencing History

Gilbert W, Maxam A (1973) The nucleotide sequence of the lac operator. Proc Natl Acad Sci U S A 70:3581

Maxam AM, Gilbert W (1977) A new method for sequencing DNA. Proc Natl Acad Sci U S A 74:560

Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74:5463-7

Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SB, Hood LE (1986) Fluorescence detection in automated DNA sequence analysis. Nature 321:674-9

Adams MD, Soares MB, Kerlavage AR, Fields C, Venter JC (1993) Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4:373-80

International Human Genome Sequencing Consortium, 2001. “Initial sequencing and analysis of the human genome.” Nature 409, 860-921.
Venter J.C., Adams M.D., Myers E.W., et. al. 2001. “The sequence of the human genome.” Science 291, 1304-1351.


Tuesday, June 29, 2010

GeneSifter Lab Edition 3.15

Last week we released GeneSifter Laboratory Edition (GSLE) 3.15. From NGS quality control data, to improved microarray support, to Sanger sequencing support, to core lab branding, and more, there is a host of features and improvements for everyone that continue to make GSLE the leading LIMS for genetic analysis.

The three big features are QC analysis of Next Generation Sequencing (NGS) data, QC analysis of microarrays, and core lab branding support.
  • To better troubleshoot runs, and view data quality for individual samples in a multiplex, the data within fastq, fasta, or csfasta (and quality) files are used to generate quality report graphics (figure below). These include the overall base (color) composition, average per-base quality values (QVs), box-and-whisker plots showing the median, lower and upper quartiles, and minimum and maximum QVs at each base position, and error analysis indicating the number of QVs below 10, 20, and 30 (a minimal sketch of this kind of per-base computation appears after this list). A link is also provided to conveniently view the sequence data in pages, so that GBs of data do not stream into your browser.
  • For microarray labs, quality information from CHP and CEL files and probe intensity data from CEL files are displayed. Please contact support@geospiza.com to activate the Affymetrix Settings and configure the CDF file path and power tools. 
  • For labs that use their own ordering systems, a GSLE data view page has been created that can be embedded in a core lab website. To support end user access, a new user role, Data Viewer, has been created to limit access to only view folders and data within the user's lab group. Please contact support@geospiza.com to activate the feature.  
  • The ability to create HTML tables in the Welcome Message for the Home page has been returned to provide additional message formatting capabilities. 
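The per-base quality metrics described above can be computed directly from a FASTQ file. The Python sketch below is an illustration of that kind of calculation, not GSLE's implementation; it assumes Phred+33 quality encoding and uses a hypothetical file name.

```python
from collections import defaultdict

# Per-base quality summary from a FASTQ file: mean QV at each position and counts of
# QVs below 10, 20, and 30. Illustration only; assumes Phred+33 quality encoding.

def per_base_quality(fastq_path):
    qv_sums, qv_counts = defaultdict(int), defaultdict(int)
    low_qv = {10: 0, 20: 0, 30: 0}
    with open(fastq_path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 3:                          # every 4th line is the quality string
                for pos, char in enumerate(line.rstrip("\n")):
                    qv = ord(char) - 33             # Phred+33 encoding
                    qv_sums[pos] += qv
                    qv_counts[pos] += 1
                    for cutoff in low_qv:
                        if qv < cutoff:
                            low_qv[cutoff] += 1
    mean_qv = {pos: qv_sums[pos] / qv_counts[pos] for pos in qv_sums}
    return mean_qv, low_qv

# mean_qv, low_qv = per_base_quality("lane1_sampleA.fastq")   # hypothetical file name
```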

Laboratory Operations
Several features and improvements introduced in 3.15 help prioritize steps, update items, improve ease of use, and enhance data handling.

Instrument Runs
  • A time/date stamp has been added to the Instrument Run Details page to simplify observing when runs were completed. 
  • Partial Sanger (CE) runs can be manually completed (like NGS runs) instead of requiring that all reactions be complete or that the remaining ones be failed. 
  • The NGS directory view of result files now provides deletion actions (by privileged users) so labs can more easily manage disk usage. 
Sample Handling
  • Barcodes can be recycled or reused for templates that are archived to better support labs using multiple lots of 2D barcode tubes. However, the template barcode field remains unique for active templates. 
  • Run Ready Order Forms allow a default tag for the Plate Label to populate the auto-generated Instrument Run Name to make Sanger run set up quicker. 
  • The Upload Location Map action has been moved to the side menu bar under Lab Setup to ease navigation. 
  • The Template Workflows “Transition to the Next Workflow” action is now in English: “Enter Next Workflow.”
  • All Sanger chromatogram download options are easier to see and now include the option to download .phd formatted files. 
  • The DNA template location field can be used to search for a reaction in a plate when creating a reaction plate.
  • To redo a Sanger reaction with a different chemistry, the chemistry can now be changed when either Requeuing for Reacting is chosen, or Edit Reactions from within a Reaction Set is selected.
Orders and Invoices 
More efficient views and navigation have been implemented for Orders and Invoices.
  • When Orders are completed, the total number of samples and the number of results can be compared on the Update Order Status page to help identify repeated reactions. 
  • A left-hand navigation link has been added for core lab customers to review both Submitted and Complete Invoices. The link is only active when invoicing is turned on in the settings. 
System Management 
Several new system settings now enable GSLE to be more adaptable at customer sites.
  • The top header bar time zone display can be disabled or configured for a unique time zone to support labs with customers in different time zones. 
  • The User account profile can be configured to require certain fields. In addition, if Lab Group is not required, then Lab Groups are created automatically. 
  • Projects within GSLE can be inactivated by all user roles to hide data not being used. 
Application Programming Interface
Several additions to the self-documenting Application Programming Interface (API) have been made.
  • An upload option for Charge Codes within the Invoice feature was added.
  • Form API response objects are now more consistent.
  • API keys for user accounts can be generated in bulk.
  • Primers can be identified by either label or ID.
  • Events have been added. Events provide a mechanism to call scripts or send emails (beyond the current defaults) when system objects undergo workflow changes.  
Presently, APIs can only be activated on local, on-site installations. 

Friday, May 21, 2010

I Want My FinchTV*

Without a doubt FinchTV is a wildly successful sequence trace viewer. Since its launch, close to 150,000 researchers and students have enjoyed its easy-to-use interface, cross-platform capabilities, and unique features. But where does it go from here? 

Time for Version 2

FinchTV is Geospiza's free DNA sequence trace viewer that is used on Macintosh, Windows, and Linux computers to open and view DNA sequence data from Sanger-based instruments. It reads both AB1 and SCF files and displays the DNA sequence and four-color traces of the corresponding electropherogram. When quality values are present, they are displayed with the data. Files can be opened with a simple drag-and-drop action, and once opened, traces can be viewed in either a single-pane or multi-pane full-sequence format. Sequences can be searched using regular expressions, edited, and regions selected and used to launch NCBI BLAST searches.
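As a small illustration of what a regular-expression search over a base-called sequence can look like (a generic sketch, not FinchTV's code), the Python snippet below expands IUPAC degenerate bases into character classes and scans an invented sequence for a motif.

```python
import re

# A toy regular-expression search over a base-called sequence. The sequence and
# motif are invented; IUPAC degenerate bases are expanded into character classes.

sequence = "ATGGAATTCGCTAGCTAGGATCCGGAATTCTTAA"

iupac = {"R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]", "N": "[ACGT]"}

def motif_to_regex(motif):
    return "".join(iupac.get(base, base) for base in motif)

for match in re.finditer(motif_to_regex("GGATYC"), sequence):
    print("match at position", match.start(), match.group())
```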

Over the past years we have learned that FinchTV is used in many kinds of environments such as research labs, biotechnology and pharmaceutical companies, and educational settings. In some labs, it is the only tool people use to work with their data. We’ve also collected a large number of feature requests that include viewing protein translations, performing simple alignments, working with multiple sequences, changing the colors of the electropherogram tracings, and many others.

Free software is not built from free development 

FinchTV was originally developed under an SBIR grant as a prototype for cross-platform software development. Until then, commercial-quality trace viewers ran on either Windows or Macintosh, never both. Cross-platform viewers were crippled versions of commercial programs, none of the programs incorporated modern GUI (graphical user interface) features, and all were cumbersome to use.

FinchTV is a high quality, full featured, free program; we want to improve the current version and keep it free. So, the question becomes how to keep a free product up to date?

One way is through grant funding. Geospiza believes a strong case can be made to develop a new version of FinchTV under an SBIR grant because we know Sanger sequencing is still very active. From the press coverage, one would think next-generation DNA sequencing (NGS) is going to be the way all sequencing will soon be done. True, there are many projects where Sanger is no longer appropriate, but NGS cannot do small things, like confirming clones. Sanger sequencing also continues to grow in the classroom, hence tools like FinchTV are great as educational resources. 

We think there are more uses too, so we’d like to hear your stories.

How do you use FinchTV?
What would you like FinchTV to do?

Send us a note (info at geospiza.com) or, even better, add a comment below. We plan to submit the proposal in early August and look forward to hearing your ideas.

* apologies to Dire Straits' "Money for Nothing"

Tuesday, February 23, 2010

GeneSifter Lab Edition v3.14 - Release Notes

GeneSifter Laboratory Edition (GSLE) 3.14.0 introduces a host of new features and capabilities that make daily laboratory data management work even easier.  Read below to learn why GSLE is a leading LIMS product for all forms of DNA sequencing, microarrays, and other genetic analysis applications.

Orders and Invoices

Multi plate submissions: Order forms have been extended in several ways to further simplify how labs collect sample and project information. A new order form template lets core facilities, managing larger sequencing projects, easily receive samples and their information in a multiple plate format. New order fields specific to the plate format are included to support sample tracking and lab work.

Add data to fields: Order forms have been further improved by adding the ability to add new values (or terms) to dropdown fields that already exist on published order forms.


Project field: Additionally, labs can add an optional project field to forms. With these improvements, labs can create forms that are easier to use and modify, as well as enable project tracking for their customers.

Sample location and sample selection: Two new features deliver help for labs that provide sample storage (biobanking) services to their clients. First, order forms can include sample location information. This is particularly useful in situations where samples are delivered in 96-well plates that are stored for later use. Second, samples already stored by the lab as purified DNA, RNA or other material (templates) can be selected from specialized search interfaces within order forms. Like all GSLE sample entry forms, these features can be included or not on a case-by-case basis depending on your specific needs. 

Invoice formatting: For labs that have the dreaded chore of sending billing data to accounting departments we have added the ability to modify the invoice number format to include additional characters that are used to distinguish which labs are sending information.

Laboratory Operations


GSLE provides the ability to create, list and follow steps in sample protocols (also called workflows). In 3.14 new features not only expand the capabilities but make it possible to further standardize procedures. 


Multiplexing: In Next Generation Sequencing (NGS), several libraries are often combined into a single lane or region of a slide to increase the number of individual samples analyzed in a sequencing run. As each library is prepared, a specific adaptor sequence is added so sequence reads corresponding to different samples can be identified by their adaptor tag. This procedure, called multiplexing or barcoding, is supported in 3.14 and allows the lab to combine samples and adaptor sequences and group the combination of libraries together (Worksets) for sample processing and instrument runs. Once data are collected, sample naming conventions, combined with the adaptor sequences (Multiplex Identifiers, MIDs) stored in sample sheets, are used to separate individual reads into files corresponding to the samples that were in the original workset.
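To show the idea behind that read-separation step, here is a simplified Python sketch that routes FASTQ records to per-sample files by an exact match on a leading barcode. The barcodes and file names are hypothetical, and production demultiplexers also handle mismatches, barcode trimming, and paired reads.

```python
# A simplified demultiplexer: route FASTQ records to per-sample files by an exact
# match on a leading barcode (MID). Barcodes and file names are hypothetical;
# production tools also handle mismatches, barcode trimming, and paired reads.

barcodes = {"ACGT": "sample_A.fastq", "TGCA": "sample_B.fastq"}

def demultiplex(fastq_path, barcodes, barcode_len=4):
    outputs = {bc: open(path, "w") for bc, path in barcodes.items()}
    unassigned = open("unassigned.fastq", "w")
    with open(fastq_path) as handle:
        while True:
            record = [handle.readline() for _ in range(4)]   # header, sequence, '+', quality
            if not record[0]:                                # end of file
                break
            barcode = record[1][:barcode_len]
            (outputs.get(barcode) or unassigned).writelines(record)
    for out in outputs.values():
        out.close()
    unassigned.close()

# demultiplex("pooled_lane1.fastq", barcodes)   # hypothetical pooled input file
```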

Batch data entry: Some lab processes require that samples are manipulated in groups (batches), but laboratory data are collected for individual samples within the batch. For example, the concentrations of individual DNA samples may need to be measured in a 96-well plate. To improve how the OD values, comments, or other information are entered, workflow steps have been updated to include batch data entry forms that provide spreadsheet-like data entry capabilities. Like all GSLE batch data entry forms, data can be entered easily using the form’s column highlight and easy fill controls, or uploaded from an Excel spreadsheet.

Subsample processing: GSLE 3.14 also increases sample processing flexibility. As noted above, order forms can now support the ability to select samples that are already stored in the system. This feature is further extended into the laboratory by creating tools that allow many new samples to be created from a “parent” or stock sample. When new samples (templates) are created, options are provided so that each new sample can be entered into a different process. For example, you receive a tissue sample that needs several experiments performed: RNA-Seq, ChIP-Seq, and resequencing. Now you can easily pick the sample and create three new subsamples, defining which process will be performed on each sample with just a few clicks.

Selecting samples based on custom data: Some labs need to use custom data entered into order forms to sort and filter samples in the lab. For example, an order form may ask a researcher to enter read lengths for their NGS run. A 36-base run is much faster than a 100-base run, and on some platforms costs less. Thus, the lab will sort samples based on read length prior to the data collection event. While it has always been possible to get this information from many GSLE displays, 3.14 adds new capabilities to use any custom data in its specialized sample picker tools.

Other Features

Customer data management: GSLE v3.14 gives labs’ customers increased ability to organize their chromatograms, fragment analysis files and microarray files as needed. Data files can be edited, relabeled, moved or deleted. Projects and folders can be created, modified or deleted to aid in data organization.

Application Programming Interface (Onsite Installations Only)

SQL-API: As automation and system integration needs increase, requirements for supporting programmatic data entry become more important. GSLE has continued to expand the self-documenting Application Programming Interface (API). We have also added an SQL API that can be used to create custom reports that are accessed via a wget-style Unix command.


Input API enhancements: The Input API now returns success IDs and CGI parameter names have been eliminated. The full documentation can be reviewed by contacting support@geospiza.com for the GSLE SQL API Manual or the GSLE Input API Manual. 


Next Generation Analysis Transfer Tool (Hosted Partners Only)

Simplified data transfers: A data transfer interface has been added to connect GSLE and GeneSifter Analysis Edition (GSAE). Partner Program administrators use the interface to select data files in GSLE and transfer them to their customer’s account in GSAE.

Schema Table update note


There was an update to an existing schema table;  the column "Plate_Label" is now in table om_sample_plate instead of om_order.

Thursday, October 8, 2009

Resequencing and Cancer

Yesterday we released news about new funding from NIH for a project to work on ways to improve how variations between DNA sequences are detected using Next Generation Sequencing (NGS) technology. The project emphasizes detecting rare variation events to improve cancer diagnostics, but the work will support a diverse range of resequencing applications.

Why is this important?

In October 2008, U.S. News & World Report published an article by Bernadine Healy, former head of NIH. The tag line, “understanding the genetic underpinnings of cancer is a giant step toward personalized medicine” (1), underscores how the popular press views the promise of recent advances in genomics technology in general, and the progress toward understanding the molecular basis of cancer. In the article, Healy presents a scenario where, in 2040, a 45-year-old woman, who has never smoked, develops lung cancer. She undergoes outpatient surgery, and her doctors quickly scrutinize the tumor’s genes and use a desktop computer to analyze the tumor genome and her medical records to create a treatment plan. She is treated, the cancer recedes, and subsequent checkups are conducted to monitor tumor recurrence. Should a tumor be detected, her doctor would quickly analyze the DNA of a few of the shed tumor cells and prescribe a suitable next round of therapy. The patient lives a long happy life, and keeps her hair.

This vision of successive treatments based on genomic information is not unrealistic, claims Healy, because we have learned that while many cancers can look homogeneous in terms of morphology and malignancy they are indeed highly complex and varied when examined at the genetic level. The disease of cancer is in reality a collection of heterogeneous diseases that, even for common tissues like the prostate, can vary significantly in terms of onset and severity. Thus, it is often the case that cancer treatments, based on tissue type, fail, leaving patients to undergo a long painful process of trial and error therapies with multiple severely toxic compounds.

Because cancer is a disease of genomic alterations, understanding the sources, causes, and kinds of mutations, their connection to specific types of cancer, and how they may predict tumor growth is worthwhile. The human cancer genome project (2) and initiatives like the International Cancer Genome Consortium (3) have demonstrated this concept. The kinds of mutations found in tumor populations thus far by NGS include single nucleotide polymorphisms (SNPs), insertions and deletions, and small structural copy number variations (CNVs) (4, 5). From early studies it is clear that a greater amount of genomic information will be needed to make Healy's scenario a reality. Next generation sequencing (NGS) technologies will drive this next phase of research and enable our deeper understanding.

Project Synopsis

The great potential for the clinical applications of new DNA sequencing technologies comes from their highly sensitive ability to assess genetic variation. However, to make these technologies clinically feasible, we must assay patient samples at far higher rates than can be done with current NGS procedures. Today, the experiments applying NGS in cancer research have investigated small numbers of samples in great detail, in some cases comparing entire genomes from tumor and normal cells from a single patient (6-8). These experiments show that, when a region is sequenced with sufficient coverage, numerous mutations can be identified.

To move NGS technologies into clinical use, many costs must decrease. Two ways costs can be lowered are to increase sample density and reduce the number of reads needed per sample. Because cost is a function of turnaround time and read coverage, and read coverage is a function of the signal-to-noise ratio, assays with higher background noise, due to errors in the data, will require higher sampling rates to detect true variation and will be more expensive. To put this in context, future cancer diagnostic assays will likely need to look at over 4000 exons per test. In cases like bladder cancer, or cancers where stool or blood are sampled, non-invasive tests will need to detect variations in one out of 1000 cells. Thus it is extremely important that we understand signal-to-noise ratios and be able to calculate read depth in a reliable fashion.

Currently we have a limited understanding of how many reads are needed to detect a given rare mutation. Detecting mutations depends on a combination of sequencing accuracy and depth of coverage. The signal (true mutations) to noise (false mutations, hidden mutations) depends on how many times we see a correct result. Sequencing accuracy is affected by multiple factors that include sample preparation, sequence context, sequencing chemistry, instrument accuracy, and basecalling software. The current depth-of-coverage calculations are based on an assumption that sampling is random, which is not valid in the real world. Corrections will have to be applied to adjust for real-world sampling biases that affect read recovery and sequencing error rates (9-11).
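To make the coverage question concrete, the sketch below works through the idealized version of the calculation, deliberately using the random-sampling assumption the paragraph above cautions against. It asks how often a variant present in roughly 1 of 1000 cells would be seen at least k times at a given coverage, and how often sequencing errors alone would produce the same signal. The coverage, error rate, and threshold are illustrative numbers, not recommendations.

```python
from math import comb

# Idealized detection odds under a random-sampling assumption: probability that a
# rare variant is observed at least k times in n reads, versus the probability that
# sequencing errors alone reach the same threshold. All parameters are illustrative.

def prob_at_least_k(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of P(X < k)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

coverage = 10_000        # reads covering the position
variant_freq = 0.001     # variant allele present in ~1 of 1000 cells
error_rate = 0.0001      # per-base errors producing this allele (illustrative)
threshold = 5            # call the variant if seen at least this many times

print("P(detect true variant):   ", prob_at_least_k(coverage, variant_freq, threshold))
print("P(false call from errors):", prob_at_least_k(coverage, error_rate, threshold))
```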

Developing clinical software systems that can work with NGS technologies to quickly and accurately detect rare mutations requires a deep understanding of the factors that affect NGS data collection and interpretation. This information needs to be integrated into decision control systems that can, through a combination of computation and graphical displays, automate and aid a clinician’s ability to verify and validate results. Developing such systems is a major undertaking involving a combination of research and development in the areas of laboratory experimentation, computational biology, and software development.

Positioned for Success

Detecting small genetic changes in clinical samples is ambitious. Fortunately, Geospiza has the right products to deliver on the goals of the research. GeneSifter Lab Edition handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. The data management and analysis server make the system scalable through a distributed architecture. Combined with GeneSifter Analysis Edition, a complete platform is created to rapidly prototype new data analysis workflows needed to test new analysis methods, experiment with new data representations, and iteratively develop data models to integrate results with experimental details.

References:

Press Release: Geospiza Awarded SBIR Grant for Software Systems for Detecting Rare Mutations

1. Healy, 2008. "Breaking Cancer's Gene Code - US News and World Report" http://health.usnews.com/articles/health/cancer/2008/10/23/breaking-cancers-gene-code_print.htm

2. Working Group, 2005. "Recommendation for a Human Cancer Genome Project" http://www.genome.gov/Pages/About/NACHGR/May2005NACHGRAgenda/ReportoftheWorkingGrouponBiomedicalTechnology.pdf

3. ICGC, 2008. "International Cancer Genome Consortium - Goals, Structure, Policies & Guidelines - April 2008" http://www.icgc.org/icgc_document/

4. Jones S., et. al., 2008. "Core Signaling Pathways in Human Pancreatic Cancers Revealed by Global Genomic Analyses." Science 321, 1801.

5. Parsons D.W., et. al., 2008. "An Integrated Genomic Analysis of Human Glioblastoma Multiforme." Science 321, 1807.

6. Campbell P.J., et. al., 2008. "Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing." Proc Natl Acad Sci U S A 105, 13081-13086.

7. Greenman C., et. al., 2007. "Patterns of somatic mutation in human cancer genomes." Nature 446, 153-158.

8. Ley T.J., et. al., 2008. "DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome." Nature 456, 66-72.

9. Craig D.W., et. al., 2008. "Identification of genetic variants using bar-coded multiplexed sequencing." Nat Methods 5, 887-893.

10. Ennis P.D.,et. al., 1990. "Rapid cloning of HLA-A,B cDNA by using the polymerase chain reaction: frequency and nature of errors produced in amplification." Proc Natl Acad Sci U S A 87, 2833-2837.

11. Reiss J., et. al., 1990. "The effect of replication errors on the mismatch analysis of PCR-amplified DNA." Nucleic Acids Res 18, 973-978.

Tuesday, October 6, 2009

From Blue Gene to Blue Genome? Big Blue Jumps In with DNA Transistors

Today, IBM announced that they are getting into the DNA sequencing business and the race for the $1,000 genome by winning a research grant to explore new sequencing technology based on nanopore devices they call DNA transistors.

IBM news travels fast. Genome Web and The Daily Scan covered the high points and Genetic Future presented a skeptical analysis of the news. You can read the original news at the IBM site, and enjoy a couple of videos.

A NY Times article listed a couple of facts that I thought were interesting: First, IBM becomes the 17th company to pursue the next next-gen (or third-generation) technology. Second, according to George Church, in the past five years the cost of collecting DNA sequence data has decreased 10-fold annually and is expected to continue decreasing at a similar pace for the next few years.

But what does this all mean?

It is clear from this and other news that DNA sequencing is fast becoming a standard way to study genomes, study gene expression, and measure genetic variation. It is also clear that while the cost of DNA sequencing is decreasing at a fast rate, the amount of data being produced is increasing at a similarly fast rate.

While some of the articles above discussed the technical hurdles nanopore sequencing must overcome, none discussed the real challenges researchers face today in using the data. The fact is, for most groups, the current next-gen sequencers are underutilized because the volumes of data, combined with the complexity of data analysis, have created a significant bioinformatics bottleneck.

Fortunately, Geospiza is clearing data analysis barriers by delivering access to systems that provide standard ways of working with the data and visualizing results. For many NGS applications, groups can upload their data to our servers, align reads to reference data sources, and compare the resulting output across multiple samples in efficient and cost effective processes.

And, because we are meeting the data analysis challenges for all of the current NGS platforms, we'll be ready for whatever comes next.

Saturday, September 12, 2009

Sneak Peek: Sequencing the Transcriptome: RNA Applications for Next Generation Sequencing

Join us this coming Wednesday, September 16, 2009, at 10:00 am Pacific Daylight Time (San Francisco, GMT-07:00) for a webinar on whole transcriptome analysis. In the presentation you will learn how GeneSifter Analysis Edition can be used to identify novel RNAs and novel splice events within known RNAs.

Abstract:

Next Generation Sequencing applications such as RNA-Seq, Tag Profiling, Whole Transcriptome Sequencing and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample, these applications are also ideal for the identification of novel RNAs and novel splicing events for known RNAs.

This presentation will provide an overview of the RNA applications using data from the NCBI's GEO database and Short Read Archive with an emphasis on converting raw data into biologically meaningful datasets. Data analysis examples will focus on methods for identifying differentially expressed genes, novel genes, differential splicing and 5’ and 3’ variation in miRNAs.

To register, please visit the event page.

Sunday, July 12, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part V: Why HDF5?

Through the course of this BioHDF bloginar series, we have demonstrated how the HDF5 (Hierarchical Data Format) platform can successfully meet current and future data management challenges posed by Next Generation Sequencing (NGS) technologies. We now close the series by discussing the reasons why we chose HDF5.

For previous posts, see:

  1. The introduction
  2. Project background
  3. Challenges of working with NGS data
  4. HDF5 benefits for working with NGS data

Why HDF5?

As previously discussed, HDF technology is designed for working with large amounts of complex data that naturally organize into multidimensional arrays. These data are composed of discrete numeric values, strings of characters, images, documents, and other kinds of data that must be compared in different ways to extract scientific information and meaning. Software applications that work with such data must meet a variety of organization, computation, and performance requirements to support the communities of researchers where they are used.

When software developers build applications for the scientific community, they must decide whether to create new file formats and software tools for working with the data, or to adapt existing solutions that already meet a general set of requirements. The advantage of developing software that's specific to an application domain is that highly optimized systems can be created. However, this advantage can disappear when significant amounts of development time are needed to deal with the "low-level" functions of structuring files, indexing data, tracking bits and bytes, making the system portable across different computer architectures, and creating a basic set of tools to work with the data. Moreover, such a system would be unique, with only a small set of users and developers able to understand and share knowledge concerning its use.

The alternative to building a highly optimized domain-specific application system is to find and adapt existing technologies, with a preference for those that are widely used. Such systems benefit from the insights and perspective of many users and will often have features in place before one even realizes they are needed. If a technology has widespread adoption, there will likely be a support group and knowledge base to learn from. Finally, it is best to choose a solution that has been tested by time. Longevity is a good measure of the robustness of the various parts and tools in the system.

HDF: 20 Years in Physical Sciences

Our requirements for a high-performance data management and computation system are these:

  1. Different kinds of data need to be stored and accessed.
  2. The system must be able to organize data in different ways.
  3. Data will be stored in different combinations.
  4. Visualization and computational tools will access data quickly and randomly.
  5. Data storage must be scalable, efficient, and portable across computer platforms.
  6. The data model must be self-describing and accessible to software tools.
  7. Software used to work with the data must be robust and widely used.

HDF5 is a natural fit. The file format and software libraries are used in some of the largest data management projects known to date. Because of its strengths, HDF5 is independently finding its way into other bioinformatics applications and is a good choice for developing software to support NGS.

HDF5 software provides a common infrastructure that allows different scientific communities to build specific tools and applications. Applications using HDF5 typically contain three parts: one or more HDF5 files to store data, a library of software routines to access the data, and the tools, applications, and additional libraries that carry out functions specific to a particular domain. To implement an HDF5-based application, a data model must be developed along with application-specific tools such as user interfaces and unique visualizations. While implementation can be a lot of work in its own right, the tools to implement the model and provide scalable, high-performance programmatic access to the data have already been developed, debugged, and delivered through the HDF I/O (input/output) library.

In earlier posts, we presented examples where we needed to write software to parse fasta-formatted sequence files and the output files from alignment programs. These parsers then called routines in the HDF I/O library to add data to the HDF5 file. During the import phase, we could set different compression levels and define the chunk size to compress our data and optimize access times. In these cases, we developed a simple data model based on the alignment output from programs like BWA, Bowtie, and MapReads. Most importantly, we were able to work with NGS data from multiple platforms efficiently, with software that required weeks of development rather than the months or years that would be needed if the system were built from scratch.
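
To make the import step concrete, here is a minimal sketch of that workflow in Python, assuming the h5py library. The parser, file names, dataset names, and fixed string sizes are illustrative choices, not the BioHDF data model or I/O library.

```python
# A minimal sketch (not the BioHDF implementation) of parsing a FASTA file and
# loading the reads into an HDF5 file with chunking and gzip compression.
import h5py
import numpy as np

def parse_fasta(path):
    """Yield (identifier, sequence) pairs from a FASTA file."""
    name, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(seq)
                name, seq = line[1:], []
            else:
                seq.append(line)
        if name is not None:
            yield name, "".join(seq)

def load_reads(fasta_path, hdf5_path):
    ids, seqs = zip(*parse_fasta(fasta_path))
    with h5py.File(hdf5_path, "w") as f:
        grp = f.create_group("reads")
        # Chunked, compressed datasets; chunk size and compression level are
        # tunable to trade import time against access speed.
        grp.create_dataset("id", data=np.array(ids, dtype="S64"),
                           chunks=True, compression="gzip", compression_opts=4)
        grp.create_dataset("sequence", data=np.array(seqs, dtype="S200"),
                           chunks=True, compression="gzip", compression_opts=4)

load_reads("sample1.fasta", "sample1_reads.h5")
```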

While HDF5 technology is powerful "out-of-the-box," a number of features can still be added to make it better for bioinformatics applications. The BioHDF project is about making such domain-specific extensions. These are expected to include modifications to the general file format to better support variable-length data like DNA sequences. I/O library extensions will be created to help HDF5 "speak" bioinformatics by creating APIs (Application Programming Interfaces) that understand our data. Finally, sets of command line programs and other tools will be created to help bioinformatics groups get started quickly with using the technology.

To summarize, the HDF5 platform is well-suited for supporting NGS data management and analysis applications. Using this technology, groups will be able to make their data more portable for sharing because the data model and data storage are separated from the implementation of the model in the application system. HDF5's flexibility in the kinds of data it can store makes it easier to integrate data from a wide variety of sources. Integrated compression utilities and data chunking make HDF5-based systems highly scalable. Finally, because the HDF5 I/O library is extensive and robust, and the HDF5 toolkit includes basic command-line and GUI tools, the platform allows for rapid prototyping and reduced development time, making it easier to create new approaches for NGS data management and analysis.

For more information, or if you are interested in collaborating on the BioHDF project, please feel free to contact me (todd at geospiza.com).

Monday, July 6, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part IV: HDF5 Benefits

Now that we're back from Alaska and done with the 4th of July fireworks, it's time to present the next installment of our series on BioHDF.

HDF highlights
HDF technology is designed for working with large amounts of scientific data and is well suited for Next Generation Sequencing (NGS). Scientific data are characterized by very large datasets that contain discrete numeric values, images, and other data, collected over time from different samples and locations. These data naturally organize into multidimensional arrays. To obtain scientific information and knowledge, we combine these complex datasets in different ways and/or compare them to other data using multiple computational tools. One difficulty that plagues this work is that the software applications and systems for organizing the data, comparing datasets, and visualizing the results are complicated, resource intensive, and challenging to develop. Many of these development and implementation challenges can be overcome using the HDF5 file format and software library.

Previous posts have covered:
1. An introduction
2. Background of the project
3. Complexities of NGS data analysis and performance advantages offered by the HDF platform.

HDF5 changes how we approach NGS data analysis.

As previously discussed, the NGS data analysis workflow can be broken into three phases. In the first phase (primary analysis), images are converted into short strings of bases. Next, the bases, represented individually or encoded as di-bases (SOLiD), are aligned to reference sequences (secondary analysis) to create derivative data types, such as contigs or annotated tables of alignments, which are further analyzed (tertiary analysis) in comparative ways. Quantitative applications, such as gene expression analysis, compare the results of secondary analyses between individual samples to measure expression levels, identify mRNA isoforms, or make other observations based on a sample's origin or treatment.

The alignment phase of the data analysis workflow creates the core information. During this phase, reads are aligned to multiple kinds of reference data to understand sample and data quality, and obtain biological information. The general approach is to align reads to sets of sequence libraries (reference data). Each library contains a set of sequences that are annotated and organized to provide specific information.

Quality control measures can be added at this point. One way to measure data quality is to ask how many reads were obtained from constructs without inserts. Aligning the read data to a set of primers (individually and joined in different ways) that were used in the experiment allows us to measure the number of reads that match and how well they match. A higher quality dataset will have a larger proportion of sequences matching our sample and a smaller proportion of sequences that match only the primers. Similarly, different biological questions can be asked using libraries constructed of sequences that have biological meaning.

Aligning reads to sequence libraries is the easy part. The challenge is analyzing the alignments. Because the read datasets in NGS assays are large, organizing alignment data into forms we can query is hard. The problem is simplified by setting up the multistage alignment process as a set of filters. That is, reads that match one library are excluded from the next alignment. Differential questions are then asked by counting the number of reads that match each library. With this approach, each set of alignments is independent of the others, and a program only needs to analyze one set of alignments at a time. Filter-based alignment is also used to distinguish reads with perfect matches from those with one or more mismatches.
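
The filtering logic is simple enough to sketch. The snippet below is an illustration of the approach, not production code; `match_library` is a toy stand-in (exact sequence lookup) for running a real aligner such as BWA or Bowtie against one reference library.

```python
# Illustrative sketch of the filter-based, multistage alignment described above.

def match_library(library_seqs, reads):
    """Return the IDs of reads whose sequence is found in the library (toy aligner)."""
    return {read_id for read_id, seq in reads.items() if seq in library_seqs}

def filter_pipeline(reads, libraries):
    """Align reads to each library in order; reads that match one library are
    excluded from all subsequent alignment steps, and per-library counts are kept."""
    remaining = dict(reads)            # read_id -> sequence
    counts = {}
    for name, library_seqs in libraries:
        matched = match_library(library_seqs, remaining)
        counts[name] = len(matched)
        for read_id in matched:        # filtered reads are never seen again
            del remaining[read_id]
    counts["unmatched"] = len(remaining)
    return counts

reads = {"r1": "ACGT", "r2": "GGGG", "r3": "TTTT"}
libraries = [("primers", {"ACGT"}), ("rRNA", {"GGGG"})]
print(filter_pipeline(reads, libraries))   # {'primers': 1, 'rRNA': 1, 'unmatched': 1}
```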

Still, filter-based alignment approaches have several problems. When new sequence libraries are introduced, the entire multistage alignment process must be repeated to update results. Next, information about reads that have multiple matches in different libraries, or both perfect and imperfect matches within a library, is lost. Finally, because alignment formats differ between programs and good methods for organizing alignment data do not exist, it is hard to compare alignments between multiple samples. This last issue also creates challenges for linking alignments to the original sequence data and formatting information for other tools.

As previously noted, solving the above problems requires that alignment data be organized in ways that facilitate computation. HDF5 provides the foundation to organize and store both read and alignment data to enable different kinds of data comparisons. This ability is demonstrated by the following two examples.

In the first example (left), reads from different sequencing platforms (SOLiD, Illumina, 454) were stored in HDF5. Illumina RNA-Seq reads from three different samples were aligned to the human genome, and annotations from a UCSC GFF (general feature format) file were applied to define gene boundaries. The example shows the alignment data organized into three HDF5 files, one per sample, but in reality the data could have been stored in a single file or in files organized in other ways. One of HDF's strengths is that the HDF5 I/O library can query multiple files as if they were a single file, providing the ability to create the high-level data organizations that are most appropriate for a particular application or use case. With reads and alignments structured in these files, it is a simple matter to integrate data to view base (color) compositions for reads from different sequencing platforms, compare alternative splicing between samples, or select a subset of alignments from a specific genomic region, or gene, in "wig" format for viewing in a tool like the UCSC genome browser.
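
As an illustration of the kind of regional query this enables, the sketch below pulls alignment start positions for one genomic region out of a per-sample HDF5 file and writes fixedStep wig lines. The dataset path and 0-based coordinates are hypothetical assumptions for this example, not the layout used in the files described above.

```python
# A sketch of extracting alignments for one genomic region from a per-sample
# HDF5 file and emitting fixedStep wig lines for the UCSC genome browser.
import h5py
import numpy as np

def region_to_wig(h5_path, chrom, start, end, out_path):
    with h5py.File(h5_path, "r") as h5:
        starts = h5[f"/alignments/{chrom}/start"][...]   # alignment start positions
    in_region = starts[(starts >= start) & (starts < end)]
    # Count alignment starts per base; a simplification of full read coverage.
    counts = np.bincount(in_region - start, minlength=end - start)
    with open(out_path, "w") as out:
        out.write(f"fixedStep chrom={chrom} start={start + 1} step=1\n")  # wig is 1-based
        out.write("\n".join(str(v) for v in counts) + "\n")

region_to_wig("sample1.h5", "chr1", 1000000, 1010000, "sample1_region.wig")
```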

The second example (right) focuses on how organizing alignment data in HDF5 can change how differential alignment problems are approached. When data are organized according to a model that defines granularity and relationships, it becomes easier to compute all alignments between reads and multiple reference sources than to work out how to perform differential alignments and implement that process. In this case, a set of reads (obtained from cDNA) is aligned to primers, QC data (ribosomal RNA [rRNA] and mitochondrial DNA [mtDNA]), miRBase, RefSeq transcripts, the human genome, and a library of exon junctions. During alignment, up to three mismatches are tolerated between a read and its hit. Alignment data are stored in HDF5 and, because the data were not filtered, a greater variety of questions can be asked. Subtractive questions mimic the differential pipeline, where alignments are used to filter reads from subsequent steps. At the same time, we can also ask "biological" questions about the number of reads that came from rRNA or mtDNA, or from genes in the genome or exon junctions. For all of these questions, we can examine the match quality between each read and its matching sequence in the reference data sources, without having to reprocess the same data multiple times.
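
A rough sketch of how subtractive and biological questions can both be answered from one unfiltered alignment set is shown below. The flat (read, library, mismatches) records and the library names are illustrative stand-ins for the HDF5-backed data described above.

```python
# A sketch of querying unfiltered alignments for both subtractive and
# "biological" summaries without re-running any alignment step.
from collections import defaultdict

def summarize(alignments, library_order):
    hits = defaultdict(set)        # library -> read IDs that hit it (<= 3 mismatches)
    perfect = defaultdict(set)     # library -> read IDs with perfect matches
    for read_id, library, mismatches in alignments:
        hits[library].add(read_id)
        if mismatches == 0:
            perfect[library].add(read_id)

    # Biological question: how many reads hit each library at all?
    per_library = {lib: len(hits[lib]) for lib in library_order}

    # Subtractive question: mimic the filter pipeline by counting reads that hit
    # a library but none of the libraries earlier in the order.
    seen, subtractive = set(), {}
    for lib in library_order:
        subtractive[lib] = len(hits[lib] - seen)
        seen |= hits[lib]

    perfect_counts = {lib: len(perfect[lib]) for lib in library_order}
    return per_library, subtractive, perfect_counts

order = ["primers", "rRNA_mtDNA", "miRBase", "refseq", "genome", "exon_junctions"]
records = [("r1", "rRNA_mtDNA", 0), ("r1", "genome", 0), ("r2", "refseq", 2)]
print(summarize(records, order))
```

Because nothing was filtered during alignment, both summaries come from the same set of records, and the match quality (mismatch count) is still available for every hit.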

The above examples demonstrate the benefits of being able to organize data into structures that are amenable to computation. When data are properly structured, new approaches that expand the ways in which data are analyzed can be implemented. HDF5 and its library of software routines shift development effort from optimizing the low-level infrastructure needed to support such systems to designing and testing different data models and exploiting their features.

The final post of this series will cover why we chose to work with HDF5 technology.