Thursday, December 31, 2009

2009 Review

The end of the year is a good time to reflect, review accomplishments, and think about the year to come. 2009 was a good year for Geospiza’s customers, with many exciting accomplishments for the company. Highlights are reviewed below.

Two products form a complete genetic analysis system

Geospiza’s two core products, GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE), help laboratories do their work and scientists analyze their data. GSLE is the LIMS (Laboratory Information Management System) that laboratories, from service labs to high-throughput data production centers, use to collect information about samples, track and manage laboratory procedures, organize and process data, and deliver data and results back to researchers. GSLE supports traditional DNA sequencing (Sanger), fragment analysis, genotyping, microarrays, Next Generation Sequencing (NGS) and other technologies.

In 2008, Geospiza released the third version of the platform (back then it was known as FinchLab). This version launched a new way of providing LIMS solutions. Traditional LIMS systems require extensive programming and customization to meet a laboratory’s specific requirements. They include a very general framework designed to support a wide range of activities. Their advantage is that they are highly customizable. However, this advantage comes at the expense of very high acquisition costs accompanied by lengthy requirements planning and programming before they become operational.

In contrast, GSLE contains default settings that support genetic analysis out-of-the-box, while allowing laboratories to customize operations without programmer support. Default settings in GSLE support DNA sequencing, microarray, and genotyping services. The GSLE abstraction layer supports extensive configuration to meet specific needs as they arise. Through this design, the costs of acquiring and operating a high-quality advanced LIMS system are significantly reduced.

Throughout 2009, hundreds of features were added to GSLE to increase support for instruments and data types, and to improve how laboratory procedures (workflows) are created, managed, and shared. Enhancements were made to features like experiment ordering, organization, and billing. We also added new application programming interfaces (APIs) to enable integration with enterprise software. Specific highlights included:
  • Extending microarray support to include sample sheet generation and automated file uploading
  • Improving NGS file and data browsing to simplify searching and viewing the thousands of files produced in Next Gen sequencing runs
  • Making NGS data downloads, of very large gigabase files, robust and easy
  • Adding worksets to group DNA and RNA samples in customized ways that facilitate laboratory processing
  • Creating APIs to utilize external password servers and programmatically receive data using GSLE form objects
  • Enhancing ways for groups to add HTML to pages to customize their look and feel
In addition to the above features, we’ve also completed development on methods to multiplex NGS samples and track MIDs (molecular identifiers and molecular barcodes), enter laboratory data like OD values and bead counts in batches, create orders with multiple plates, and access SQL queries through an API. Look for these great features and more in the early part of 2010.


As noted, GSAE is Geospiza’s data analysis product. While GSLE is capable of running advanced data analysis pipelines, the primary focus of data analysis in GSLE is quality control; thus its data analyses and presentation focus on single samples. GSAE provides the infrastructure and tools to compare results between samples. In the case of NGS, GSAE also provides more reports and data interactions. GSAE began as a web-based microarray data analysis platform, making it well suited for NGS-based gene expression assays. Over 2009, many new features were added to extend its utility to NGS data analysis, with a focus on whole transcriptome analysis. Highlights included:
  • Developing data analysis pipelines for RNA-Seq, Small RNA, ChIP-Seq, and other kinds of NGS assays
  • Adding tools to visualize and discover alternatively spliced transcripts in gene expression assays
  • Extending expression analysis tools to include interactive volcano plots and unbalanced two-way ANOVA computations
  • Increasing NGS transcriptome analysis capabilities to include variation detection and visualization
The above features fulfill the requirements needed to make a platform complete for both NGS and microarray-based gene expression analysis. And the addition of variation detection and visualization lays the groundwork for GSAE to extend its market leadership to resequencing data analysis.
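A volcano plot places each gene according to its effect size and its statistical significance. As a minimal sketch of the underlying arithmetic (hypothetical helper names, not GSAE's actual implementation):

```python
import math

def volcano_point(mean_control, mean_treated, p_value):
    """Return (x, y) for one gene on a volcano plot:
    x = log2 fold change, y = -log10 p-value."""
    fold_change = mean_treated / mean_control
    return (math.log2(fold_change), -math.log10(p_value))

def significant(point, min_log2_fc=1.0, max_p=0.05):
    """Flag genes beyond both a fold-change and a significance cutoff
    (the two 'arms' of the volcano)."""
    x, y = point
    return abs(x) >= min_log2_fc and y >= -math.log10(max_p)
```

The interactivity in a real tool comes from letting users drag these cutoffs and drill into the genes that pass them.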

Geospiza Research

In 2009 Geospiza won two research awards in the form of Phase II STTR and Phase I SBIR grants. The STTR project is researching new ways to organize, compress, and access NGS data by adapting HDF technologies to bioinformatics. Through this work we are developing a robust data management infrastructure that supports our NGS sequencing analysis pipelines and interactive user interfaces. The second award targets NGS-based variation detection. This work began in the last quarter of the year, but is already delivering new ways to identify and visualize variants in RNA-Seq and whole transcriptome analysis.
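The organize-compress-access pattern behind HDF-style storage can be illustrated in a few lines of pure Python. This toy block-compressed read store only sketches the chunking idea (compress in blocks, index the blocks, fetch any read without decompressing everything); it is not the BioHDF format itself:

```python
import zlib

class ReadStore:
    """Toy block-compressed read store: reads are compressed in
    fixed-size blocks, so fetching one read only decompresses the
    block that contains it (the idea behind chunked HDF5 datasets)."""

    def __init__(self, reads, block_size=1000):
        self.block_size = block_size
        self.blocks = []
        for i in range(0, len(reads), block_size):
            blob = "\n".join(reads[i:i + block_size]).encode()
            self.blocks.append(zlib.compress(blob))

    def get(self, n):
        """Decompress only the block containing read n."""
        block = zlib.decompress(self.blocks[n // self.block_size])
        return block.decode().split("\n")[n % self.block_size]
```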

To learn more about our progress in 2009, visit our news page. It includes our press releases and reports in the news, publications citing our software, and webinars where we have presented our latest and greatest.

As we close 2009, we especially want to thank our customers and collaborators for their support in making the year successful and we look forward to an exciting year ahead in 2010.

Sunday, December 6, 2009

Expeditiously Exponential: Genome Standards in a New Era

One of the hot topics of 2009 has been the exponential growth in genomics and other data, and how this growth will impact data use and sharing. The journal Science explored these issues in its policy forum in October. In early November, I discussed the first article, which was devoted to sharing data and data standards. The second article, listed under the category “Genomics,” focuses on how genomic standards need to evolve with new sequencing technologies.

Drafting By

The premise of the article “Genome Project Standards in a New Era of Sequencing” was to begin a conversation about how to define standards for sequence data quality in this new era of ultra-high-throughput DNA sequencing. One of the “easy” things to do with Next Generation Sequencing (NGS) technologies is create draft genome sequences. A draft genome sequence is defined as a collection of contig sequences that result from one, or a few, assemblies of large numbers of smaller DNA sequences called reads. In traditional Sanger sequencing, a read was between 400 and 800 bases in length and came from a single clone, or sub-clone, of a large DNA fragment. NGS reads come from individual molecules in a DNA library and vary between 36 and 800 bases in length, depending on the sequencing platform being used (454, Illumina, SOLiD, or Helicos).

A single NGS run can now produce enough data to create a draft assembly for many kinds of organisms with smaller genomes such as viruses, bacteria, and fungi. This makes it possible to create many draft genomes quickly and inexpensively. Indeed the article was accompanied by a figure showing that the current growth of draft sequences exceeds the growth of finished sequences by a significant amount. If this trend continues, the ratio of draft to finished sequences will grow exponentially into the foreseeable future.

Drafty Standards

The primary purpose of a complete genome sequence is to serve as a reference for other kinds of experiments. A well-annotated reference is accompanied by a catalog of genes and their functions, as well as an ordering of the genes, regulatory regions, and the sequences needed for evolutionary comparisons that further elucidate genomic structure and function. A problem with draft sequences is that they can contain a large number of errors, ranging from incorrect base calls to more problematic mis-assemblies that place bases or groups of bases in the wrong order. Because these holes leave some sequences more drafty than others, draft sequences are less useful in fulfilling their purpose as reference data.

If we can describe the draftiness of a genome sequence, we may be able to weigh its fitness for a specific purpose. The article went on to tackle this problem by recommending a series of qualitative descriptions for levels of draft sequences. Beginning with the Standard Draft, an assembly of contigs of unfiltered data from one or more sequencing platforms, the terms move through High-Quality Draft, Improved High-Quality Draft, Annotation-Directed Improvement, and Noncontiguous Finished, to Finished. Finished sequence is defined as having less than 1 error per 100,000 bases, with each genomic unit (chromosomes or plasmids that are capable of replication) assembled into a single contig with a minimal number of exceptions. The individuals proposing these standards are a well-respected group in the genome community and are working with the database and sequence ontology groups to incorporate these new descriptions into data submissions and annotations for data that may be used by others.

Given the high cost and lengthy time required to finish genomic sequences, finishing every genome to a high standard is impractical. If we are going to work with genomes that are finished to varying degrees, systematic ways to describe the quality of the data are needed. These policy recommendations are a good start, but more needs to be done to make the proposed standards useful.

First, standards need to be quantitative. Qualitative descriptions are less useful because they create downstream challenges when reference data are used in automated data processing and interpretation pipelines. As the number of available genomes grows into the thousands and tens of thousands, subjective standards make the data more and more cumbersome and difficult to review. Moreover, without quantitative assessment, how will one know when an assembly has an average error rate of 1 in 100,000 bases? The authors intentionally avoided recommending numeric thresholds in the proposed standards because the instrumentation and sequencing methodologies are changing rapidly. This may be true, but future discussions nevertheless should focus on quantitative descriptions for that very reason: it is because data collection methods and instrumentation are changing rapidly that we need measures we can compare. This is the new world.
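A quantitative standard is exactly the kind a machine can check. Assuming per-base Phred-style quality values, where the implied error probability is 10^(-Q/10), a sketch of such a check might look like:

```python
def expected_error_rate(quality_values):
    """Mean per-base error probability implied by Phred quality
    values (error probability = 10 ** (-Q / 10))."""
    probs = [10 ** (-q / 10) for q in quality_values]
    return sum(probs) / len(probs)

def meets_finished_standard(quality_values, max_rate=1e-5):
    """The article's Finished threshold: fewer than 1 error per
    100,000 bases on average."""
    return expected_error_rate(quality_values) < max_rate
```

A pipeline could apply such a test automatically to decide whether a reference is fit for a given analysis, which is precisely what a purely qualitative label cannot do.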

Second, the article fails to address how the different standards might be applied in a practical sense. For example, what can I expect to do with a finished genome that I cannot do with a nearly finished genome? What is a standard draft useful for? How should I trust my results and what might I expect to do to verify a finding? While the article does a good job describing the quality attributes of the data that genome centers might produce, the proposed standards would have broader impact if they could more specifically set expectations of what could be done with data.

Without this understanding, we still won't know when our data are good enough.

Sunday, November 22, 2009

Supercomputing 09

Teraflops, exaflops, exabytes, exascale, extreme, high dimensionality, 3D Internet, sustainability, green power, high performance computation, 400 Gbps networks, and massive storage were just some of the many buzz words seen and heard last week at the 21st annual supercomputing conference in Portland, Oregon.

Supercomputing technologies and applications are important to Geospiza. As biology becomes more data intensive, Geospiza follows the latest science and technology developments by attending conferences like Supercomputing. This year, we participated in the conference through a "birds of a feather" session focused on sharing recent progress in the BioHDF project.

Each year the Supercomputing (SC) conference has focus areas called "thrusts." This year the thrusts were 3D Internet, Biocomputing, and Sustainability. Each day of the technical session started with a keynote presentation that focused on one of the thrusts. Highlights from the keynotes are discussed below.

First thrust: the 3D Internet

The technical program kicked off with a keynote from Justin Rattner, VP and CTO at Intel. In his address, Rattner discussed the business reality that high performance computing (HPC) is an $8 billion business with little annual growth (3% AGR). The primary sources for HPC funding are government and trickle up technology from PC sales. To break the dependence on government funding, Rattner suggested that HPC needs a "killer app" and suggested that the 3D Internet might just be that app. He went on to elaborate on the kinds of problems, such as continuously simulating environments, realistic animation, dynamic modeling and continuous perspectives, that are solved with HPC. Also, because immersive and collaborative virtual environments can be created, the 3D Internet provides a platform for enabling many kinds of novel work.

To illustrate, Rattner was joined by Aaron Duffy, a researcher at Utah State. Rather, Duffy’s avatar joined us, as his presentation was given in the Science SIM environment. Science SIM is a virtual reality system that is used to model environments and how they respond to change. For example, Utah State is studying how ferns respond to and propagate genetic changes in FernLand. Another example showed how 3D modeling can save time and materials in fashion design.

Next, Rattner described how the current 3D Internet resembles the early days of the Internet, when people struggled with the isolated networks of AOL, Prodigy, and CompuServe. It wasn't until Tim Berners-Lee and Marc Andreessen introduced the World Wide Web's HTTP protocol and the Mosaic web browser that the Internet had a platform on which to standardize. Similarly, the 3D Internet needs such a platform. Rattner introduced OpenSim as a possibility. In the OpenSim platform, extensible modules can be used to create different worlds. Because these worlds are built with a common infrastructure, users could have an avatar that moves between worlds, rather than a new avatar for each world as they do today.

Second thrust: biocomputing

Leroy Hood kicked off the second day with a keynote on how supercomputing can be applied to systems biology and personalized medicine. Hood expects that within 10 years, diagnostic assays will be characterized by billions of measurements. We will have two primary kinds of information feeding these assays: the digital data of the organism and data from the environment. The challenge is measuring how the environment affects the organism. To make this work, we need to integrate biology, technology, and computers in better ways than we do today.

In terms of personalized medicine, Hood described different kinds of analyses and their purpose. For example, global analysis - such as sequencing a genome, measuring gene expression, or comprehensive protein analysis - creates catalogs. These catalogs then form the foundation for future data collection and analysis. The goal of such analysis is to create predictive actionable models. Biological data however, are noisy, and meaningful signals can be difficult to detect, so improving the signal to noise ratio requires the ability to integrate large volumes of multi-scalar data with diverse data types including biological knowledge. As the goal is to develop predictive actionable models we need supercomputers capable of dynamically quantifying information.

As an example, Hood presented work showing how disease states result in perturbations in regulated networks. In prion disease, the expression of many genes change over time as non-disease states move toward disease states. Interestingly, as disease progression is followed in mouse models, one can see expression levels change in genes that were not thought to be involved in prion disease. More importantly, these genes show expression changes before the physiological effects are observed. In other words, by observing gene expression patterns, one can detect a disease much earlier than they would by observing symptoms. Because diseases detected early are easier to treat, early detection can have beneficial consequences for reducing health care costs. However, measuring gene expression changes by observing changes in RNA levels is currently impractical. The logical next step is to see if gene expression can be measured by detecting changes in the levels of blood proteins. Of course, Hood and team are doing that too, and he showed data, from the prion model, that this is a feasible approach.

Using the above example, and others from whole genome sequencing, Hood painted a picture of future diagnostics in which we will have our genomes sequenced at birth and each of us will have a profile created of organ-specific proteins. In Hood's world, this would require 50 measurements from 50 organs. Blood protein profiles will be used as controls in regular diagnostic assays. In other assays, like cancer diagnostics, thousands of individual transcriptomes will be measured simultaneously in single assays. Similarly, 10,000 B-cells or T-cells could be sequenced to assess immune states and diagnose autoimmune disorders. In the not too distant future, it will be possible to interrogate databases containing billions of data points from hundreds of millions of patients.

With these possibilities on the horizon, a number of challenges must be overcome. Data space is infinite, so queries must be constructed carefully. The data that need to be analyzed have high dimensionality, so we need new ways to work with them. Finally, multi-scale datasets must be integrated and data analysis systems must be interoperable; meeting these last challenges requires that standards for working with data be developed and adopted. Hood closed by making the point that groups like his can solve some of the scientific issues related to computation, but not the infrastructure issues that must also be solved to make the vision a reality.

Fortunately, Geospiza is investigating technologies to meet current and future biocomputing challenges through the company’s product development and standards initiatives like the BioHDF project.

Third thrust: sustainability

Al Gore gave the third day’s keynote address, and much of his talk addressed climate change. Gore reminded us that 400 years ago, Galileo collected the data that supported Copernicus’ theory that the earth’s rotation creates the illusion of the sun moving across the sky. He went on to explain how Copernicus reasoned that the illusion is created because the sun is so far away. Gore also explained how difficult it was for people of Copernicus’ or Galileo’s time to accept that the universe does not rotate around the earth.

Similarly, when we look into the sky we see an expansive atmosphere that seems to go on forever. Pictures from space, however, tell a different story. Those pictures show us that our atmosphere is a thin band, only about 1/1000 of the earth’s volume. The finite volume of our atmosphere explains how we can change our climate when we pump billions of tons of CO[2] into it, as we are doing now. It is also hard for many to conceptualize that CO[2] is affecting the climate when they do not see or feel direct or immediate effects. Gore added the interesting connection that the first oil well, drilled by “Colonel” Edwin Drake in Pennsylvania, and the discovery, by John Tyndall, that CO[2] absorbs infrared radiation both occurred in 1859. 150 years ago we not only had the means to create climate change, but understood how it would work.

Gore outlined a number of ways in which supercomputing and the supercomputing community can help with global warming. Climate modeling and climate prediction are two important areas where supercomputers are used; conference presentations and demonstrations on the exhibit floor made this clear. Less obvious applications involve modeling new electrical grids and more efficient modes of transportation. Many of the things we rely on daily are based on infrastructures that are close to 100 years old. From our internal combustion engines to our centralized electrical systems, inefficiency can be measured in the billions of dollars lost annually to system failures or ineffective energy consumption.

Gore went on to remind us that Moore’s law is a law of self-fulfilling expectations. When first proposed, it was a recognition of design and implementation capabilities with an eye to the future. Moore’s law worked because R&D funding was established to stay on track. We now have an estimated one billion transistors for every person on the planet. If we commit similar efforts to improving energy efficiency in ways analogous to Moore’s law, we can create a new self-fulfilling paradigm. The benefits of such a commitment would be significant. As Gore pointed out, our energy, climate, and economic crises are intertwined. Much of our national policy is in reaction to oil production disruption, or the threat of disruption, and the costs of our policies are significant.

In closing, Gore stated that supercomputing is the most powerful technology we have today and represents the third form of knowledge creation, the first two being inductive and deductive reasoning. With supercomputers we can collect massive amounts of data, develop models, and use simulation to develop predictive and testable hypotheses. Gore noted that humans have a low bit rate, but high resolution: while our ability to absorb data is slow, we are very good at recognizing patterns. Thus computers, with their ability to store and organize data, can be programmed to convert data into information and display information in new ways, giving us new insights for solutions to the most vexing problems.

This last point resonated through all three keynotes. Computers are giving us new ways to work with data and understand problems; they are also providing new ways to share information and communicate with each other.

Geospiza is keenly aware of this potential and a significant focus of our research and development is directed toward solving data analysis, visualization, and data sharing problems in genomics and genetic analysis. In the area of Next Generation Sequencing (NGS), we have been developing new ways to organize and visualize the information contained in NGS datasets to easily spot patterns amidst the noise.

Sunday, November 8, 2009

Expeditiously Exponential: Data Sharing and Standardization

We can all agree that our ability to produce genomics and other kinds of data is increasing at exponential rates. Less clear is understanding the consequences for how these data will be shared and ultimately used. These topics were explored in last month's (Oct. 9, 2009) policy forum feature in the journal Science.

The first article, listed under the category "megascience," dealt with issues about sharing 'omics data. The challenge is that systems biology research demands that data from many kinds of instrument platforms (DNA sequencing, mass spectrometry, flow cytometry, microscopy, and others) be combined in different ways to produce a complete picture of a biological system. Today, each platform generates its own kind of "big" data that, to be useful, must be computationally processed and transformed into standard outputs. Moreover, the data are often collected by different research groups focused on particular aspects of a common problem. Hence, the full utility of the data being produced can only be realized when the data are made open and shared throughout the scientific community. The article listed past efforts in developing sharing policies, and its central table included 12 data sharing policies that are already in effect.

Sharing data solves half of the problem; the other half is being able to use the data once shared. This requires that data be structured and annotated in ways that make them understandable by a wide range of research groups. Such standards typically include minimum information checklists that define specific annotations and which data should be kept from different platforms. The data and metadata are stored in structured documents that reflect a community's view about what is important to know with respect to how data were collected and the samples the data were collected from. The problem is that annotation standards are developed by diverse groups and, like the data, are expanding. This expansion creates new challenges for making data interoperable - the very problem standards try to address.
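In code, a minimum information checklist reduces to verifying that the required annotations are present before a submission is accepted. A minimal sketch, using a hypothetical checklist (real community checklists are much larger):

```python
# A hypothetical minimum-information checklist for a sequencing
# submission; real community checklists define many more fields.
REQUIRED_FIELDS = {"sample_id", "organism", "platform", "library_protocol"}

def missing_annotations(record):
    """Return the checklist fields absent (or empty) in a metadata record."""
    return sorted(f for f in REQUIRED_FIELDS if not record.get(f))
```

A submission pipeline would reject any record for which this list is non-empty, which is how a checklist becomes enforceable rather than advisory.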

The article closed with high-level recommendations for enforcing policy through funding and publication requirements and acknowledged that full compliance requires that general concerns with pre-publication data use and patient information be addressed. More importantly, the article acknowledged that meeting data sharing and formatting standards has economic implications. That is, researchers need time-efficient data management systems, the right kinds of tools, and informatics expertise to meet standards. We also need to develop the right kind of global infrastructure to support data sharing.

Fortunately, complying with data standards is an area where Geospiza can help. First, our software systems rely on open, scientifically valid tools and technologies. In DNA sequencing we support community-developed alignment algorithms. The statistical analysis tools in GeneSifter Analysis Edition utilize R and BioConductor to compare gene expression data from both microarrays and DNA sequencing. Further, we participate in the community by contributing additional open-source tools and standards through efforts like the BioHDF project. Second, the GeneSifter Analysis and Laboratory platforms provide the time-efficient data management solutions needed to move data through their complete life cycle: from collection, to intermediate analysis, to publishing files in standard formats.

GeneSifter lowers researchers' economic barriers to meeting data sharing and annotation standards, keeping the focus on doing good science with the data.

Sunday, November 1, 2009

GeneSifter Laboratory Edition Update

GeneSifter Laboratory Edition has been updated to version 3.13. This release has many new features and improvements that further enhance its ability to support all forms of DNA sequencing and microarray sample processing and data collection.

Geospiza Products

Geospiza's two primary products, GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE), form a complete software system that supports many kinds of genomics and genetic analysis applications. GSLE is the LIMS (Laboratory Information Management System) used by core labs and service companies worldwide that offer DNA sequencing (Sanger and Next Generation), microarray analysis, fragment analysis, and other forms of genotyping. GSAE is the analysis system researchers use to analyze their data and make discoveries. Both products are actively updated to keep current with the latest science and technological advances.

The new release of GSLE helps labs share workflows, perform barcode-based searching, view new data reports, simplify invoicing, and automate data entry through a new API (application programming interface).

Sharing Workflows

GSLE laboratory workflows make it possible for labs to define and track their protocols and data that are collected when samples are processed. Each step in a protocol can be configured to collect any kind of data, like OD values, bead counts, gel images and comments, that are used to record sample quality. In earlier versions, protocols could be downloaded as PDF files that list the steps and their data. With 3.13, a complete workflow (steps, rules, custom data) can be downloaded as an XML file that can be uploaded into another GSLE system to recreate the entire protocol with just a few clicks. This feature simplifies protocol sharing and makes it possible for labs to test procedures in one system and add them to another when they are ready for production.
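The exported workflow is an XML document. As a sketch of what consuming such an export involves, using a hypothetical schema (not GSLE's actual file format):

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow export; GSLE's real XML schema is not shown here.
SAMPLE_WORKFLOW = """\
<workflow name="Plasmid Prep">
  <step order="1" name="Measure OD"><data type="number">OD260</data></step>
  <step order="2" name="Gel Check"><data type="image">gel</data></step>
</workflow>"""

def workflow_steps(xml_text):
    """Parse a workflow export into (order, step name, data fields)."""
    root = ET.fromstring(xml_text)
    return [(int(s.get("order")), s.get("name"),
             [d.text for d in s.findall("data")])
            for s in root.findall("step")]
```

Because the file carries the steps, their order, and the custom data fields together, a receiving system can rebuild the entire protocol from the one document.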

Barcode Searching and Sample Organization

Sometimes a lab needs to organize separate tubes in 96-well racks for sample preparation. Assigning each tube's rack location can be an arduous process. However, if the tubes are labeled with barcode identifiers, a bed scanner can be used to make the assignments. GSLE 3.13 provides an interface to upload bed scanner data and assign tube locations in a single step. Also, new search capabilities have been added to find orders in the system using sample or primer identifiers. For example, orders can be retrieved by scanning a barcode from a tube in the search interface.
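A bed scanner export is essentially a mapping of rack position to tube barcode. Assuming a hypothetical 'well,barcode' CSV layout (real scanner formats vary), the single-step upload reduces to:

```python
import csv
import io

def assign_tube_locations(scan_csv):
    """Map rack well positions to tube barcodes from a bed scanner
    export (hypothetical 'well,barcode' CSV; real formats vary)."""
    positions = {}
    for row in csv.DictReader(io.StringIO(scan_csv)):
        positions[row["well"]] = row["barcode"]
    return positions
```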

Reports and Data

Throughout GSLE, many details about data can be reviewed using predefined reports. In some cases, pages can be quite long, but only a portion of the report is interesting. GSLE now lets you collapse sections of report pages to focus on specific details. New download features have also been added to better support access to those very large NGS data files.

GSLE has always been good at identifying duplicate data in the system, but not always as good at letting you decide how duplicate data are managed. Managing duplicate data is now more flexible to better support situations where data need to be reanalyzed and reloaded.

The GSLE data model makes it possible to query the database using SQL. In 3.13, the view tables interface has been expanded so that the data stored in each table can be reviewed with a single click.


Core labs that send invoices will benefit from changes that make it possible to download many PDF-formatted orders and invoices as a single zipped folder. Configurable automation capabilities have also been added to set invoice due dates and generate multiple invoices from a set of completed orders.

API Tools

As automation and system integration needs increase, external programs are used to enter data from other systems. GSLE 3.13 supports automated data entry through a novel self-documenting API. The API takes advantage of GSLE's built-in data validation features that are used by the system's web-based forms. At each site, the API can be turned on and off by on-site administrators, and its access can be limited to specific users. This way, all system transactions are easily tracked using existing GSLE logging capabilities. In addition to data validation and access control, each API-enabled form has a header that includes key codes, example documentation, and features to view and manually upload formatted data to test automation programs and help system integrators get their work done. GSLE 3.13 further supports enterprise environments with an improved API that is used to query external password authentication servers.
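An external program talking to a form-based API like this would validate its data against the form's required fields before submitting. A hedged sketch with entirely hypothetical field names (GSLE's actual endpoints and field codes are not shown here):

```python
from urllib.parse import urlencode

def build_submission(required_fields, values):
    """Check values against a form's required fields, then encode them
    as a URL-encoded request body (hypothetical field names)."""
    missing = [f for f in required_fields if f not in values]
    if missing:
        raise ValueError("missing fields: %s" % ", ".join(missing))
    return urlencode({f: values[f] for f in required_fields})
```

Validating on the client side before posting mirrors the server-side validation the API inherits from the web forms, so failures are caught early and logged transactions stay clean.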

Friday, October 23, 2009

Yardsticks and Sequencers

A recent question to the ABRF discussion forum, about quality values and Helicos data, led to an interesting conversation about having yardsticks to compare between Next Generation Sequencing (NGS) platforms and the common assays that are run on those platforms.

It also got me thinking, just how well can you measure things with those free wooden yardsticks you get at hardware stores and home shows?


The conversation started with a question asking about what kind of quality scoring system could be applied to Helicos data. Could something similar to Phred and AB files be used?

A couple of answers were provided. One referred to the recent Helicos article in Nature Biotechnology and pointed out that Helicos has such a method. This answer also addressed the issue that quality values (QVs) need to be tuned for each kind of instrument.

Another answer, from a core lab director with a Helicos instrument, pointed out the many additional challenges of comparing data from different applications, and how software in this area is lacking. He used the metaphor of the yardstick to make the point that researchers need systematic tools and methods to compare data and platforms.

What's in a Yardstick?

I replied to the thread noting that we've been working with data from 454, Illumina GA, SOLiD and Helicos and there are multiple issues that need to be addressed in developing yardsticks to compare data from different instruments for different experiments (or applications).

At one level, there is the instrument and the data it produces, and the question is: can we have a standard quality measure? Recall that, in Phred's case, each instrument needed to be calibrated so that quality values would be useful and equivalent across chemistries and platforms (primers, terminators, BigDye, gel, cap, AB models, MegaBACE ...). Remember phredpar.dat? Because the data were of a common type - an electropherogram - we could more or less use a single tool and define a standard. Even then, other tools (LifeTrace, KB basecaller, and LongTrace) emerged and computed standardized quality values differently. So, I would argue that we think we have a measure, but it is not the standard we think it is.

By analogy, each NGS instrument uses a very different method to generate sequences, so each platform will have a unique error profile. The good news is that quality values, as transformed error probabilities, make it possible to compare output from different instruments in terms of confidence. The bad news is that if you do not know how the error probability is computed, or you do not have enough data (control, test) to calibrate the system, error probabilities are not useful. Add to that the fact that the platforms are undergoing rapid change as vendors improve chemistry and revise hardware and software to increase throughput and accuracy. So, for the time being we might have yardsticks, but they have variable lengths.

The next levels deal with experiments. As noted, ChIP-Seq, RNA-Seq, Me-Seq, Re-Seq, and your-favorite-Seq all measure different things, and we are just learning about how errors and other artifacts interfere with how well the data produced actually measure what the experiment intended to measure. Experiment-level methods need to be developed so that ChIP-Seq from one platform can be compared to ChIP-Seq from another platform, and so on. However, the situation is not dire because, in the end, DNA sequences are the final output, and for many purposes the data produced are much better now than they have been in the past. As we push sensitivity, though, the issues already discussed become very relevant.

As a last point, the goal many researchers will have is to layer data from one experiment on another - to correlate ChIP-Seq with RNA-Seq, for example. To do that, you not only need quality measures for data, samples, and experiments, you also need ways to integrate all of this experimental information with already published data. There is a significant software challenge ahead and, as pointed out, cobbling solutions together is not a feasible long-term answer. The datasets are getting too big and complex, and at the same time the archives are bursting with data generated by others.

So what does this have to do with yardsticks?

Back to yardsticks. Those cheap wooden yardsticks expand and contract with temperature and humidity, so at different times a yardstick's measurements will change. This change is the uncertainty of the measurement (see additional reading below), which defines the precision of our measuring device. If I want a quick estimate of how tall my dog stands, I would happily use the wooden yardstick. However, if I want to measure something to within a 32nd of an inch or a millimeter, I would use a different tool. The same rules apply to DNA sequencing: for many purposes the reads are good enough and data redundancy overcomes errors, but as we push sensitivity and want to measure changes in fewer molecules, discussions about how to compute QVs and annotate data, so that we know which measuring device was used, become very important.

Finally, I often see in the literature and company brochures, and hear in conversation, references to QVs as Phred scores. Remember: only Phred makes Phred QVs - everything else is Phred-like, and only if it is a -10log10(P) transformation of an error probability.
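The transformation itself is simple enough to sketch in a few lines of Python:

```python
import math

def prob_to_qv(p_error: float) -> float:
    """Phred-like quality value: QV = -10 * log10(P), where P is the
    probability that the base call is wrong."""
    return -10 * math.log10(p_error)

def qv_to_prob(qv: float) -> float:
    """Inverse transformation: recover the error probability from a QV."""
    return 10 ** (-qv / 10)

print(prob_to_qv(0.001))  # → 30.0 (1 error in 1,000 bases)
print(qv_to_prob(20))     # → 0.01 (1 error in 100 bases)
```

The math is the easy part; as argued above, the hard part is calibrating, for each platform and chemistry, the error probabilities that go into it.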

Additional Reading:

Color Space, Flow Space, Sequence Space, or Outer Space: Part I. Uncertainty in DNA Sequencing

Color Space, Flow Space, Sequence Space or Outer Space: Part II, Uncertainty in Next Gen Data

Monday, October 19, 2009

Sneak Peek: Gene Expression Profiling in Cancer Research: Microarrays, Tag Profiling and Whole Transcriptome Analysis

Join us this Wednesday (October 21, 2009 10:00 am Pacific Daylight Time) to learn about how GeneSifter is used to measure transcript abundance as well as discover novel transcripts and isoforms of expressed genes in cancer.


Current gene expression technologies such as Microarrays and Next Generation Sequencing applications allow biomedical researchers to examine the expression of tens of thousands of genes at once, giving them the opportunity to examine expression for an entire genome where previously they could only look at a handful of genes at a time.

In addition, NGS applications such as Tag Profiling and Whole Transcriptome Analysis can identify novel transcripts and characterize both known and novel splice junctions. These applications allow characterization of the cancer transcriptome at an unprecedented level.

This presentation will provide an overview of the gene expression data analysis process for these applications, with an emphasis on identification of differentially expressed genes, identification of novel transcripts, and characterization of alternative splicing. Using data drawn from the GEO data repository and the Short Read Archive, gene expression in melanoma will be examined using Microarrays, NGS Tag Profiling, and NGS Whole Transcriptome Analysis data.

Tuesday, October 13, 2009

Super Computing 09 and BioHDF

Next month, Nov 16-20, we will be in Portland for Super Computing 09 - SC09. Join us at a Birds of a Feather (BoF) session to learn about developing bioinformatics applications with BioHDF. The session will be Wed. Nov 18 at 12:15 pm in room D139-140.

Developing Bioinformatics Applications with BioHDF

In this session we will present how HDF5 can be used to work with large volumes of DNA sequence data. We will cover the current state of bioinformatics tools that utilize HDF5 and proposed extensions to the HDF5 library to create BioHDF. The session will include a discussion of the requirements being considered in developing data models for working with DNA sequence alignments to measure variation within sets of DNA sequences.

HDF5 is an open-source technology suite for managing diverse, complex, high-volume data in heterogeneous computing and storage environments. The BioHDF project is investigating the use of HDF5 for working with very large scientific datasets. HDF5 provides a hierarchical data model, binary file format, and collection of APIs supporting data access. BioHDF will extend HDF5 to support DNA sequencing requirements.

Initial prototyping of BioHDF has demonstrated clear benefits. Data can be compressed and indexed in BioHDF to reduce storage needs and enable very rapid (typically a few milliseconds) random access into these sequence and alignment datasets, essentially independent of the overall HDF5 file size. Through additional prototyping activities, we have identified key architectural elements and tools that will form BioHDF.
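The compress-and-index pattern behind those numbers can be illustrated in plain Python. This is only a sketch of the idea, not the BioHDF or HDF5 API: HDF5 provides chunked, compressed datasets and indexing natively.

```python
import zlib

# A toy read set; real NGS datasets hold millions of such records.
reads = ["ACGTACGTAA", "TTGGCCAATT", "GATTACAGAT", "CCCGGGTTTA"]

# "Write": compress each record and note its (offset, length) in an index.
chunks, index, offset = [], [], 0
for read in reads:
    blob = zlib.compress(read.encode())
    chunks.append(blob)
    index.append((offset, len(blob)))
    offset += len(blob)
store = b"".join(chunks)

# "Read": the index lets us decompress read #2 directly, touching only
# its bytes - the cost is independent of how large the store grows.
start, length = index[2]
print(zlib.decompress(store[start:start + length]).decode())  # → GATTACAGAT
```

HDF5 generalizes this scheme to multidimensional datasets, B-tree indexed chunks, and configurable compression filters, which is where the millisecond random-access times come from.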

The BoF session will include a presentation of the current state of BioHDF and proposed implementations to encourage discussion of future directions.

Thursday, October 8, 2009

Resequencing and Cancer

Yesterday we released news about new funding from NIH for a project to work on ways to improve how variations between DNA sequences are detected using Next Generation Sequencing (NGS) technology. The project emphasizes detecting rare variation events to improve cancer diagnostics, but the work will support a diverse range of resequencing applications.

Why is this important?

In October 2008, U.S. News and World Report published an article by Bernadine Healy, former head of NIH. The tag line “understanding the genetic underpinnings of cancer is a giant step toward personalized medicine” (1) underscores how the popular press views the promise of recent advances in genomics technology in general, and the progress toward understanding the molecular basis of cancer in particular. In the article, Healy presents a scenario where, in 2040, a 45-year-old woman who has never smoked develops lung cancer. She undergoes outpatient surgery, and her doctors quickly scrutinize the tumor’s genes, using a desktop computer to analyze the tumor genomes and her medical records to create a treatment plan. She is treated, the cancer recedes, and subsequent checkups are conducted to monitor tumor recurrence. Should a tumor be detected, her doctor would quickly analyze the DNA of a few of the shed tumor cells and prescribe a suitable next round of therapy. The patient lives a long happy life, and keeps her hair.

This vision of successive treatments based on genomic information is not unrealistic, claims Healy, because we have learned that while many cancers can look homogeneous in terms of morphology and malignancy they are indeed highly complex and varied when examined at the genetic level. The disease of cancer is in reality a collection of heterogeneous diseases that, even for common tissues like the prostate, can vary significantly in terms of onset and severity. Thus, it is often the case that cancer treatments, based on tissue type, fail, leaving patients to undergo a long painful process of trial and error therapies with multiple severely toxic compounds.

Because cancer is a disease of genomic alterations, understanding the sources, causes, and kinds of mutations, their connection to specific types of cancer, and how they may predict tumor growth is worthwhile. The human cancer genome project (2) and initiatives like the international cancer genome consortium (3) have demonstrated this concept. The kinds of mutations found in tumor populations thus far by NGS include single nucleotide polymorphisms (SNPs), insertions and deletions, and small structural copy number variations (CNVs) (4, 5). From early studies it is clear that a greater amount of genomic information will be needed to make Healy's scenario a reality. Next generation sequencing (NGS) technologies will drive this next phase of research and enable our deeper understanding.

Project Synopsis

The great potential for the clinical applications of new DNA sequencing technologies comes from their highly sensitive ability to assess genetic variation. However, to make these technologies clinically feasible, we must assay patient samples at far higher rates than is possible with current NGS procedures. To date, experiments applying NGS in cancer research have investigated small numbers of samples in great detail, in some cases comparing entire genomes from tumor and normal cells from a single patient (6-8). These experiments show that when a region is sequenced with sufficient coverage, numerous mutations can be identified.

To move NGS technologies into clinical use, many costs must decrease. Two ways costs can be lowered are to increase sample density and reduce the number of reads needed per sample. Because cost is a function of turnaround time and read coverage, and read coverage is a function of the signal-to-noise ratio, assays with higher background noise, due to errors in the data, will require higher sampling rates to detect true variation and will be more expensive. To put this in context, future cancer diagnostic assays will likely need to look at over 4000 exons per test. In cases like bladder cancer, or cancers where stool or blood are sampled, non-invasive tests will need to detect variations in one out of 1000 cells. Thus it is extremely important that we understand signal-to-noise ratios and be able to calculate read depth in a reliable fashion.

Currently we have a limited understanding of how many reads are needed to detect a given rare mutation. Detecting mutations depends on a combination of sequencing accuracy and depth of coverage. The ratio of signal (true mutations) to noise (false or hidden mutations) depends on how many times we see a correct result. Sequencing accuracy is affected by multiple factors that include sample preparation, sequence context, sequencing chemistry, instrument accuracy, and basecalling software. Current depth-of-coverage calculations assume that sampling is random, an assumption that does not hold in the real world. Corrections will have to be applied to adjust for real-world sampling biases that affect read recovery and sequencing error rates (9-11).
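Under that idealized random-sampling assumption, the relationship between depth of coverage and the chance of detecting a rare variant can be sketched with a binomial model. The numbers below are hypothetical, for illustration only; the corrections discussed above would modify them in practice.

```python
from math import comb

def detection_prob(depth: int, var_freq: float, min_reads: int) -> float:
    """Probability of observing a variant of frequency var_freq in at
    least min_reads of depth reads, assuming ideal random sampling."""
    p = var_freq  # chance any single read carries the variant
    return 1 - sum(comb(depth, k) * p**k * (1 - p)**(depth - k)
                   for k in range(min_reads))

# A variant present in 1 of 1,000 cells, requiring 5 supporting reads
# to call it above the noise: roughly 97% detection at 10,000x coverage.
print(detection_prob(10_000, 0.001, 5))
```

Relaxing the random-sampling assumption, and folding in per-platform error rates, is exactly what makes reliable read-depth calculation an open problem.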

Developing clinical software systems that can work with NGS technologies to quickly and accurately detect rare mutations requires a deep understanding of the factors that affect NGS data collection and interpretation. This information needs to be integrated into decision control systems that can, through a combination of computation and graphical displays, automate and aid a clinician’s ability to verify and validate results. Developing such systems is a major undertaking involving a combination of research and development in the areas of laboratory experimentation, computational biology, and software development.

Positioned for Success

Detecting small genetic changes in clinical samples is ambitious. Fortunately, Geospiza has the right products to deliver on the goals of the research. GeneSifter Lab Edition handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. The data management and analysis server makes the system scalable through a distributed architecture. Combined with GeneSifter Analysis Edition, it forms a complete platform to rapidly prototype the data analysis workflows needed to test new analysis methods, experiment with new data representations, and iteratively develop data models to integrate results with experimental details.


Press Release: Geospiza Awarded SBIR Grant for Software Systems for Detecting Rare Mutations

1. Healy, 2008. "Breaking Cancer's Gene Code - US News and World Report"

2. Working Group, 2005. "Recommendation for a Human Cancer Genome Project"

3. ICGC, 2008. "International Cancer Genome Consortium - Goals, Structure, Policies & Guidelines - April 2008"

4. Jones S., et al., 2008. "Core Signaling Pathways in Human Pancreatic Cancers Revealed by Global Genomic Analyses." Science 321, 1801.

5. Parsons D.W., et al., 2008. "An Integrated Genomic Analysis of Human Glioblastoma Multiforme." Science 321, 1807.

6. Campbell P.J., et al., 2008. "Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing." Proc Natl Acad Sci U S A 105, 13081-13086.

7. Greenman C., et al., 2007. "Patterns of somatic mutation in human cancer genomes." Nature 446, 153-158.

8. Ley T.J., et al., 2008. "DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome." Nature 456, 66-72.

9. Craig D.W., et al., 2008. "Identification of genetic variants using bar-coded multiplexed sequencing." Nat Methods 5, 887-893.

10. Ennis P.D., et al., 1990. "Rapid cloning of HLA-A,B cDNA by using the polymerase chain reaction: frequency and nature of errors produced in amplification." Proc Natl Acad Sci U S A 87, 2833-2837.

11. Reiss J., et al., 1990. "The effect of replication errors on the mismatch analysis of PCR-amplified DNA." Nucleic Acids Res 18, 973-978.

Tuesday, October 6, 2009

From Blue Gene to Blue Genome? Big Blue Jumps In with DNA Transistors

Today, IBM announced that they are getting into the DNA sequencing business and the race for the $1,000 genome, having won a research grant to explore new sequencing technology based on nanopore devices they call DNA transistors.

IBM news travels fast. Genome Web and The Daily Scan covered the high points and Genetic Future presented a skeptical analysis of the news. You can read the original news at the IBM site, and enjoy a couple of videos.

A NY Times article listed a couple of facts that I thought were interesting: First, IBM becomes the 17th company to pursue the next next-gen (or third-generation) technology. Second, according to George Church, in the past five years the cost of collecting DNA sequence data has decreased 10-fold annually and is expected to continue decreasing at a similar pace for the next few years.

But what does this all mean?

It is clear from this and other news that DNA sequencing is fast becoming a standard way to study genomes, profile gene expression, and measure genetic variation. It is also clear that while the cost of DNA sequencing is decreasing at a fast rate, the amount of data being produced is increasing at a similarly fast rate.

While some of the articles above discussed the technical hurdles nanopore sequencing must overcome, none discussed the real challenges researchers face today in using the data. The fact is, for most groups, the current next-gen sequencers are underutilized because the volume of data, combined with the complexity of data analysis, has created a significant bioinformatics bottleneck.

Fortunately, Geospiza is clearing data analysis barriers by delivering access to systems that provide standard ways of working with the data and visualizing results. For many NGS applications, groups can upload their data to our servers, align reads to reference data sources, and compare the resulting output across multiple samples through efficient and cost-effective processes.

And, because we are meeting the data analysis challenges for all of the current NGS platforms, we'll be ready for whatever comes next.

Wednesday, September 23, 2009

GeneSifter in Current Protocols

This month we are pleased to report Geospiza's publication of the first standard protocols for analyzing Next Generation Sequencing (NGS) data. The publication, appearing in the September issue of Current Protocols, addresses how to analyze data from both microarray and NGS experiments. The abstract and links to the paper and our press release are provided below.


Transcription profiling with microarrays has become a standard procedure for comparing the levels of gene expression between pairs of samples, or multiple samples following different experimental treatments. New technologies, collectively known as next-generation DNA sequencing methods, are also starting to be used for transcriptome analysis. These technologies, with their low background, large capacity for data collection, and dynamic range, provide a powerful and complementary tool to the assays that formerly relied on microarrays. In this chapter, we describe two protocols for working with microarray data from pairs of samples and samples treated with multiple conditions, and discuss alternative protocols for carrying out similar analyses with next-generation DNA sequencing data from two different instrument platforms (Illumina GA and Applied Biosystems SOLiD).

In the chapter we cover the following protocols:
  • Basic Protocol 1: Comparing Gene Expression from Paired Sample Data Obtained from Microarray Experiments
  • Alternate Protocol 1: Compare Gene Expression from Paired Samples Obtained from Transcriptome Profiling Assays by Next-Generation DNA Sequencing
  • Basic Protocol 2: Comparing Gene Expression from Microarray Experiments with Multiple Conditions
  • Alternate Protocol 2: Compare Gene Expression from Next-Generation DNA Sequencing Data Obtained from Multiple Conditions


To view the abstract, contents, figures, and literature cited online visit: Curr. Protoc. Bioinform. 27:7.14.1-7.14.34

To view the press release visit: Geospiza Team Publishes First Standard Protocol for Next Gen Data Analysis

Saturday, September 12, 2009

Sneak Peek: Sequencing the Transcriptome: RNA Applications for Next Generation Sequencing

Join us this coming Wednesday, September 16, 2009 10:00 am Pacific Daylight Time (San Francisco, GMT-07:00), for a webinar on whole transcriptome analysis. In the presentation you will learn about how GeneSifter Analysis Edition can be used to identify novel RNAs and novel splice events within known RNAs.


Next Generation Sequencing applications such as RNA-Seq, Tag Profiling, Whole Transcriptome Sequencing and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample, these applications are also ideal for the identification of novel RNAs and novel splicing events for known RNAs.

This presentation will provide an overview of the RNA applications using data from the NCBI's GEO database and Short Read Archive with an emphasis on converting raw data into biologically meaningful datasets. Data analysis examples will focus on methods for identifying differentially expressed genes, novel genes, differential splicing and 5’ and 3’ variation in miRNAs.

To register, please visit the event page.

Sunday, September 6, 2009

Open or Closed

A key aspect of Geospiza’s software development and design strategy is to incorporate open scientific technologies into the GeneSifter products to deliver user friendly access to best-of-breed tools used to manage and analyze genetic data from DNA sequencing, microarray, and other experiments.

Open scientific technologies include open-source and published academic algorithms, programs, databases, and core infrastructure software such as operating systems, web servers, and other components needed to build modern systems for data management. Unlike approaches that rely on proprietary software, Geospiza’s adoption of open platforms and participation in the open-source community benefits our customers in numerous ways.

Geospiza’s Open Source History

When Geospiza began in 1997, the company started building software systems to support DNA sequencing technologies and applications. Our first products focused on web-enabled data management for DNA sequencing-based genomics applications. Foundational infrastructure, such as the web server and application layer, incorporated Apache and Perl. We were also leaders in that our first systems operated on Linux, an open-source UNIX-based operating system. In those early days, however, we used proprietary databases such as Solid and Oracle because the open-source alternatives, Postgres and MySQL, still lacked features needed to support robust data processing environments. As these products matured, we extended our application support to include Postgres to deliver cost-effective solutions for our customers. By adopting such open platforms, we were able to deliver robust, high-performing systems rapidly and at a reasonable cost.

In addition to using open-source technology as the foundation of our infrastructure, we also worked with open tools to deliver our scientific applications. Our first product, the Finch Blast-Server, utilized the public domain BLAST from NCBI. Where possible, we sought to include well-adopted tools for other applications, such as base calling, sequence assembly, and repeat masking, for which the source code was made available. We favored these kinds of tools over developing our own proprietary tools because it was clear that technologies emerging from communities like the genome centers would advance much more quickly and be better tuned to the problems people were trying to address. Further, because of their wide adoption within their communities and their publication, these tools received higher levels of scrutiny and validation than their proprietary counterparts.

Times Change

In the early days, many of the genome center tools were licensed by universities. As the bioinformatics field matured, open-source models for delivering bioinformatics software became more popular. Led by NCBI, and pioneered by organizations like TIGR (now JCVI) and the Sanger Institute, the majority of useful bioinformatics programs are now delivered as open source under GPL, BSD-like, or Perl Artistic-style licenses. The authors of these programs have benefited from wider adoption of their programs and continued support from funding agencies like NIH. In some cases, other groups are extending best-of-breed technologies into new applications.

A significant benefit of the increasing use of open-source licensing is that a large number of analytical tools are readily available for many kinds of applications. Today we have robust statistical platforms like R and BioConductor and several algorithms for aligning Next Gen Sequencing (NGS) data. Because these platforms and tools are truly open source, bioinformatics groups can easily access these technologies to understand how they work and to compare other approaches to their own. This creates a competitive environment for bioinformatics tool providers that drives improvements in algorithm performance and accuracy, and the research community benefits greatly.

Design Choices

Early on, Geospiza recognized the value of incorporating tools from the academic research community into our user-friendly software systems. Such tools were being developed in close collaboration with the data production centers that were trying to solve scientific problems associated with DNA sequence assembly and analysis. Companies developing proprietary tools designed to compete with these efforts were at a disadvantage, because they did not have real-time access to the conversations between biologists, lab specialists, and mathematicians needed to quickly develop deep experience working with biologically complex data. This disadvantage continues today. Further, the closed nature of proprietary software limits the ability to publish the work and obtain the critical peer review of the code needed to ensure scientific validation.

Our work could proceed more quickly because we did not have to invest in solving the research problems associated with developing algorithms. Moreover, we did not have to invest in proving the scientific credibility of an algorithm. Instead, we could cite published references and keep our focus on solving the problems associated with delivering the user interfaces needed to work with the data. Our customers benefited by gaining easy access to best-of-breed tools and by knowing that they had a community to draw on to understand their scientific basis.

Geospiza continues its practice of adopting open, best-of-breed technologies. Our NGS systems utilize multiple tools such as MAQ, BWA, Bowtie, MapReads, and others. GeneSifter Analysis Edition utilizes routines from the R and BioConductor packages to perform the statistical computations that compare datasets from microarray and NGS experiments. In addition, we are addressing issues related to high-performance computing through our collaboration with the HDF Group and the BioHDF project. In this case we are not only adopting open-source technology, but also working with leaders in the field to make open-source contributions of our own.

When you use Geospiza’s GeneSifter products, you can be assured that you are using the same tools as the leaders in our field, gaining the benefits of reduced data analysis costs along with the advantages of community support through forums and the peer-reviewed literature.

Sunday, August 16, 2009

BioHDF on the Web

During the past spring and early summer, we presented our initial work using HDF5 technology to make next generation DNA sequencing data management scalable. The presentations are posted on the web, along with other points of interest, as listed below.

Presentations by Mark Welsh and myself can be found at SciVee: Mark presents our poster at ISMB, and I present our work at the “Sequencing, Finishing and Analysis in the Future” meeting in Santa Fe.

We also presented at this and last year’s BOSC meetings, which were held at ISMB. The abstracts and slides from both meetings are available online.

What others are thinking:
Real-time commentary on the 2009 BOSC presentation can be found at FriendFeed and in another post. The Fisheye Perspective considers how HDF5 fits with semantic web tools.

HDF in Bioinformatics:
Check out Fast5, which uses HDF5 to store sequences and base probabilities.

BioHDF in the News:
Genome Web and BioInform have published articles on HDF5, as well as articles referencing it.

FinchTalks about BioHDF from 2008 to the present include the bloginar series Scalable Bioinformatics Infrastructures with BioHDF.

HDF Software:
You can learn more about HDF5 and get the software and tools from The HDF Group's website.

Sunday, July 26, 2009

Researchers get hands-on practice using GeneSifter to analyze Next Gen data and SNPs at the UT-ORNL-KBRIN 2009 summit

As the recently published meeting report attests (1), while much of the genomics world was bemoaning the challenge of working with Next Generation DNA sequence (NGS) data, attendees at the UT-ORNL-KBRIN summit got an opportunity to play with the data first-hand. These lucky researchers, among the first in the nation to work with NGS data outside of a genome center, were participants in a hands-on education workshop presented by Dr. Sandra Porter from Digital World Biology.

This opportunity came about through a collaboration between Geospiza and Digital World Biology. Data were loaded and processed through GeneSifter’s analysis pipelines before the workshop to align data to reference sequences and perform the secondary analyses. During the workshop, Porter led researchers through typical steps in the tertiary analysis phase. Workshop participants were able to view gene lists and analyze information from both Illumina and SOLiD datasets. The analyses included: working with thumbnail graphics to see where reads map to transcripts and assess coverage, viewing the number of reads mapping to each transcript, and comparing the number of reads mapping to different genes under different conditions to investigate gene expression.
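The count-comparison step can be sketched with a simple reads-per-million normalization and a log2 fold change. The counts and library sizes below are hypothetical, and GeneSifter's actual statistics go well beyond this, but the core idea is the same.

```python
import math

# Hypothetical reads mapping to one transcript in two conditions,
# with different total mapped reads (library sizes) per sample.
counts = {"control": 150, "treated": 620}
library_size = {"control": 4_000_000, "treated": 5_000_000}

# Normalize to reads per million mapped reads (RPM) so samples of
# different sizes are comparable, then take the log2 fold change.
rpm = {c: counts[c] / library_size[c] * 1e6 for c in counts}
log2_fc = math.log2(rpm["treated"] / rpm["control"])
print(f"RPM: {rpm}, log2 fold change: {log2_fc:.2f}")
```

A positive log2 fold change indicates higher expression in the treated sample; real analyses add replicate-aware statistics to decide which changes are significant.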

An additional workshop focused on using iFinch and FinchTV to view SNP data in chromatogram files generated by Sanger sequencing and working with structures to see how a single nucleotide change can impact the structure of a protein.

1. Eric Rouchka and Julia Krushkal. 2009. "Proceedings of the Eighth Annual UT-ORNL-KBRIN Bioinformatics Summit 2009." BMC Bioinformatics 10(Suppl 7):I1. doi:10.1186/1471-2105-10-S7-I1.

Sunday, July 12, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part V: Why HDF5?

Through the course of this BioHDF bloginar series, we have demonstrated how the HDF5 (Hierarchical Data Format) platform can successfully meet current and future data management challenges posed by Next Generation Sequencing (NGS) technologies. We now close the series by discussing the reasons why we chose HDF5.

For previous posts, see:

  1. The introduction
  2. Project background
  3. Challenges of working with NGS data
  4. HDF5 benefits for working with NGS data

Why HDF5?

As previously discussed, HDF technology is designed for working with large amounts of complex data that naturally organize into multidimensional arrays. These data are composed of discrete numeric values, strings of characters, images, documents, and other kinds of data that must be compared in different ways to extract scientific information and meaning. Software applications that work with such data must meet a variety of organization, computation, and performance requirements to support the communities of researchers where they are used.

When software developers build applications for the scientific community, they must decide between creating new file formats and software tools for working with the data, or adapting existing solutions that already meet a general set of requirements. The advantage of developing software that's specific to an application domain is that highly optimized systems can be created. However, this advantage can disappear when significant amounts of development time are needed to deal with the "low-level" functions of structuring files, indexing data, tracking bits and bytes, making the system portable across different computer architectures, and creating a basic set of tools to work with the data. Moreover, such a system would be unique, with only a small set of users and developers able to understand and share knowledge concerning its use.

The alternative to building a highly optimized domain-specific application system is to find and adapt existing technologies, with a preference for those that are widely used. Such systems benefit from the insights and perspective of many users and will often have features in place before one even realizes they are needed. If a technology has widespread adoption, there will likely be a support group and knowledge base to learn from. Finally, it is best to choose a solution that has been tested by time. Longevity is a good measure of the robustness of the various parts and tools in the system.

HDF: 20 Years in Physical Sciences

Our requirements for a high-performance data management and computation system are these:

  1. Different kinds of data need to be stored and accessed.
  2. The system must be able to organize data in different ways.
  3. Data will be stored in different combinations.
  4. Visualization and computational tools will access data quickly and randomly.
  5. Data storage must be scalable, efficient, and portable across computer platforms.
  6. The data model must be self-describing and accessible to software tools.
  7. Software used to work with the data must be robust and widely used.

HDF5 is a natural fit. The file format and software libraries are used in some of the largest data management projects known to date. Because of its strengths, HDF5 is independently finding its way into other bioinformatics applications and is a good choice for developing software to support NGS.

HDF5 software provides a common infrastructure that allows different scientific communities to build specific tools and applications. Applications using HDF5 typically contain three parts: one or more HDF5 files to store data, a library of software routines to access the data, and the tools, applications and additional libraries to carry out functions that are specific to a particular domain. To implement an HDF5-based application, a data model must be developed along with application-specific tools such as user interfaces and unique visualizations. While implementation can be a lot of work in its own right, the tools to implement the model and provide scalable, high-performance programmatic access to the data have already been developed, debugged, and delivered through the HDF I/O (input/output) library.

In earlier posts, we presented examples where we needed to write software to parse fasta formatted sequence files and output files from alignment programs. These parsers then called routines in the HDF I/O library to add data to the HDF5 file. During the import phase, we could set different compression levels and define the chunk size to compress our data and optimize access times. In these cases, we developed a simple data model based on the alignment output from programs like BWA, Bowtie, and MapReads. Most importantly, we were able to work with NGS data from multiple platforms efficiently, with software that required weeks of development rather than the months and years that would be needed if the system was built from scratch.
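The import step described above can be sketched in plain Python. This is a minimal, hypothetical illustration of the idea, not our actual import code: parse FASTA records, pack the reads into fixed-size chunks, and compress each chunk, which is the chunking-plus-compression pattern that the HDF I/O library provides natively. The chunk size and compression level are tunable, just as they are in HDF5.

```python
import zlib

def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, seq = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:].strip(), []
        else:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)

def store_chunked(reads, chunk_size=2, level=6):
    """Compress reads chunk-by-chunk; return the list of compressed chunks."""
    chunks = []
    for i in range(0, len(reads), chunk_size):
        block = "\n".join(reads[i:i + chunk_size]).encode()
        chunks.append(zlib.compress(block, level))
    return chunks

def load_chunk(chunks, chunk_index):
    """Decompress only the requested chunk -- random access without a full scan."""
    return zlib.decompress(chunks[chunk_index]).decode().split("\n")

fasta = ">r1\nACGT\n>r2\nGGGA\n>r3\nTTCA\n"
reads = [seq for _, seq in parse_fasta(fasta)]
chunks = store_chunked(reads)
print(load_chunk(chunks, 1))
```

Because each chunk compresses and decompresses independently, reading one chunk never requires touching the others; HDF5 applies the same principle at scale, with B-tree indices locating the chunks.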

While HDF5 technology is powerful "out-of-the-box," a number of features can still be added to make it better for bioinformatics applications. The BioHDF project is about making such domain-specific extensions. These are expected to include modifications to the general file format to better support variable-length data such as DNA sequences. I/O library extensions will be created to help HDF5 "speak" bioinformatics by creating APIs (Application Programming Interfaces) that understand our data. Finally, sets of command line programs and other tools will be created to help bioinformatics groups get started quickly with using the technology.

To summarize, the HDF5 platform is well-suited for supporting NGS data management and analysis applications. Using this technology, groups will be able to make their data more portable for sharing because the data model and data storage are separated from the implementation of the model in the application system. HDF5's flexibility in the kinds of data it can store makes it easier to integrate data from a wide variety of sources. Integrated compression utilities and data chunking make HDF5-based systems highly scalable. Finally, because the HDF5 I/O library is extensive and robust, and the HDF5 tool kit includes basic command-line and GUI tools, the platform allows for rapid prototyping and reduced development time, making it easier to create new approaches for NGS data management and analysis.

For more information, or if you are interested in collaborating on the BioHDF project, please feel free to contact me (todd at

Monday, July 6, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part IV: HDF5 Benefits

Now that we're back from Alaska and done with the 4th of July fireworks, it's time to present the next installment of our series on BioHDF.

HDF highlights
HDF technology is designed for working with large amounts of scientific data and is well suited for Next Generation Sequencing (NGS). Scientific data are characterized by very large datasets that contain discrete numeric values, images, and other data, collected over time from different samples and locations. These data naturally organize into multidimensional arrays. To obtain scientific information and knowledge, we combine these complex datasets in different ways and (or) compare them to other data using multiple computational tools. One difficulty that plagues this work is that the software applications and systems for organizing the data, comparing datasets, and visualizing the results are complicated, resource intensive, and challenging to develop. Many of the development and implementation challenges can be overcome using the HDF5 file format and software library.

Previous posts have covered:
1. An introduction
2. Background of the project
3. Complexities of NGS data analysis and performance advantages offered by the HDF platform.

HDF5 changes how we approach NGS data analysis.

As previously discussed, the NGS data analysis workflow can be broken into three phases. In the first phase (primary analysis), images are converted into short strings of bases. Next, the bases, represented individually or encoded as di-bases (SOLiD), are aligned to reference sequences (secondary analysis) to create derivative data types, such as contigs or annotated tables of alignments, that are further analyzed in comparative ways (tertiary analysis). Quantitative analysis applications, like gene expression, compare the results of secondary analyses between individual samples to measure differential gene expression, identify mRNA isoforms, or make other observations based on a sample’s origin or treatment.

The alignment phase of the data analysis workflow creates the core information. During this phase, reads are aligned to multiple kinds of reference data to understand sample and data quality, and obtain biological information. The general approach is to align reads to sets of sequence libraries (reference data). Each library contains a set of sequences that are annotated and organized to provide specific information.

Quality control measures can be added at this point. One way to measure data quality is to ask how many reads were obtained from constructs without inserts. Aligning the read data to a set of primers (individually and joined in different ways) that were used in the experiment allows us to measure the number of reads that match and how well they match. A higher-quality dataset will have a larger proportion of sequences matching our sample and a smaller proportion of sequences that match only the primers. Similarly, different biological questions can be asked using libraries constructed of sequences that have biological meaning.
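The QC measure above can be sketched with a toy example. This is a hypothetical illustration: a real pipeline would use an aligner with mismatch tolerance, whereas exact substring matching stands in for alignment here, and the reads and primers are invented.

```python
def primer_only_fraction(reads, primers):
    """Fraction of reads fully accounted for by primer sequence alone."""
    targets = set(primers)
    # primers joined pairwise, in both orders, mimic insert-free constructs
    for a in primers:
        for b in primers:
            targets.add(a + b)
    # a read counts as primer-only if it is contained in some primer/dimer
    hits = sum(1 for r in reads if any(r in t for t in targets))
    return hits / len(reads)

reads = ["ACGTACGT", "TTTTGGGG", "ACGT", "CCAACCAA"]
primers = ["ACGT", "TTTT"]
print(primer_only_fraction(reads, primers))
```

A high value here would flag a library dominated by empty constructs; the same counting logic extends to any biologically meaningful sequence library.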

Aligning reads to sequence libraries is the easy part. The challenge is analyzing the alignments. Because the read datasets in NGS assays are large, organizing alignment data into forms we can query is hard. The problem is simplified by setting up multistage alignment processes as a set of filters. That is, reads that match one library are excluded from the next alignment. Differential questions are then asked by counting the numbers of reads that match each library. With this approach, each set of alignments is independent of the other alignments and a program only needs to analyze one set of alignments at a time. Filter-based alignment is also used to distinguish reads with perfect matches from those with one or more mismatches.
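The filter-based multistage approach can be sketched as a few lines of Python. This is a toy model: exact set membership stands in for alignment, and the library names and contents are invented for illustration.

```python
def filter_pipeline(reads, libraries):
    """Run reads through ordered libraries; matched reads are excluded
    from every later stage, and per-library counts are returned."""
    remaining = list(reads)
    counts = {}
    for name, library in libraries:
        matched = [r for r in remaining if r in library]
        counts[name] = len(matched)
        # the filter step: matched reads never reach the next library
        remaining = [r for r in remaining if r not in library]
    counts["unmatched"] = len(remaining)
    return counts

reads = ["AAA", "CCC", "GGG", "AAA", "TTT"]
libraries = [
    ("primers", {"AAA"}),
    ("rRNA", {"CCC", "GGG"}),
]
print(filter_pipeline(reads, libraries))
```

The per-library counts answer the differential questions directly, but note what is discarded: once a read matches "primers," any hit it might also have had in "rRNA" is never observed, which is exactly the information loss discussed next.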

Still, filter-based alignment approaches have several problems. When new sequence libraries are introduced, the entire multistage alignment process must be repeated to update results. Next, information about reads that have multiple matches in different libraries, or both perfect and imperfect matches within a library, is lost. Finally, because alignment formats differ between programs and good methods for organizing alignment data do not exist, it is hard to compare alignments between multiple samples. This last issue also creates challenges for linking alignments to the original sequence data and formatting information for other tools.

As previously noted, solving the above problems requires that alignment data be organized in ways that facilitate computation. HDF5 provides the foundation to organize and store both read and alignment data to enable different kinds of data comparisons. This ability is demonstrated by the following two examples.

In the first example (left), reads from different sequencing platforms (SOLiD, Illumina, 454) were stored in HDF5. Illumina RNA-Seq reads from three different samples were aligned to the human genome and annotations from a UCSC GFF (General Feature Format) file were applied to define gene boundaries. The example shows the alignment data organized into three HDF5 files, one per sample, but in reality the data could have been stored in a single file or files organized in other ways. One of HDF's strengths is that the HDF5 I/O library can query multiple files as if they were a single file, providing the ability to create the high-level data organizations that are the most appropriate for a particular application or use case. With reads and alignments structured in these files, it is a simple matter to integrate data to view base (color) compositions for reads from different sequencing platforms, compare alternative splicing between samples, and select a subset of alignments from a specific genomic region, or gene, in a "wig" format for viewing in a tool like the UCSC genome browser.
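The last step above, exporting a region's alignments as "wig" coverage, can be sketched as follows. The chromosome name, coordinates, and alignments are invented for illustration; the wig `fixedStep` header format follows the UCSC convention of 1-based start positions.

```python
def region_to_wig(alignments, chrom, start, end):
    """Per-base read coverage for [start, end) as wig fixedStep lines."""
    coverage = [0] * (end - start)
    for aln_chrom, aln_start, aln_end in alignments:
        if aln_chrom != chrom:
            continue
        # add 1 to every covered base that falls inside the window
        for pos in range(max(aln_start, start), min(aln_end, end)):
            coverage[pos - start] += 1
    lines = ["fixedStep chrom=%s start=%d step=1" % (chrom, start + 1)]
    lines.extend(str(c) for c in coverage)
    return "\n".join(lines)

alignments = [("chr5", 100, 104), ("chr5", 102, 106), ("chr1", 100, 104)]
print(region_to_wig(alignments, "chr5", 100, 106))
```

With alignments stored in HDF5, the inner loop would read only the slab of alignment records overlapping the requested region rather than iterating over a whole file.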

The second example (right) focuses on how organizing alignment data in HDF5 can change how differential alignment problems are approached. When data are organized according to a model that defines granularity and relationships, it becomes easier to compute all alignments between reads and multiple reference sources than to think about how to perform differential alignments and implement the process. In this case, a set of reads (obtained from cDNA) is aligned to primers, QC data (ribosomal RNA [rRNA] and mitochondrial DNA [mtDNA]), miRBase, refseq transcripts, the human genome, and a library of exon junctions. During alignment, up to three mismatches are tolerated between a read and its hit. Alignment data are stored in HDF5 and, because the data were not filtered, a greater variety of questions can be asked. Subtractive questions mimic the differential pipeline where alignments are used to filter reads from subsequent steps. At the same time, we can also ask "biological" questions about the number of reads that came from rRNA or mtDNA or from genes in the genome or exon junctions. And for these questions, we can examine the match quality between each read and its matching sequence in the reference data sources, without having to reprocess the same data multiple times.
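The unfiltered organization can be sketched with a toy alignment store. Everything here is invented for illustration: read IDs, library names, and mismatch counts stand in for the real HDF5-resident alignment records, but the point carries over, with all alignments retained, subtractive and "biological" questions are both simple queries over the same store.

```python
# read id -> {library: mismatches in its best hit against that library}
alignments = {
    "r1": {"rRNA": 0},
    "r2": {"rRNA": 1, "genome": 0},
    "r3": {"refseq": 0, "genome": 2},
    "r4": {},  # aligned nowhere
}

def reads_hitting(lib):
    """All reads with at least one hit in the named library."""
    return {r for r, hits in alignments.items() if lib in hits}

# Subtractive question (mimics the filter pipeline): genome hits that
# were not already explained by rRNA.
genome_only = reads_hitting("genome") - reads_hitting("rRNA")

# Biological question over the same store: perfect genome matches.
perfect_genome = {r for r, h in alignments.items() if h.get("genome") == 0}

print(sorted(genome_only), sorted(perfect_genome))
```

Note that r2 contributes to the perfect-genome count even though a filter pipeline would have removed it at the rRNA stage; keeping all alignments is what makes both answers available without re-running anything.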

The above examples demonstrate the benefits of being able to organize data into structures that are amenable to computation. When data are properly structured, new approaches that expand the ways in which data are analyzed can be implemented. HDF5 and its library of software routines move the development process from activities associated with optimizing the low level infrastructures needed to support such systems to designing and testing different data models and exploiting their features.

The final post of this series will cover why we chose to work with HDF5 technology.

Monday, June 22, 2009

Sneak Peek: RNA-Seq - Global Profiling of Gene Activity and Alternative Splicing

Join us June 30 at 10:00 am PDT. Eric Olson, Geospiza's VP of Product Development, will present an interesting webinar on using RNA-Seq to measure gene expression and discover alternatively spliced messages using GeneSifter Analysis Edition.


Next Generation Sequencing applications such as RNA-Seq, Tag Profiling and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample, these applications are also ideal for the identification of novel RNAs and novel splicing events for known RNAs.

This presentation will provide an overview of the RNA-Seq data analysis process with emphasis on calculating gene and exon level expression values as well as identifying splice junctions from short read data. Comparisons of multiple groups to identify differential gene expression as well as differential splicing will also be discussed. Using data drawn from the GEO data repository and Short Read Archive (SRA), analysis examples will be presented for both Illumina’s GA and ABI’s SOLiD instruments.

To register visit the Geospiza webex event page.

Monday, June 15, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part III: The HDF5 Advantage

The Next Generation DNA Sequencing (NGS) bioinformatics bottleneck is related to the complexity of working with the data, analysis programs, and numerous output files that are produced as the data are converted from images to final results. Current systems lack well-organized data models and corresponding infrastructures for storing data and analysis information, resulting in significant levels of data processing, reprocessing, and data-copying redundancy. Systems can be improved with data management technologies like HDF5.

In this third installment of the bloginar, results from our initial work with HDF5 are presented. Previous posts have provided an introduction to the series and background on NGS.

Working with NGS data

With the exception of de novo sequencing, in which novel genomes, or transcriptomes, are analyzed, many NGS applications can be thought of as quantitative assays where DNA sequences are highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data to obtain quantitative information by tabulating the frequency, positional information, and variation of the bases between sequences in the alignments. Data tables from samples that differ by experimental treatment, environment, or populations, are compared in different ways to make additional discoveries and draw conclusions. Whether the assay is to measure gene expression, study small RNAs, understand gene regulation, or quantify genetic variation, a similar process is followed:
  1. A large dataset of sequence reads is collected in a massively parallel format.
  2. The reads are aligned to reference data.
  3. Alignment data, stored in multiple kinds of output files, are parsed, reformatted, and computationally organized to create assay specific reports.
  4. The reports are reviewed and decisions are made for how to work with the next sample, experiment, or assay.
Current practices, in which multiple programs create different kinds of information formatted in different ways, make this process difficult even for single samples. Achieving our main goal of comparing the analyses for multiple samples at a time is harder still. Presently, the four steps listed above must be repeated for each sample, and multiple reports that list expression values for single genes or describe positional frequencies of read density must be combined in some fashion to create new views that summarize or compute the differences and similarities between datasets. In the case of gene expression, for example, volcano plots can be used to compare the observed changes in gene expression with the likelihood that those changes are statistically significant. For a given gene, one might also want to drill into details that show how the reads align to the gene’s reference sequence. Further, the alignments for that gene from different samples need to be compared to see if there is evidence of alternative splicing or other interesting features.
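The two numbers behind each point on a volcano plot can be computed from per-gene read counts. This sketch uses invented counts and a two-proportion z-test, a deliberately simple choice; production RNA-Seq pipelines typically use more appropriate count-based models.

```python
import math

def volcano_point(count_a, total_a, count_b, total_b):
    """Return (log2 fold change, two-sided p-value) for one gene, given
    its read count and the total mapped reads in each of two samples."""
    rate_a, rate_b = count_a / total_a, count_b / total_b
    log2_fc = math.log2(rate_b / rate_a)
    # two-proportion z-test on the normalized rates
    pooled = (count_a + count_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return log2_fc, p_value

# a gene with 50 of 1M mapped reads in sample A vs 200 of 1M in sample B
fc, p = volcano_point(50, 1_000_000, 200, 1_000_000)
print(round(fc, 2), p < 0.001)
```

Plotting log2 fold change on the x-axis against -log10(p) on the y-axis for every gene gives the familiar volcano shape, from which interesting genes are selected for the drill-down views described above.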

Creating software that allows one to view NGS results in single- and multi-sample contexts and drill into multiple levels of detail, that operates quickly and smoothly, and that makes it possible for IT administrators and PIs to predict development and research costs, requires us to store raw data and corresponding analysis results in structures that support the computational needs of the problems being addressed. To accomplish this goal, we can either develop a brand new infrastructure to support our technical requirements and build new software to support our applications, or we can build software on an existing infrastructure and benefit from the experience gained in solving similar problems in other scientific fields.

Geospiza is following the latter path and is using an open-source technology, HDF5 (hierarchical data format), to develop highly scalable bioinformatics applications. Moreover, as we have examined past practices and considered our present and future challenges, we have concluded that technologies like HDF5 have great benefits for the bioinformatics field. Toward that end, Geospiza has initiated a research program collaborating with The HDF Group to develop extensions to HDF5 that meet specific requirements for genetic analysis.

HDF5 Advantages

Introducing a new technology into an infrastructure requires work. Existing tools need to be refactored and new development practices must be learned. The cost of switching over to new technology has direct development costs associated with refactoring tools and learning new environments, as well as a time lag between learning the system and producing new features. Justifying such a commitment demands a return on investment. Hence, the new technology must offer several advantages over current practices, such as improved system performance and (or) new capabilities that are not easily possible with existing approaches. HDF5 offers both.

With HDF5 technology, we will be able to create better performing NGS data storage and high performance data processing systems, and approach data analysis problems differently.

We'll consider system performance first. Current NGS systems store reads and associated data primarily in text-based flat files. Additionally, the vast majority of alignment programs also store data in text-based flat files, creating the myriad of challenges described earlier. When these data are, instead, stored in HDF5, a number of improvements can be achieved. Because the HDF5 software library and file format can store data as compressed “chunks,” we can reduce storage requirements and access subsets of data more efficiently. For example, read data can be stored in arrays making it possible to quickly compute values like nucleotide frequency statistics for each base position in the reads from an entire multimillion read dataset.
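The per-base statistic mentioned above is straightforward once reads sit in a fixed-width array. A small sketch, with invented reads standing in for the millions of rows a real dataset would hold:

```python
def base_frequencies(reads):
    """Per-position counts of A/C/G/T across equal-length reads."""
    length = len(reads[0])
    counts = [{"A": 0, "C": 0, "G": 0, "T": 0} for _ in range(length)]
    for read in reads:
        for pos, base in enumerate(read):
            counts[pos][base] += 1
    return counts

reads = ["ACGT", "AAGT", "ACGA"]
freqs = base_frequencies(reads)
print(freqs[0])   # counts for the first base position
```

In an HDF5-backed version, the reads would be a chunked two-dimensional dataset and the tally could proceed chunk-by-chunk, so the whole dataset never needs to be resident in memory.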

In the example presented, 9.3 million Illumina GA reads were stored in HDF5 as a compressed two dimensional array resulting in a four fold reduction in size when compared to the original fasta formatted file. When the reads were aligned to a human genome reference, the flat file system grew from 609 MB to 1033 MB. The HDF5-based system increased in size by 230 MB to a total of 374 MB for all data and indices combined. In this simple example, the storage benefits of HDF5 are clear.

We can also demonstrate the benefits of improving the efficiency of accessing data. A common bioinformatics scenario is to align a set of sequences (queries) to a set of reference sequences (subjects) and then examine how the query sequences compare to the subject sequence within a specific range. Software routines accomplish this operation by getting the name (or ID) of a subject sequence along with the beginning and ending positions of the desired range(s). This information is used first to search the set of alignments for the names (or IDs) of the matching query sequences, along with each query’s beginning and ending positions within the alignment. Next, the dataset of query sequences is searched to retrieve the matching data. When the data are stored in a non-indexed flat file, the entire file must be read to find the matching sequences; this takes, on average, half of the time needed to read the entire file. In contrast, indexed data can be accessed in significantly less time. The shorter time derives from two features: 1. A smaller amount of data needs to be read to conduct the search, and 2. Structured indices make searches more efficient.
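The indexed range query can be sketched with the standard-library `bisect` module playing the role of an index. The alignments and read IDs are invented; the point is that a binary search over sorted start positions touches only the rows in the requested range, while a flat file would have to be scanned end to end.

```python
import bisect

# (start, end, read_id) sorted by start -- one subject sequence's alignments
alignments = [(5, 40, "r1"), (100, 135, "r2"), (120, 155, "r3"), (900, 935, "r4")]
starts = [a[0] for a in alignments]

def query_range(begin, end):
    """IDs of alignments whose start falls in [begin, end) -- indexed lookup."""
    lo = bisect.bisect_left(starts, begin)   # binary search, O(log n)
    hi = bisect.bisect_left(starts, end)
    return [alignments[i][2] for i in range(lo, hi)]

print(query_range(100, 200))
```

The cost of `query_range` scales with log(n) plus the number of hits returned, not with the total number of alignments, which is the behavior reflected in the timing results below as the queried region shrinks.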

In our example, the 9.3 million reads produced many millions of alignments when the data were compared to the human genome. We tested the performance for retrieving read alignment information from different kinds of file systems by accessing the alignments from successively smaller regions of chromosome 5. The entire chromosome contained roughly one million alignments. Retrieving the reads from the entire chromosome was slightly more efficient in HDF5 than retrieving the same data from the flat file system. However, as fewer reads were retrieved from smaller regions, the HDF5-based system demonstrated significantly better performance. For HDF5, the time to retrieve reads decreased as a function of the amount of data being retrieved down to 15 ms, the minimum overhead of running the program that accesses the HDF5 file. When compared to the minimum access time for the flat file (735 ms), a ~50 fold improvement is observed. As datasets continue to grow, the overhead for using the HDF5 system will remain at 15 ms, whereas the overhead for flat file systems will continue to increase.

The demonstrated performance advantages are not unique to HDF5. Similar results can be achieved by creating a data model to store the reads and alignments and implementing the model in a binary file format with indices to access the stored data in random ways. A significant advantage of HDF5 is that the software to implement the data models, build multiple kinds of indices, compress data in chunks, and read and write the data to and from the file has already been built, debugged, and supported by over 20 years of development. Hence, one of the most significant performance advantages associated with using the HDF platform is the savings in development time. To reproduce a similar, perhaps more specialized, system would require many months (even years) to develop, test, document, and refine the low-level software needed to make the system well-performing, highly scalable, and broadly usable. In our experience with HDF5, we’ve been able to learn the system, implement our data models, and develop the application code in a matter of weeks.

Consequently, we are spending more of our time solving the interesting challenges associated with analyzing millions of NGS reads from 100's or 1000's of samples to measure gene expression, identify alternatively spliced and small RNAs, study regulation, calculate sequence variation, and link summarized data to its underlying details, and we are spending a much smaller fraction of our time optimizing low-level infrastructures.

Additional examples of how HDF5 is changing our thinking will be presented next.