FinchTalk: March 2008

Monday, March 31, 2008

Next Gen, Next Step

Congratulations! You just got approval to purchase your next generation sequencer! What are you going to do next?

Today, there is a lot being written about the data deluge accompanying Next Gen sequencers. It's true, they produce a lot of data. But even more important are the questions about how you plan to set up the lab and data workflows to turn those precious samples into meaningful information. The IT problems, while significant, are only the tip of the iceberg. If you operate a single lab, you will need to think about your experiments, how to track your samples, how to prepare DNA for analysis, how to move the data around for analysis, and how to do your analyses to get meaningful information out of the data. If you operate a core lab, you have all the same problems, but you're providing that service for a whole community of scientists. You'll need to keep their samples and data separated and secure. You also have to figure out how to get the data to your customers and how you might help them with their analyses.

Never mind that you need multiple terabytes of storage and a computer cluster. Without a plan and strategy for running your lab, organizing the data, running multistep analysis procedures, and sifting through hundreds of thousands of alignments, you'll just end up with a piece of lab art: a Next Gen sequencer, a big storage system and a computer cluster. (By the way, have you found a place for this yet?) It may look nice, but that's probably not what you had in mind.

To get the most out of your investment, you'll need to think about workflows, and how to manage those workflows.

The cool thing about Next Gen technology is the kinds of questions that can be asked with the data. This requires both novel ways to work with DNA and RNA and novel ways to work with the data. We call those procedures "workflows." Simply put, a workflow describes a multistep procedure and its decision points. In each step, we work with materials and the materials may be "transformed" in the step. You can also describe a workflow as a series of steps that have inputs and outputs. Workflows are run both in the lab and on the computer.

In a protocol for isolating DNA, we can take tissue (the input), lyse the cells with detergent, bind the DNA to a resin, wash away junk, and elute purified DNA (the output). The purified DNA may then become an input to a next step, like PCR, to create an output, like a collection of amplicons. Similar processes can be used with RNA. In a Next Gen lab workflow, you fragment the DNA, ligate adaptors, and use the adaptors to attach DNA to beads or regions of a slide. From a few basic lab workflows, we can prepare genetic material for whole genome analysis, expression analysis, variation analysis, gene regulation, and other experiments in both discovery and diagnostic assays.

In a software workflow, data are the material. Input data, typically packaged in files, are processed by programs to create output data. These data or information can also be packaged in files or even stored in databases. Software programs execute the steps, and scripts often automate series of steps. Digital photography, multimedia production, and business processes all have workflows. So does bioinformatics. The difference is that bioinformatics workflows lack standards, so many people work harder than needed and spend a lot of time debugging things.
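As a minimal sketch of the "steps with inputs and outputs" idea, the Python below chains two steps so that the output of one becomes the input of the next. The step names and the toy reads are hypothetical and only illustrate the pattern; they are not part of any Finch product.

```python
from typing import Callable, List

# A step takes input data and returns output data.
Step = Callable[[List[str]], List[str]]

def trim_reads(reads: List[str]) -> List[str]:
    """Hypothetical step: drop trailing 'N' characters from each read."""
    return [r.rstrip("N") for r in reads]

def filter_short(reads: List[str]) -> List[str]:
    """Hypothetical step: keep reads of at least 4 bases."""
    return [r for r in reads if len(r) >= 4]

def run_workflow(data: List[str], steps: List[Step]) -> List[str]:
    """Feed the output of each step into the next step."""
    for step in steps:
        data = step(data)
    return data

reads = ["ACGTNN", "ACN", "GGGTTT"]
result = run_workflow(reads, [trim_reads, filter_short])
print(result)  # ['ACGT', 'GGGTTT']
```

Keeping each step as a separate function with a single input and output is what makes it possible to reorder, replace, or track steps later.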

As the scale increases, the lab and analysis workflows must be managed together.

A common laboratory practice has been to collect the data, and then analyze the data in separate independent steps. Lab work is often tracked on paper, in Excel spreadsheets, or in a LIMS (Laboratory Information Management System). The linkage between lab processes, raw data, and final results, is typically poor. In small projects, this is manageable. File naming conventions can track details and computer directories (folders) can be used to organize data files. But as the scale grows, the file names get longer and longer, people spend considerable time moving and renaming data, the data start to get mixed up, become harder to find, and for some reason files start to replicate themselves. Now, the lab investigates tracking problems and lost data, instead of doing experiments.

Why? Because the lab and data analysis systems are disconnected.

The good news is that Geospiza Finch products can link your lab procedures and the data handling procedures to create complete workflows for genetic analysis.

Monday, March 24, 2008

Color Space, Flow Space, Sequence Space or Outer Space: Part II, Uncertainty in Next Gen Data

Next generation sequencing will transform sequencing assays and experiments. Understanding how the data are generated is important for interpreting results.

In my last post, I discussed how all measurement systems have uncertainty (error) and how error probability is determined in Sanger sequencing. In this post, I discuss the current common next generation (Next Gen) technologies and their measurement uncertainty.

The Levels of Data - To talk about Next Gen data and errors, it is useful to have a framework for describing these data. In 2005, the late Jim Gray and colleagues published a report addressing scientific data management in the coming decade. The report defines data in three levels (Level 0, 1, and 2). Raw data (Level 0), the immediate product of analytical instruments, must be calibrated and transformed into Level 1 data. Level 1 data are then combined with other data to facilitate analysis (Level 2 datasets). Interesting science happens when we make different combinations of Level 2 data and have references to Level 1 data for verification.

How does our data level framework apply to the DNA sequencing world?

In Sanger sequencing, Level 0 data are the raw analog signal data that are collected from the laser and converted to digital information. Digital signals, displayed as waveforms, are mobility corrected and basecalled to produce Level 1 data files that contain information about the collection event, the DNA sequence (read), quality values, and averaged intensity signals. Read sequences are then combined together, or with reference data, to produce Level 2 data in the form of aligned sequence datasets. These Level 2 data have many context specific meanings. They could be a list of annotations, a density plot in a genome browser view, an assembly layout, or a panel of discrepancies that use quality values (from Level 1) to distinguish true genetic variation from errors (uncertainty) in the data.

Next Gen data are different.

When the "levels of data" framework is used to explore Next Gen sequencing technology, we see fundamental differences between Level 0 and Level 1 data. In Sanger sequencing, we perform a reaction to create a mixture of fluorescently tagged molecules that differ in length by a single base. The mixture is then resolved by electrophoresis with continual detection of these size separated tagged DNA molecules to ultimately create a DNA sequence. Uncertainties in the basecalls are related to electrophoresis conditions, compressions due to DNA secondary structure, and the fact that some positions have mixed bases because the sequencing reactions contain a collection of molecules.

In Next Gen sequencing, molecules are no longer separated by size. Sequencing is done "in place" on DNA molecules that were amplified from single molecules. These amplified DNA molecules are either anchored to beads (that are later randomly bound to slides or picoliter wells) or anchored to random locations on slides. Next Gen reactions then involve multiple cycles of chemical reaction (flow) followed by detection. The sample preparation methods and reaction/detection cycles are where the current Next Gen technologies greatly differ and have their unique error profiles.

Next Gen Level 0 data change from electropherograms to huge collections of images produced through numerous cycles of base (or DNA) extensions and image analysis. Unlike Sanger sequencing, where both raw and processed data can be stored relatively easily in single files, Next Gen Level 0 data sets can be quite large and multiple terabytes can be created per run. Presently there is active debate on the virtues of storing Level 0 data (the Data Life Cycle). The Level 0 image collections are used to create Level 1 data. Level 1 data are a continuum of information that is ultimately used to produce reads and quality values. Next Gen quality values reflect the basic underlying sequencing chemistry and primary error modes, and thus calculations need to be optimized for each platform. For those of you who are familiar with phred, these are analogous to the settings in the phredpar.dat file. The other feature of Level 1 data is that they can be expressed differently.

The Spaces

Flow Space - The Roche 454 technology is based on pyrosequencing [1], and the measured signals are a function of how many bases are incorporated in a base addition cycle. In pyrosequencing, we measure the pyrophosphate (PPi) that is released when a nucleotide is added to a growing DNA chain. If multiple bases are added in a cycle, two or more A's for example, proportionally more PPi is released and the light detected is more intense. As more bases are added in a row, the relative increase in light decreases; an 11/10 change ratio, for example, is much lower than 2/1. Consequently, when there are longer sequences with the same base (e.g. AAAAA), it becomes harder to count the number of bases accurately, and the error rate increases. Flow space describes a sequence in terms of base incorporations. That is, the data are represented as an ordered string of bases plus the number of bases at each base site. The 454 software performs alignment calculations in flow space.
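To make the "bases plus the number of bases at each site" representation concrete, here is a small Python sketch. It is a simplification that I am assuming for illustration: it collapses a base string into runs and does not model the instrument's actual flow order or signal processing. The second function shows why long runs are hard to count, assuming light is strictly proportional to the number of bases incorporated.

```python
from itertools import groupby

def to_flow_space(seq):
    """Collapse a base string into (base, run_length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(seq)]

def relative_signal_step(n):
    """Ratio of the signal for n bases to the signal for n - 1 bases,
    assuming signal is proportional to the number of bases."""
    return n / (n - 1)

flows = to_flow_space("TAAAAACGG")
print(flows)                     # [('T', 1), ('A', 5), ('C', 1), ('G', 2)]
print(relative_signal_step(2))   # 2.0
print(relative_signal_step(11))  # 1.1
```

Going from one A to two doubles the signal, while going from ten to eleven raises it by only 10 percent, which is the 2/1 versus 11/10 comparison above.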

Color Space - The Applied Biosystems SOLiD technology uses DNA ligation [2] with collections of synthetic DNA molecules (oligos) that contain two nested known bases with a fluorescent dye. In each cycle these two bases are read at intervals of five bases. Getting full sequence coverage (25, 35, or more bases), requires multiple ligation and priming cycles such that each base is sequenced twice. The SOLiD technology uses this redundancy to decrease the error probability. The Level 1 sequence is reported in a “color” space, where the numbers 0,1,2,3 are used to represent one of the fluorescent colors. Color space must be decoded into a DNA sequence at the final stage of data processing. Like flow space, it is best to perform alignments using data in color space. Since each color represents two bases, decoding color space requires that the first base be known. This base came from the adapter.
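The decoding step, and the reason the first base must be known, can be sketched in a few lines of Python. I am assuming the commonly described two-base encoding in which A, C, G, T are numbered 0 to 3 and each color is the XOR of two adjacent base numbers; the function names are my own and this is not the vendor's software.

```python
BASES = "ACGT"  # assumed numbering: A=0, C=1, G=2, T=3

def encode_colors(seq):
    """Return (first_base, colors), one color per adjacent base pair."""
    codes = [BASES.index(b) for b in seq]
    colors = [a ^ b for a, b in zip(codes, codes[1:])]
    return seq[0], colors

def decode_colors(first_base, colors):
    """Rebuild the base string from a known first base and the colors."""
    prev = BASES.index(first_base)
    out = [first_base]
    for c in colors:
        prev ^= c              # each color converts the previous base into the next
        out.append(BASES[prev])
    return "".join(out)

first, colors = encode_colors("ATGGCA")
print(colors)                        # [3, 1, 0, 3, 1]
print(decode_colors(first, colors))  # ATGGCA

# One wrong color changes every base decoded after it.
bad = colors.copy()
bad[1] = 2
print(decode_colors(first, bad))     # ATCCGT
```

The last example hints at why alignment is better done in color space: a single measurement error stays a single mismatch there, but after decoding it corrupts the rest of the read.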

Sequence Space - Illumina’s Solexa technology uses single base extensions of fluorescent-labeled nucleotides with protected 3'-OH groups. After base addition and detection, the 3'-OH is deprotected and the cycle repeated. The error probabilities are calculated in different ways by first analyzing the raw intensity files (remember firecrest and bustard) and then by alignment with the ELAND program to compute an empirical probability error. With Solexa data, errors occur more frequently at the ends of reads.

So, what does this all mean?

Next Gen technologies are all different in the way data are produced, they all produce different kinds of Level 1 data, and the Level 1 data are best analyzed within their own unique "space." This can have data integration implications depending on how you might want to combine data sets (Level 1 vs Level 2). Clearly, it is important to have easy access to quality values and error rates to help troubleshoot issues with runs and samples. The challenge is sifting out the important data from a morass of sequence files, quality files, data encodings, and other vagaries of the instruments and their software. Geospiza is good at this and we can help.

In terms of science, many assays and experiments, like Tag and Count, or whole genome assembly, can use redundancy to overcome random data errors. Data redundancy can also be used to develop rules to validate variations in DNA sequences. But, this is the tip of the iceberg. With Next Gen's extremely high sampling rates and molecular resolution, we can begin to think of finding very rare differences in complex mixtures of DNA molecules and use sequencing as a quantitative assay. In these cases, understanding how the data are created and their corresponding uncertainties is a necessary step toward making full use of the data. The good news is that there is some good work being done in this area.

Remember, the differences we measure must be greater than the combined uncertainty of our measurements.

Further reading
1. Ronaghi, M., M. Uhlen, and P. Nyren, "A sequencing method based on real-time pyrophosphate." Science, 1998. 281(5375): p. 363, 365.

2. Shendure, J., G.J. Porreca, N.B. Reppas, et al., "Accurate multiplex polony sequencing of an evolved bacterial genome." Science, 2005. 309(5741): p. 1728-32.

3. "Quality scores and SNP detection in sequencing-by-synthesis systems." http://www.genome.org/cgi/content/abstract/gr.070227.107v2

4. "Scientific data management in the coming decade." http://research.microsoft.com/research/pubs/view.aspx?tr_id=860

5. Solexa quality values. http://rcdev.umassmed.edu/pipeline/Alignment%20Scoring%20Guide%20and%20FAQ.html
http://rcdev.umassmed.edu/pipeline/What%20do%20the%20different%20files%20in%20an%20analysis%20directory%20mean.html

Monday, March 17, 2008

Color Space, Flow Space, Sequence Space, or Outer Space: Part I. Uncertainty in DNA Sequencing

Next generation DNA sequencing introduces new concepts like color space, flow space, and sequence space. You might ask, what's a space? How do I deal with these spaces? Why are they important?

In this two part blog, I will first talk about error analysis in DNA sequencing. Next I will talk about how we might think about error analysis in next generation sequencing.

Last week I came across a story about an MIT physics professor, Walter Lewin, who captivates his student audiences with his lectures and creative demonstrations. MIT and iTunes have 100 of his lectures on line. I checked out the first one - your basic first college physics lecture that focuses on measurement and dimensional analysis - and agree, Lewin is captivating. I watched the entire lecture, and it made me think about DNA sequencing.

In the lecture, Lewin proves that "physics works!" and that his grandmother was right when she said that you are an inch taller when lying down than when standing up. He used a student subject and measured his length lying down and standing up. Sure enough, the student was an inch longer lying down. But that was not the point. The point was - Lewin proved his grandmother was right because the change in the student's length was greater than the uncertainty of his measuring device (the ruler). Every measurement we make has uncertainty, or error, and for a comparison to be valid the difference in measures has to be greater than their combined uncertainties.
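The rule can be written as a one-line test. The sketch below assumes the simplest, most conservative way to combine uncertainties (adding them), and the lengths and ruler uncertainties are made-up numbers for illustration, not the values from the lecture.

```python
def difference_is_significant(a, b, u_a, u_b):
    """True when |a - b| is larger than the summed uncertainties."""
    return abs(a - b) > (u_a + u_b)

# Illustrative lengths in cm.
lying, standing = 185.5, 183.0

print(difference_is_significant(lying, standing, 0.1, 0.1))  # True
print(difference_is_significant(lying, standing, 2.0, 2.0))  # False
```

With a precise ruler the 2.5 cm difference is real; with a ruler that is only good to 2 cm, the same two numbers prove nothing.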

What does this have to do with DNA sequencing?

Each time we collect DNA sequence data we are making many measurements. That is, we are determining the bases of a DNA sample template in an in vitro replication process that allows us to "read" each base of the sequence. The measurements we collect, the string of DNA bases, therefore have uncertainty. We call this uncertainty in base measurement the error probability. In Sanger sequencing, Phil Green and Brent Ewing developed the Phred basecalling algorithm to measure per base error probabilities.

Error probabilities are small numbers (1/100, 1/10,000, 1/1,000,000). Rather than work with small fractions and decimal values with many leading zeros, we express error probabilities as positive whole integers, called quality values (QVs), by applying a transformation:

QV = -10*log10(P), where P is the error probability.

With this transformation our 1/100, 1/10,000, and 1/1,000,000 error probabilities become QVs of 20, 40, and 60, respectively.
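For readers who want to check the arithmetic, here is a short Python version of the transformation QV = -10*log10(P) and its inverse. The function names are my own; rounding to a whole number follows the "positive whole integers" convention described above.

```python
import math

def qv_from_error(p):
    """Quality value for an error probability p, rounded to an integer."""
    return round(-10 * math.log10(p))

def error_from_qv(qv):
    """Error probability that corresponds to a quality value."""
    return 10 ** (-qv / 10)

for p in (1 / 100, 1 / 10_000, 1 / 1_000_000):
    print(p, qv_from_error(p))

print(error_from_qv(30))  # about 0.001, i.e. one error in 1,000 bases
```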

The Phred basecalling algorithm has had a significant impact on DNA sequencing because it demonstrated that we could systematically measure the uncertainty of each base determination in a DNA sequence. Over the past 10 years, Phred quality values have been calibrated through many resequencing projects and are thus statistically powerful. An issue with Phred, and any basecaller, however, is that it must be calibrated for different electrophoresis instruments (measurement devices), and that is why different errors and error rates can be observed with different combinations of basecallers and instruments.

Sequencing redundancy also reduces error probabilities

The gold standard in DNA sequencing is to sequence both strands of a DNA molecule. This is for good reason. Each strand represents an independent measurement. If our measurements agree, they can be better trusted, and if they disagree one needs to look more closely at the underlying data, or remeasure. This concept was also incorporated into Green's assembly program Phrap (unpublished).

Within the high throughput genomics community it is well understood that increasing the redundancy of data collection reduces error. In theory, one can automate the interpretation of DNA sequencing experiments, or assays, by collecting data at sufficient redundancy. The converse is also true, and I see people work the hardest with manually reviewing data when they do not collect enough. This is most common with variant detection resequencing assays.
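A back-of-the-envelope sketch shows why redundancy helps. It assumes the measurements are truly independent, so the chance that all of them are wrong is the product of their error probabilities. This is a deliberate simplification for illustration and is not how Phrap or any particular assembler computes consensus quality.

```python
import math

def combined_error(*probs):
    """Probability that every independent measurement is wrong."""
    return math.prod(probs)

def qv(p):
    """Quality value, QV = -10*log10(P)."""
    return -10 * math.log10(p)

# Two agreeing reads, each with a 1/100 error probability (QV 20).
p_both_wrong = combined_error(0.01, 0.01)
print(round(qv(p_both_wrong)))  # 40
```

Under this assumption the quality values of agreeing reads simply add, so a handful of modest-quality reads can support a call that no single read could.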

Why isn't high redundancy data collection routine?

The challenges with high redundancy data collection in Sanger sequencing involve the high relative costs of collecting data and higher costs of collecting data from single molecules. Next generation (Next Gen) sequencing changes this landscape.

The higher throughput rates and lower costs of Next Gen sequencing hold great promise for revolutionizing genomics research and molecular diagnostics. In a single instrument run, an Expressed Sequence Tag (EST) experiment can yield millions of sequences and detect rare transcripts that cannot be found any other way [1-3]. In cancer research, high sampling rates will allow for the detection of rare sequence variants in populations of tumor cells that could be prognostic indicators or provide insights for new therapeutics [1, 4, 5]. In viral assays, it will be possible to determine the sequence of individual viral genomes and detect drug resistant strains as they appear [6, 7]. Next Gen sequencing has considerable appeal because the large numbers of sequences that can be obtained make statistical calculations more valid.

Making statistical calculations valid, however, requires that we understand the inherent uncertainty of our measuring device, in this case the different Next Gen genetic analyzers. That's where color space, flow space, and other spaces come into play.

Further Reading
1. Meyer, M., U. Stenzel, S. Myles, K. Prufer, and M. Hofreiter, Targeted high-throughput sequencing of tagged nucleic acid samples. Nucleic Acids Res, 2007. 35(15): p. e97.
2. Korbel, J.O., A.E. Urban, J.P. Affourtit, et al., Paired-end mapping reveals extensive structural variation in the human genome. Science, 2007. 318(5849): p. 420-6.
3. Wicker, T., E. Schlagenhauf, A. Graner, T.J. Close, B. Keller, and N. Stein, 454 sequencing put to the test using the complex genome of barley. BMC Genomics, 2006. 7: p. 275.
4. Taylor, K.H., R.S. Kramer, J.W. Davis, J. Guo, D.J. Duff, D. Xu, C.W. Caldwell, and H. Shi, Ultradeep bisulfite sequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing. Cancer Res, 2007. 67(18): p. 8511-8.
5. Highlander, S.K., K.G. Hulten, X. Qin, et al., Subtle genetic changes enhance virulence of methicillin resistant and sensitive Staphylococcus aureus. BMC Microbiol, 2007. 7(1): p. 99.
6. Wang, G.P., A. Ciuffi, J. Leipzig, C.C. Berry, and F.D. Bushman, HIV integration site selection: analysis by massively parallel pyrosequencing reveals association with epigenetic modifications. Genome Res, 2007. 17(8): p. 1186-94.
7. Hoffmann, C., N. Minkah, J. Leipzig, G. Wang, M.Q. Arens, P. Tebas, and F.D. Bushman, DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations. Nucleic Acids Res, 2007. 35(13): p. e91.

Thursday, March 13, 2008

What's a Bustard?

For that matter, what's a Firecrest? or a Gerald? Many with an Illumina Genome Analyzer are now learning these are the directories that have the data they may be interested in.

What's in those directories?

In this post, we explore some of the data in the directories, talk about what data might be important, and use FinchLab Next Gen Edition (FinchLab NG) to look at some of the files. In the Next Gen world we are also going to be learning about the data life cycle. When you are thinking about how to store three or four or ten terabytes (TB) of data for each run, and considering that you might run your instrument 40 or 50 times or more in the next year, you might stop and ask the question, "how much of that data is really important and for how long?" That's the data life cycle. It's going to be important.

To begin our understanding, let's look at the data being created in a run. When an Illumina Genome Analyzer (aka Solexa) collects data, many things happen. First, images are collected for each cycle in a run and tile in a lane on a slide. They're pretty small, but there are a lot, maybe 360,000 or so, and they add up to the terabytes we talk about. These images are analyzed to create tens of thousands (about 100 gigabytes [GB] worth) of "raw intensity files" that go in the Firecrest directory. Next, a base-calling algorithm reads the raw intensity files to create sequence, quality and other files (about 80 GB worth) that go in the Bustard directory. The last step is the Eland program pipeline. It reads the Bustard files, aligns their data to reference sequences, makes more quality calculations, and creates more files. These data go in the Gerald directory to give about 25 or 30 GB of sequence and quality data.

So, what's the best data to work with? That depends on what problem you are trying to solve. Specialists developing new basecalling tools or alignment tools might focus on the data in Firecrest and Bustard. Most researchers, however, are going to work with data in the Gerald directory. That reduces our TB problem down to a tens of GB problem. That's a big difference!
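To see what that difference means over a year, here is a quick calculation in Python. The per-run sizes are the rough figures quoted above (I picked 4 TB for the images and 30 GB for Gerald from the stated ranges), and 50 runs per year and 1 TB = 1000 GB are assumptions for the sake of the arithmetic.

```python
RUNS_PER_YEAR = 50   # assumed, from the "40 or 50 times or more" estimate
GB_PER_TB = 1000     # decimal units, for simplicity

# Approximate data per run, in GB, by stage.
per_run_gb = {
    "images": 4 * GB_PER_TB,
    "Firecrest": 100,
    "Bustard": 80,
    "Gerald": 30,
}

yearly_tb = {name: gb * RUNS_PER_YEAR / GB_PER_TB for name, gb in per_run_gb.items()}

for name, tb in yearly_tb.items():
    print(f"{name}: {tb} TB per year")
```

Keeping only the Gerald output turns a storage problem of hundreds of terabytes per year into one of a terabyte or two.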

FinchLab NG can help.

FinchLab NG gives you the LIMS capabilities to run your Next Gen laboratory workflows and track which samples go on which slides and where on the slide the samples go. We call this part the run. When a run is complete you can link the data on your filesystem to FinchLab NG and use the web interfaces to explore the data. You can also link specific data files to samples. So, if you are sharing data or operating a core lab your researchers can easily access their data through their Finch account.

The screen shot below gives an example of how the HTML quality files can be explored. It shows two windows; the one on the left is FinchLab NG with data for a Solexa run. You can see that the directory has 3606 files and a number are htm summary files. You can find these 12 files in that directory of 3606 files by entering "htm" in the Finch Finder. The window on the right was obtained by clicking the "Intensity Plots" link that is directly below the info table and just above the data list. In this example the intensity plots are shown for each tile of the 8 lanes on the slide. To see this better, click the image and zoom in with your browser.

Sunday, March 9, 2008

Using the Finder

FinchLab, iFinch, Finch Suite, and FinchLab Next Gen Edition all use a tool that we call the Finder to help you locate data by selected criteria.

This video shows some quick tips on using the Finder with iFinch as an example.

Using the finder from Sandra Porter on Vimeo.

Sunday, March 2, 2008

Genotyping with HDF

To continue our progress describing HDF and its value in bioinformatics, I present the work Geospiza and THG performed in developing a prototype application for genotyping. In this project we implemented a data model, based on polyPhred, in HDF5 to demonstrate HDF5's data organization capabilities. We further demonstrated HDF5's strengths for compressing and accessing data by adding HapMap genotype data sets and data from chromosome level linkage disequilibrium calculations.

Organizing Data - A resequencing project begins with a region of a genome, a gene or set of genes that will be studied. A researcher will have sets of samples from patient populations from which they will isolate DNA. PCR is used to amplify specific regions of DNA so that both chromosomal copies can be sequenced. The read data, obtained from chromatograms, are aligned to other reads and also to a reference sequence. Quality and trace information are used to predict whether the differences observed between reads and reference data are meaningful. The general data organization can be broken into a main part called the Gene Model and within it, two sub organizations: the reference and experimental data. The reference consists of the known state of information. Resequencing, by definition, focuses on comparing new data with a reference model. The reference model organizes all of the reference data including the reference sequence and annotations.

The sub organizations of data can be stored in HDF5 using its standard features. The two primary objects in HDF5 are "groups" and "datasets." The HDF5 group object is akin to a UNIX directory or Windows folder – it is a container for other objects. An HDF5 dataset is an array structure for storing data and has a rich set of types available for defining the elements in an HDF dataset array. They include simple scalars (e.g., integers and floats), fixed- and variable-length structures (e.g., strings), and "compound datatypes." A compound datatype is a record structure, whose fields can be any other datatype, including other compound types. Datasets and groups can contain attributes, which are simple name/value pairs for storing metadata. Finally, groups can be linked to any dataset or group in an HDF file, making it possible to show relationships among objects. These HDF objects allowed us to create an HDF file whose structure matched the structure and content of the Gene Model structure. Although the content of a Gene Model file is quite extensive and complex, the grouping and dataset structure of HDF makes it very easy to see the overall organization of the experiment. Since all the pieces are in one place, an application, or someone browsing the file, can easily find and access the particular data of interest. The figure to the left shows the HDF5 file structure we used. The ovals represent groups and the rectangles represent datasets. The grayed out groups (Genotyping, Expression, Proteomics) were not implemented.
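The relationship between groups, datasets, compound datatypes, and attributes can be modeled in a few lines of plain Python. To be clear, this is not the HDF5 API and not our prototype's code; it is a toy model with hypothetical class names and made-up values, meant only to show how the pieces nest.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

@dataclass
class Dataset:
    """Array of records; 'fields' plays the role of a compound datatype."""
    fields: Tuple[str, ...]
    rows: List[tuple] = field(default_factory=list)
    attrs: Dict[str, str] = field(default_factory=dict)

@dataclass
class Group:
    """Container for other groups and datasets, like a directory."""
    children: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, str] = field(default_factory=dict)

    def get(self, path: str):
        """Walk a slash-separated path such as 'reference/annotations'."""
        node = self
        for name in path.strip("/").split("/"):
            node = node.children[name]
        return node

# A Gene Model with its two sub organizations.
gene_model = Group(attrs={"gene": "ADRB2"})
gene_model.children["reference"] = Group()
gene_model.children["experimental"] = Group()

# Made-up annotation record, for illustration only.
gene_model.children["reference"].children["annotations"] = Dataset(
    fields=("name", "start", "end"),
    rows=[("exon_1", 100, 250)],
    attrs={"source": "hypothetical"},
)

annotations = gene_model.get("reference/annotations")
print(annotations.fields, annotations.rows[0])
```

Because everything hangs off one root object, a program can reach any piece of data by path, which is the property that made the single-file Gene Model easy to browse.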

Accessing the Data - HDF5's feasibility, and several advantageous features, are demonstrated by a screen shot obtained using HDFView, a cross platform Java-based application that can be used to view data stored in HDF5. The image below highlights the ability of HDF5, and HDF5-supporting technologies, to meet the following requirements:

Support combining a large number of files
Provide simple navigation and access to data objects
Support data analysis
Integrate diverse data types

The left most panel of the screen (below) presents an "explorer" view of four HDF5 files (HapMap, LD, ADRB2, and FVIII), with their accompanying groups and datasets. Today, researchers store these data in separate files scattered throughout file systems. To share results with a colleague, they e-mail multiple spreadsheets or tab-delimited text files for each table of data. When all of the sequence data, basecall tables, assemblies, and genotype information are considered, the number of files becomes significant. For ADRB2 we combined the data from 309 individual files into a single HDF5 file. For FVIII, a genotyping study involving 39 amplicons and 137 patient samples, this number grows to more than 60,000 primary and versioned copied files.

With HDF5 these data are encapsulated in a single file, thus simplifying data management and increasing data portability.

Example screen from the prototype demo. HDFView, a JAVA viewer for HDF5 files can display multiple HDF5 files and for each file, the structure of the data in the file. Datasets can be shown as tables, line plots, histograms and images. This example shows a HapMap dataset, LD calculations for a region of chromosome 22 and the data from two resequencing projects. The HapMap dataset (upper left) is a 52,636-row table of alleles from chromosome 22. Below it is an LD plot from the same chromosome. The resequencing projects, adrb2 and factor 8, show reference data and sequencing data. The table (middle) is a subsection of the poly table obtained from Phred. Using the line plot feature in HDFView, sub sections of the table were graphed. The upper right graph compares the called base peak areas (black line, top) to the uncalled peak areas (red, bottom) for the entire trace. The middle right graph highlights the region between 250 and 300 bases. The large peak at position 36 (position 286 in the center table, and top graph) represents a mixed base. The lower right graph is a "SNPshot" showing the trace data surrounding the variation.

In addition to reducing file management complexity, HDF5 and HDFView have a number of data analysis features that make it possible to deliver research-quality applications quickly. In the ADRB2 case, the middle table in the screen shot is a section of one of the basecall tables produced by Phred using its "-d" option. This table was opened by selecting the parent table and defining the columns and region to display. As this is done via HDF5's API, it is easy to envision a program "pulling" relevant slices of data from a data object, performing calculations from the data slices and storing the information back as a table that can be viewed from the interface. Also important, the data in this example are accessible and not "locked" away in a proprietary system. Not only is HDF an open format, HDFView allows one to export data as HDF subsets, images, and tab delimited tables. HDFView's copy functions allow one to easily copy data into other programs like Excel.

HDFView can produce basic line graphs that can be used immediately for data analysis, such as the two that are shown here for ADRB2. The two plots, in the upper right corner of the screen show the areas of the peaks for called (black, upper line) and uncalled (red, lower line) bases. The polymorphic base can be seen in the top plot as a spike in the secondary peak area. The lower graph contains the same data, plotted from the region between base 250 and 300 of the read. This plot shows a high secondary peak with a concomitant reduction in the primary peak area. One of PolyPhred's criteria for identifying a heterozygous base, that primary and secondary peak areas are similar, is easily observed. The significance of this demonstration is that HDF5 and HDFView have significant potential in algorithm development, because they can provide rapid access to different views of the data.
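The "similar primary and secondary peak areas" observation can be turned into a simple screen. The sketch below is only an illustration of that one idea: the 0.5 ratio threshold and the peak areas are values I made up, and PolyPhred's actual scoring uses more criteria than this.

```python
def candidate_heterozygotes(primary, secondary, min_ratio=0.5):
    """Return positions where the secondary peak area is at least
    min_ratio of the primary (called) peak area."""
    hits = []
    for i, (p, s) in enumerate(zip(primary, secondary)):
        if p > 0 and s / p >= min_ratio:
            hits.append(i)
    return hits

# Made-up peak areas for four consecutive bases.
primary = [1000, 980, 520, 1010]
secondary = [40, 55, 480, 35]

print(candidate_heterozygotes(primary, secondary))  # [2]
```

Position 2 stands out because the called peak drops while the uncalled peak rises, the same pattern visible in the two plots.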

More significantly, HDFView was used without any modifications, demonstrating the benefit of a standard implementation system like HDF5.

Combining Diverse Data - The screen shot also shows the ability of HDF5 to combine and present diverse data types. Data from a single file containing both SNP discovery projects are shown, in addition to HapMap data (chromosome 22) and an LD plot consisting of a 1000 x 1000 array of LD values from a region of chromosome 22.

As we worked on this project, we became more aware of the technology limitations that hinder work with the HapMap and very large LD datasets and concluded that the HapMap data would provide an excellent test case for evaluating the ability of HDF5 to handle extremely large datasets.

For this test, a chromosome 22 genotype dataset was obtained from HapMap.org. Uncompressed, this is a large file (~24 MB), consisting of a row of header information followed by approximately 52,000 rows of genotype data. As a text file, it is virtually indecipherable and needs to be parsed and converted to a tabular data structure before the data can be made useful. When one considers that, even for the smallest chromosome, the dataset is close to Microsoft Excel's (2003, 2004) row limit (65,536), the barrier to entry for the average biologist wishing to use these data is quite high.

To put the data into HDF5 we made an XML schema of the structure to understand the model, built a parser, and loaded the data. As can be seen in HDFView (Fig. 6), the data were successfully converted from a difficult-to-read text form to a well-structured table where all of the data can be displayed. At this point, HDFView can be used to extract and analyze subsets of information.
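
The parse step described above can be sketched roughly as follows. This is a minimal illustration, not the parser we built: it assumes a whitespace-delimited text file with one header row followed by one row per SNP, and the column names, SNP identifiers, and positions in the sample are made up for the example. A real loader would write the rows into an HDF5 table instead of keeping them in a Python list.

```python
def parse_genotype_text(text):
    """Split a header row plus data rows into (column_names, rows).

    Assumes whitespace-delimited fields; each row becomes a dict keyed
    by the column names taken from the header row.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split()
        if len(fields) != len(header):
            raise ValueError(
                f"line {number}: expected {len(header)} fields, got {len(fields)}"
            )
        rows.append(dict(zip(header, fields)))
    return header, rows


# Made-up sample in the assumed layout (not real HapMap records).
SAMPLE = """\
snp_id alleles chrom pos sample_1 sample_2
rs0001 A/G chr22 14430000 AG AA
rs0002 C/T chr22 14431000 CC CT
"""

header, rows = parse_genotype_text(SAMPLE)
for row in rows:
    print(row["snp_id"], row["pos"], row["sample_1"], row["sample_2"])
```

Rejecting rows whose field count does not match the header is a deliberate choice here: a silently misaligned row would put genotypes under the wrong sample.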

Next, we asked if HDF5 could be used to observe long-range LD at the chromosome level. This is an important question that cannot be answered by current technology. Using the r² algorithm, we computed the LD values for the 53,000 SNPs in chromosome 22 and produced a 53,000 x 53,000 array of values. These data would require 5.2 gigabytes using a conventional file format. Since most of the values in this array are "0", the file compresses quite well. However, with conventional "gzip" methods, the file must be uncompressed in order to be displayed, even if one wants to display only a small part of the entire image. Not only does such an operation take a long time, but common computer configurations lack sufficient memory to hold such a large uncompressed image.
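
To make the numbers concrete, here is a small sketch of the r² statistic for a single pair of SNPs, followed by the size arithmetic for the full array. It uses the textbook definition for phased, biallelic haplotypes; the code we actually ran is not shown here, and the two-bytes-per-value figure is an inference from the stated 5.2 gigabytes, not something recorded with the dataset.

```python
def r_squared(haplotypes):
    """r^2 between two biallelic loci.

    `haplotypes` is a list of (a, b) pairs, where a and b are 0 or 1 and
    give the allele carried at the first and second locus.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n            # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n            # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic locus leaves r^2 undefined; report 0 in this sketch
    d = p_ab - p_a * p_b                               # linkage disequilibrium coefficient D
    return d * d / denom


linked = [(1, 1), (1, 1), (0, 0), (0, 0)]     # alleles always travel together
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]   # every combination equally common
print(r_squared(linked), r_squared(unlinked))

# Size of the full pairwise array, assuming 2 bytes per stored value.
n_snps = 53_000
n_values = n_snps * n_snps
size_gib = n_values * 2 / 2**30
print(f"{n_values:,} values, about {size_gib:.1f} GiB uncompressed")
```

Each additional SNP adds a full row and column, so the array grows with the square of the SNP count, which is why even the smallest chromosome produces a multi-gigabyte file.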

The LD test demonstrates the power of HDF5's collection of sophisticated storage structures. We have seen that we can compress datasets inside an HDF5 file, but we also see that compressing an entire dataset creates access problems. HDF5 deals with this problem through a storage feature called "chunking." Whereas most file formats store an array as a contiguous stream of bytes, chunking in HDF5 involves dividing a dataset into equal-sized "chunks" that are stored and accessed separately.

LD plot for chromosome 22. A 1000 x 1000 point array of LD calculations is shown. The table in the upper right shows the LD data for a very small region of the plot, demonstrating HDF's ability to allow one to easily select "slices" of datasets.

Chunking has many important benefits, two of which apply particularly to large LD arrays. First, chunking makes it possible to achieve good performance when accessing subsets of a dataset, even when the chosen subset is orthogonal to the normal storage order of the dataset. If a very large LD array is stored contiguously, most subsetting operations require a large number of accesses, because data elements that are close together in the array can be stored a long way from one another on disk, requiring many seeks and reads. With chunking, spatial locality is preserved in the storage structure, resulting in faster access to subsets. Second, when a data subset is accessed, only the chunks that contain the requested portions of the data need to be uncompressed. The extra memory and time required to uncompress the entire LD array are both avoided.
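
The idea can be illustrated without HDF5 at all. The sketch below is a toy model, not HDF5's implementation: it cuts a small square matrix into fixed-size tiles, compresses each tile separately with Python's zlib module, and then reads a box-shaped selection by decompressing only the tiles the box touches. The chunk size, the matrix size, and the function names are all arbitrary choices for the example.

```python
import zlib

CHUNK = 16  # tile edge length; arbitrary here, chosen by the writer in a real file


def write_chunked(matrix):
    """Compress each CHUNK x CHUNK tile of a square matrix of byte values.

    The matrix size must be a multiple of CHUNK. Returns a dict mapping
    (tile_row, tile_col) to the compressed bytes of that tile.
    """
    n = len(matrix)
    if n % CHUNK:
        raise ValueError("matrix size must be a multiple of CHUNK")
    store = {}
    for r0 in range(0, n, CHUNK):
        for c0 in range(0, n, CHUNK):
            tile = bytes(
                matrix[r][c]
                for r in range(r0, r0 + CHUNK)
                for c in range(c0, c0 + CHUNK)
            )
            store[(r0 // CHUNK, c0 // CHUNK)] = zlib.compress(tile)
    return store


def read_box(store, row_start, row_stop, col_start, col_stop):
    """Return the selected box and the number of tiles that were decompressed."""
    cache = {}
    box = []
    for r in range(row_start, row_stop):
        row = []
        for c in range(col_start, col_stop):
            key = (r // CHUNK, c // CHUNK)
            if key not in cache:  # decompress each tile at most once
                cache[key] = zlib.decompress(store[key])
            row.append(cache[key][(r % CHUNK) * CHUNK + (c % CHUNK)])
        box.append(row)
    return box, len(cache)


# A mostly-zero 64 x 64 matrix with a strong diagonal, standing in for an LD array.
N = 64
matrix = [[0] * N for _ in range(N)]
for i in range(N):
    matrix[i][i] = 255

store = write_chunked(matrix)
compressed_bytes = sum(len(blob) for blob in store.values())

box, touched = read_box(store, 20, 28, 20, 28)        # small square region
column, column_touched = read_box(store, 0, N, 5, 6)  # one full column
print(f"{len(store)} tiles, {compressed_bytes} compressed bytes")
print(f"box read touched {touched} tile(s); column read touched {column_touched}")
```

The column read is the "orthogonal" case: in a row-by-row layout it would visit every row of the array, while here it touches only the tiles in one column of the tile grid.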

Using both chunking and compression, HDF5 compressed the data in the chromosome 22 LD array from 5.2 gigabytes to 300 megabytes, a 17-fold decrease in storage space. Furthermore, the LD array was then immediately available for viewing in HDFView, where it was converted to an image with color intensity used to show higher linkage. The display also shows a table of LD values corresponding to a subset of the larger LD plot. In HDFView, one can "box" select a region of pixels from an image and use it to create a subset of data. This is an important feature, as it will be impossible to view an entire chromosome LD plot at single pixel resolution. Thus, a matrix of lower resolution regions will need to be created and viewed in HDFView. The lower resolution image can highlight regions of high LD and, using a tool like HDFView, one can then select those regions and drill down into the underlying data.
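
One way to build the lower-resolution matrix is to reduce each block of the full array to a single value. The sketch below takes the maximum of each block so that an isolated region of high LD still shows up in the overview; that choice, the function name, and the tiny 8 x 8 example are assumptions for illustration, not a description of what HDFView does internally.

```python
def downsample_max(matrix, factor):
    """Shrink a square matrix by taking the maximum of each factor x factor block."""
    n = len(matrix)
    if n % factor:
        raise ValueError("matrix size must be a multiple of factor")
    return [
        [
            max(
                matrix[r][c]
                for r in range(r0, r0 + factor)
                for c in range(c0, c0 + factor)
            )
            for c0 in range(0, n, factor)
        ]
        for r0 in range(0, n, factor)
    ]


# Toy LD array: all zeros except one strongly linked pair.
ld = [[0.0] * 8 for _ in range(8)]
ld[5][6] = 0.9

overview = downsample_max(ld, 4)  # 8 x 8 reduced to 2 x 2

# Find overview cells worth drilling into, then map them back to full-resolution rows.
hot_cells = [
    (i, j)
    for i, row in enumerate(overview)
    for j, value in enumerate(row)
    if value > 0.8
]
for i, j in hot_cells:
    print(f"overview cell ({i}, {j}) covers rows {i * 4}-{i * 4 + 3}, cols {j * 4}-{j * 4 + 3}")
```

Averaging each block instead would dilute a single high value across the block, so the maximum is the safer summary when the goal is to find regions to drill into.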

HDF5 has many practical benefits for bioinformatics. Because it is a standardized file technology, data models can be implemented quickly, and tools like HDFView can be used to visualize the organization of the data and the results of computation. Computational scientists can develop new algorithms faster because they do not have to invest time developing new formats and new GUIs to view their data. The community benefits because data become more portable. Finally, HDF is well suited for enhancing application performance through data compression, chunking, and memory mapping. Many of these features will become extremely valuable as Next Gen technologies push the volumes of data to higher and higher levels.