"The data sets are astronomical," "the data that needs to be attached to sequences is unbelievable," and "browsing [data] is incomprehensible." These are just three of the many quotes I heard about the challenges associated with DNA sequencing last week at the "Finishing in the Future Meeting" sponsored by the Joint Genome Institute (JGI) and Los Alamos National Laboratory (LANL).
The two and half day conference, focused on finishing genomic sequences, kicked off with a session on metagenomics. Metagenomics is about isolating DNA from environments and sequencing random molecules to "see what's out there." Excitement for metagenomics is being driven by Next Gen sequencing throughput, because so many sequences can be collected relatively inexpensively. A benefit of being able to collect such large data sets is that we can interrogate organisms that can cannot be cultured. The first talk, "Defining the Human Microbiome: Friends or Family," was presented by Bruce Birren from the Broad Institute of MIT & Harvard. In this talk, we learned about the HMP (Human Microbiome Project), a project dedicated to characterizing the microbes that live on our bodies. It is estimated that microbial cells out number our cells by ten to one. It has long been speculated that our microbiomes are involved in our health and sickness and recent studies are confirming these ideas.
Sequencing technologies continue to increase data throughput
The afternoon session opened with presentations from Roche (454), Illumina, and Applied Biosystems on their respective Next Gen sequencing platforms. Each company presented the strengths of their platform and new discoveries that are being made by virtue of having a lot of data. Each company also presented data on improvements designed to produce even more data and road maps for future improvement to produce even more data. As Haley Fiske from Illumina put it, "we're in the middle of an arms race!" Finally, all the companies are working on molecular barcodes, so that multiple samples can be analyzed within an experiment. So, we started with a lot of data from a sample and are going to a lot of data from a lot of samples. That should add some very nice complexity to sample and data tracking.
A unique perspective
Sydney Brenner opened the second day with a talk on "The Unfinished Genome." The thing I like most about a Sydney Brenner talk is how he puts ideas together. In this talk he presented how one could look at existing data and literature to figure things out or make new discoveries. In one example, he speculated on when the genes for eye development may have first appeared. From the physiology of the eye you can use the biochemistry of vision to identify the genes that encode the various proteins involved in the process. These proteins are often involved in other process, but differ slightly. They arise from gene duplication and modification. So, you can look at gene duplications and measure the age of a duplication by looking at neighboring genes. If a duplication event is old, neighboring genes will be unequal distances apart. You can use this information, along with phylogenetic data, to estimate when the events occurred. Of course this kind of study benefits from more sequence data. Sydney encouraged everyone to keep sequencing.
Sydney closed his talk by making a fun analogy where genomics is like astronomy and thus should have been called "genomy." He supported his analogy by noting that astronomy has astro physic and genomics has genetics. Both are quantitative and measure history and evolution. Astronomy also has astrology, the prediction of an individual's future from the stars. Similarly, folks would like to predict an individual's future from their genes and suggested we call this work "Genology," since it has the same kind of scientific foundation as astrology.
Challenges and solutions
The rest of the conference and posters focused on finishing projects. Today the genome centers are making use of all the platforms to generate large data sets and finish projects. A challenge for genomics is lowering finishing costs. The problem being that generating "draft" data has become so inexpensive and fast that finishing has become a signifiant bottleneck. Finishing is needed to produce the high quality referece sequences that will inform our genomic science, so investigarting ways to lower finishing costs is a worthwhile endeavour. Genome centers are approaching this problem by looking at ways to mix data from different technologies such as 454 and Illumina or SOLiD. They are also developing new and mixed software approaches such as combining multiple assembly algorithms to improve alignments. These efforts are being conducted in conjunction with experiments where mixtures of single pass and paired read data sets are tested to determine optimal approaches for closing gaps.
The take home from this meeting is that, over the coming years, a multitude of new approaches and software programs will emerge to enable genome scale science. The current technology providers are aggressively working to increase data throughput, data quality and read length to make their platforms as flexible as possible. New technology providers are making progress on even higher throughput platforms. Computer scientists are working hard on new algorithms and data visualizations to handle the data. Molecular barcodes will allow for greater numbers of samples per data collection event and increase sample tracking complexity.
The bottom line
Individual research groups will continue to have increasing access to "genome center scale" technology. However, the challenges with sample tracking, data management, and data analysis will be daunting. Research groups with interesting problems will be cut off from these technologies unless they have access to cost-effective, robust informatics infrastructures. They will need help setting up their labs, organizing the data, and making use of new and emerging software technologies.
That's where Geospiza can help.