Tuesday, April 7, 2009

Have we been deluged?

In early March, Science published a Perspective, “Beyond the Data Deluge” [1]. Last October, Nature Biotechnology (NBT) published an editorial, “Prepare for the Deluge” [2]. Did we miss the deluge?

Perhaps we are being deluged now.

Both articles center on a common theme: scientists must now work with more data than ever before. The NBT editorial focused on Next Generation DNA Sequencing (NGS) and covered many of the issues that we’ve also identified as important [3]. Selected quotes illustrate the challenges:
“Beyond the data management issues, significant computational infrastructure is also required.”

“In this respect, the open SOLiD software development community (http://solidsoftwaretools.com/gf/), which in July recruited two new software companies, is a step in the right direction.”

“For users, it would also be beneficial if data formats and the statistics for comparing performance of the different instruments could be standardized.”

And one of my favorites: “At the moment, the cost of bioinformatics support threatens to keep the technology in the hands of the few. And this powerful technology clearly needs to be in the hands of many. If the data problem is not addressed, ABI’s SOLiD, 454’s GS FLX, Illumina’s GAII or any of the other deep sequencing platforms will be destined to sit in their air-conditioned rooms like a Stradivarius without a bow.”

We [biologists] are not alone.

Data-intensive science

The Science article presented a bigger picture. Written by Gordon Bell, Tony Hey, and Alex Szalay, luminaries in computer science and astronomy, it points out that fields like astronomy and particle physics now have experiments generating petabytes of data per year. In bioinformatics, they noted how the extreme heterogeneity of data challenges scientists, and how, in molecular biology, traditional hypothesis-led approaches are being replaced by data-intensive, inductive approaches. That is, collect a ton of data, observe patterns, and make discoveries. They go on to discuss how data “born digital” proliferate in files, spreadsheets, and databases; how those data get scattered across hard drives, Web sites, blogs, and wikis; and how managing and curating these digital data is becoming increasingly burdensome. Ever have to move stuff around to make space, or spend too much time locating a file? What about your backups?

The article continues by discussing how data production is outpacing Moore’s Law (which states that computer processors double in speed roughly every 18 months), and how new kinds of clustered computing systems are needed to process the increasing flood of data. The late Jim Gray was a visionary in recognizing the need for new computer tools; he defined data-intensive science as a fourth research paradigm.
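
The arithmetic behind "data outpacing Moore's Law" is simple compound growth. The sketch below is illustrative only: the 18-month processor doubling period comes from the usual statement of Moore's Law, while the 6-month data doubling period is an assumed figure for illustration, not one taken from either article.

```python
# Illustrative arithmetic only. The 6-month data doubling period is an
# assumption chosen for illustration, not a figure from the articles.
def growth_factor(months, doubling_period_months):
    """Compound growth: capacity doubles once per doubling period."""
    return 2 ** (months / doubling_period_months)

years = 3
cpu = growth_factor(years * 12, 18)   # Moore's Law: double every ~18 months
data = growth_factor(years * 12, 6)   # assumed: data doubles every 6 months

print(f"Over {years} years: processing power x{cpu:.0f}, data volume x{data:.0f}")
# Even a modest difference in doubling periods compounds into a large gap,
# which is why faster processors alone cannot close it.
```

Under these assumptions, three years yields a 4x gain in processing power against a 64x growth in data, a 16-fold shortfall that only different computing architectures (clusters, clouds) can absorb.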

The discussion presented a number of challenges to data-intensive science. These relate to moving and accessing data, subtleties of organizing and defining data (also called schemas, ontologies, and semantics), and non-standard uses of technology. One cited example involved large database systems that hold only pointers to files, rather than the data within the files, making direct analysis of the data impractical. A case study of the astronomy community showed how one discipline has successfully transitioned to data-intensive science.
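
The pointer-to-file pattern is easy to see in a few lines of SQL. The schema below is hypothetical (table and column names are mine, invented for illustration); it shows why a database that stores only paths can answer questions about metadata but not about the data itself.

```python
import sqlite3

# Hypothetical schema illustrating the pattern the article critiques: each
# row stores only a pointer (a file path) to results on disk, so the reads
# themselves live outside the database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE runs (
        run_id    INTEGER PRIMARY KEY,
        sample    TEXT,
        data_path TEXT   -- pointer to a file; the sequence data is not here
    )
""")
conn.execute("INSERT INTO runs VALUES (1, 'sample_A', '/data/runs/run1.fastq')")

# SQL can answer questions about the metadata...
row = conn.execute(
    "SELECT data_path FROM runs WHERE sample = 'sample_A'"
).fetchone()
print(row[0])  # the database knows where the data is, not what it contains

# ...but counting reads or filtering by quality means leaving SQL entirely
# and parsing each referenced file one by one, which is what makes direct
# analysis impractical at scale.
```

Querying across millions of reads then requires opening every referenced file, which is the bottleneck projects like BioHDF aim to remove by making the data itself directly addressable.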

In closing, the authors note that while data-intensive science requires specialized skills and analysis tools, the research community has reason for optimism: today, more options exist for accessing storage and computing resources on demand through “cloud services” being built by the IT industry. Cloud services provide high-bandwidth, cost-effective access to computing capabilities that are beyond the financial scope of individual universities and national laboratories. According to the authors, this model offers an attractive solution to the scale problem, but it requires radical thinking to make optimal use of these services. Bell and colleagues close with the following quote:
“In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization, and cloud computing technologies.”
OK, we're radical

Geospiza has been helping labs and research groups meet current and future bioinformatics challenges through a number of creative approaches. For example, Nature Biotechnology lauded the SOLiD community program; Geospiza was one of those first two companies. Geospiza also participates actively in the community through consortium activities like SEQC, which aims to develop standardized approaches to working with NGS systems and data. We have useful ways of analyzing and visualizing information in NGS data, and our BioHDF project is tackling the deeper issues of working with large amounts of extremely heterogeneous data and high-performance computing, so that we can go beyond simply referencing files from databases and analyze these data more extensively than is currently possible. Finally, our GeneSifter products have used cloud computing from the very beginning, and we have been successfully extending that model to NGS.

References and additional reading

[1] Bell G., Hey T., Szalay A. 2009. Beyond the data deluge. Science 323, 1297-1298.
[2] 2008. Prepare for the deluge. Nat Biotechnol 26, 1099.

[3] FinchTalks:
Journal Club: Focus on Next Gen Sequencing
The IT problem
The bioinformatics problem

BioHDF FinchTalks:
BOSC Abstract and Presentation
Introducing BioHDF
The case for HDF
Genotyping with HDF
