Tuesday, April 28, 2009

Life in the Clouds

Today, the Applied Biosystems division of Life Technologies announced a partnership with us (Geospiza) to use Amazon Web Services (AWS) cloud-computing technologies to help customers manage data from advanced genomic analysis platforms.

This news is significant for several reasons.

First, as noted by our President, Rob Arnold, in Xconomy, this is the first time a leading gene sequencing company has agreed to offer the sequencing instrument, the consumable chemicals needed to run experiments, and the software needed to sort through and make sense of the data, in a single package and run the software under a SaaS model.

Second, through this news and other activities, we are proactively addressing one of the major challenges of Next Generation Sequencing (NGS): the costs and time for purchasing, deploying, and maintaining IT systems to support NGS data management and analysis are simply out of reach for the vast majority of research groups that could benefit from these technologies. Presently, the groups making the greatest progress with NGS have advanced bioinformatics and IT support. However, if these technologies are going to truly meet their promise of revolutionizing genomics, the number of scientists using NGS needs to increase. All over the country (and world), medical researchers, microbiologists, plant biologists, and other scientists have interesting samples and new ideas for which NGS experiments could provide amazing discoveries, but they can only follow through on those ideas if they can work with their data in a reasonable, cost-effective way.

Finally, today's news is gratifying because it validates one of Geospiza's important early technology decisions. Cloud computing, also called "Software as a Service" (SaaS), is not new to Geospiza. When we started in 1997, we made an important decision to develop our platform as a web-based system, and we have been working with web technology and Internet-based services from the beginning. Back then we used the term ASP (Application Service Provider) instead of SaaS, but our clients have been effectively using our systems this way for many years. Our long experience with cloud computing has prepared us to meet the new challenges created by NGS, and we look forward to working with Applied Biosystems on the AWS platform to extend the options we can provide to our customers, lower their computing costs, and enable their science.

For more information visit:

Geospiza Cloud
Press Release

FinchTalks where SaaS is discussed:

Have we been deluged?
Three Themes from AGBT and ABRF Part III: The IT Problem
Next Gen-Omics

Focus on Next Gen Sequencing
Closing 2008

Tuesday, April 21, 2009

What if dbEST was an NGS Experiment? Part I: dbEST

Back in 1998, this alarming statement appeared in a paper [1]:

“Biological research is generating data at an explosive rate. Nucleotide sequence databases alone are growing at a rate of >210 million base pairs (bp)/year and it has been estimated that if the present rate of growth continues, by the end of the millennium the sequence databases will have grown to 4 billion bp!” [emphasis mine]

Imagine 4 billion bp of data - what would we do with all that?

The article was about the now-defunct Merck Gene Index browser, which was developed to make massive numbers of cDNA sequences, also called Expressed Sequence Tags (ESTs), available through a web-based system. The ESTs were being generated through the Merck Gene Index Project, one of many public and private projects focused on collecting EST and full-length cDNA sequences from human and model organism samples. The goal of these projects was to create data resources of transcript sequences for studying gene expression and, later, for finding genes in genomic sequence data. Combined, these projects cost tens of millions of dollars and spanned nearly a decade. They also produced millions of ESTs that are now stored in NCBI’s dbEST database [2].

And the prediction of GenBank’s growth was close: release 115 of GenBank (Dec. 1999) contained 4.6 billion bases. With the most recent release, nine years later, GenBank has grown to over 103 billion bases, and some would say we are just getting started with sequencing DNA [3].
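Those two data points support a quick back-of-the-envelope calculation of GenBank's growth rate. A minimal sketch (using only the base counts quoted above; the derived rate and doubling time are arithmetic, not figures from the release notes):

```python
import math

# Base counts quoted in the post: GenBank release 115 (Dec. 1999)
# versus the most recent release, nine years later.
bases_1999 = 4.6e9
bases_2008 = 103e9
years = 9

growth_factor = bases_2008 / bases_1999            # overall growth over 9 years
annual_rate = growth_factor ** (1 / years) - 1     # compound annual growth rate
doubling_time = math.log(2) / math.log(1 + annual_rate)

print(f"{growth_factor:.1f}x growth, {annual_rate:.0%}/year, "
      f"doubling roughly every {doubling_time:.1f} years")
```

By this estimate the database grew about 22-fold over the period, roughly 41% per year, doubling about every two years.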

Today, for a few thousand dollars, a single run of an Illumina, SOLiD, or Helicos instrument can collect more data than all the EST projects combined have ever produced. This raises the question: what would the data look like if dbEST was a Next Generation Sequencing (NGS) experiment?

A Brief History of dbEST

Before we get into comparing dbEST to a Next Generation DNA Sequencing (NGS) experiment, we should discuss what dbEST is and how it came to be. In the early days of automated DNA sequencing (ca. 1990), it was realized that cDNA, reverse transcribed from mRNA, could be partially sequenced and the resulting data used to measure which genes are expressed in a cell or tissue. The term EST, for Expressed Sequence Tag, was coined to describe the fact that each sequence corresponded to an mRNA molecule and was, in effect, a “tag” for that molecule [4].

During the early years, EST sequencing was controversial. Many proponents of the genome project feared that collecting ESTs would obviate the need for sequencing the entire genome, and that Congress would end funding for the genome project before it was complete. Further controversy arose when the NIH decided to patent several of the early brain ESTs. This news created an uproar in the community and led to the famous statement by one Nobel laureate that automated sequencing machines “could be run by monkeys” [5].

ESTs also led to the founding of dbEST [2], a valuable resource for quickly assessing the functional aspects of the genome and later for identifying and annotating genes within genomic sequences. Today, EST projects continue to be worthwhile endeavors for exploring new organisms before full genome sequencing can be performed.

In the 15+ years since the founding of dbEST, the database has grown from 22,537 entries to approximately 61 million (4/17/2009). The first dbEST report contained ESTs from seven organisms; today, over 1700 organisms are represented. The species with the highest numbers of ESTs (>1,000,000) include human, mouse, corn, pig, Arabidopsis, cow, zebrafish, soybean, Xenopus, rice, Ciona, wheat, and rat. More than half of the species, however, have fewer than 10,000 ESTs. Since January of this year, dbEST has grown by more than 2,000,000 entries.

Despite its value, dbEST, like many resources at the NCBI, requires an “expert” level of understanding to be useful. As classical clone-based cDNA sequencing gives way to more cost-effective, higher-throughput methods like NGS, less emphasis will be placed on making this resource useful beyond maintaining the data as an archive the community can access.

What this means is that when you visit the site, it does not look like much is there. You can get links to the original (closed-access) papers and learn how many sequences are present for each organism. Accession numbers or gene names can be used to look up a sequence, and from other pages you can use BLAST to search the resource with a query sequence.

If you want to know more, you have to know how to look for the information and deal with it in the context in which it is presented. For example, I mentioned that dbEST has grown since January. I knew this because I looked at the list of organisms and numbers of sequences then and now, and noticed that more are reported now. However, to tell you where numbers have increased for which organisms, or whether new organisms have been added, would require significant time and effort, either saving the different release reports or digging through the dbEST ftp site. When we return to the story, we’ll do some "ftp archaeology" and dig through dbEST records to begin characterizing the human ESTs.
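The comparison described above is straightforward once the release reports are in hand. A minimal sketch, assuming the organism counts have already been parsed from two reports into dictionaries (the parsing step and the toy numbers below are made up; dbEST report formats are not shown here):

```python
# Given organism -> EST count mappings from two dbEST release reports,
# find organisms that are new and organisms whose counts grew.
def diff_counts(old, new):
    added = {org: n for org, n in new.items() if org not in old}
    grown = {org: n - old[org] for org, n in new.items()
             if org in old and n > old[org]}
    return added, grown

# Hypothetical counts for illustration only.
january = {"Homo sapiens": 100, "Mus musculus": 80}
april = {"Homo sapiens": 120, "Mus musculus": 80, "Zea mays": 10}

added, grown = diff_counts(january, april)
print(added)   # organisms appearing since the older report
print(grown)   # per-organism increase in EST counts
```

With real release reports, the same diff would answer exactly the questions posed above: which organisms are new, and where the counts increased.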


1. Eckman B.A., Aaronson J.S., Borkowski J.A., Bailey W.J., Elliston K.O., Williamson A.R., Blevins R.A., 1998. The Merck Gene Index browser: an extensible data integration system for gene finding, gene characterization and EST data mining. Bioinformatics 14, 2-13.

2. Boguski M.S., Lowe T.M., Tolstoshev C.M., 1993. dbEST--database for “expressed sequence tags”. Nat Genet 4, 332-333. See also: http://www.ncbi.nlm.nih.gov/dbEST/

3. ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

4. Adams M.D., Kelley J.M., Gocayne J.D., Dubnick M., Polymeropoulos M.H., Xiao H., Merril C.R., Wu A., Olde B., Moreno R.F., 1991. Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252, 1651-1656.
And http://www.genomenewsnetwork.org/resources/timeline/1991_Venter.php

5. http://www.nature.com/nature/journal/v405/n6790/full/405983b0.html

Tuesday, April 7, 2009

Have we been deluged?

In early March, Science published a perspective “Beyond the Data Deluge [1].” Last October, Nature Biotechnology (NBT) published an editorial “Prepare for the Deluge [2].” Did we miss the deluge?

Perhaps we are being deluged now

The articles' general themes were centered around the issue that scientists are having to work with more data than ever before. The NBT editorial focused on Next Generation DNA Sequencing (NGS) and covered many of the issues that we’ve also identified as important [3]. Selected quotes illustrate the challenges:
“Beyond the data management issues, significant computational infrastructure is also required.”

“In this respect, the open SOLiD software development community (http://solidsoftwaretools.com/gf/), which in July recruited two new software companies, is a step in the right direction.”

“For users, it would also be beneficial if data formats and the statistics for comparing performance of the different instruments could be standardized."

And one of my favorites: “At the moment, the cost of bioinformatics support threatens to keep the technology in the hands of the few. And this powerful technology clearly needs to be in the hands of many. If the data problem is not addressed, ABI’s SOLiD, 454’s GS FLX, Illumina’s GAII or any of the other deep sequencing platforms will be destined to sit in their air-conditioned rooms like a Stradivarius without a bow.”

We [biologists] are not alone.

Data-intensive science

The Science article presented a bigger picture. Written by luminaries in computer science and astronomy, Gordon Bell, Tony Hey, and Alex Szalay point out that fields like astronomy and particle physics have experiments generating petabytes of data per year. In bioinformatics, they noted how the extreme heterogeneity of data challenges scientists, and how, in molecular biology, traditional hypothesis-led approaches are being replaced by data-intensive inductive approaches. That is, collect a ton of data, observe patterns, and make discoveries. They go on to discuss how data “born digital” proliferate in files, spreadsheets, or databases, get scattered across hard drives, Web sites, blogs, and wikis, and how managing and curating these digital data is becoming increasingly burdensome. Ever have to move stuff around to make space, or spend too much time locating a file? What about your backups?

The article continues by discussing how data production is outpacing Moore’s Law (which states that computer processors double in speed every 18 months or so), and how new kinds of clustered computing systems are needed to process the increasing flood of data. The late Jim Gray was a visionary in recognizing the need for new computer tools and defined data-intensive science as a fourth research paradigm.

The discussion presented a number of challenges to data-intensive science. These are related to moving and accessing data, the subtleties of organizing and defining data (schemas, ontologies, and semantics), and non-standard uses of technology. One example cited large database systems that hold only pointers to files, rather than the data within the files, making direct analysis of the data impractical. A case study of the astronomy community showed how one discipline has successfully transitioned to data-intensive science.
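The pointer-versus-data distinction is easy to see in miniature. A toy sketch (using Python's built-in sqlite3; the schema and numbers are invented for illustration): when a row stores only a file path, analysis requires a second step of opening every file, whereas storing the values themselves makes the data directly queryable.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Pointer style: the database knows where the data live, not what they are.
db.execute("CREATE TABLE runs_by_pointer (run_id TEXT, data_path TEXT)")
db.execute("INSERT INTO runs_by_pointer VALUES ('run1', '/data/run1.fastq')")

# Value style: the measurements are in the database and queryable in place.
db.execute("CREATE TABLE runs_by_value (run_id TEXT, read_length INTEGER)")
db.executemany("INSERT INTO runs_by_value VALUES (?, ?)",
               [("run1", 36), ("run1", 35), ("run1", 36)])

# Direct analysis works only on the value-style table; the pointer-style
# table would require opening and parsing each referenced file first.
avg = db.execute("SELECT AVG(read_length) FROM runs_by_value").fetchone()[0]
print(avg)
```

This is, of course, a simplification: the article's point is that at petabyte scale the files are too large to load into conventional databases, which is why new approaches are needed.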

In closing, the authors indicate that while data-intensive science requires specialized skills and analysis tools, the research community should be optimistic because more options now exist for accessing storage and computing resources on demand through "cloud services" being built by the IT industry. Cloud services provide high-bandwidth, cost-effective computing capabilities that are beyond the financial scope of universities and national laboratories. According to the authors, this model offers an attractive solution to the scale problem, but requires radical thinking to make optimal use of these services. Bell and colleagues close with the following quote:
“In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization, and cloud computing technologies.”
OK, we're radical

Geospiza has been helping labs and research groups meet current and future bioinformatics challenges through a number of creative approaches. For example, Nature Biotechnology lauded the SOLiD community program; Geospiza was one of those first two companies. Geospiza also actively participates in the community through consortium activities like SEQC to develop standardized approaches to working with NGS systems and data. We have useful ways of analyzing and visualizing information in NGS data, and our BioHDF project is tackling the deeper issues of working with large amounts of extremely heterogeneous data and high-performance computing, so we can go beyond simply referencing files from databases and analyze these data more extensively than is currently possible. Finally, our GeneSifter products have used cloud computing from the very beginning, and we have been successfully extending the cloud computing model to NGS.

References and additional reading

[1] Bell G., Hey T., Szalay A., 2009. Beyond the data deluge. Science 323, 1297-1298.
[2] 2008. Prepare for the deluge. Nat Biotechnol 26, 1099.

[3] FinchTalks:
Journal Club: Focus on Next Gen Sequencing
The IT problem
The bioinformatics problem

BioHDF FinchTalks:
BOSC Abstract and Presentation
Introducing BioHDF
The case for HDF
Genotyping with HDF