“Biological research is generating data at an explosive rate. Nucleotide sequence databases alone are growing at a rate of >210 million base pairs (bp)/year and it has been estimated that if the present rate of growth continues, by the end of the millennium the sequence databases will have grown to 4 billion bp!” [emphasis mine]
Imagine 4 billion bp of data - what would we do with all that?
The article described the now-defunct Merck Gene Index browser, which was developed to make massive numbers of cDNA sequences, also called Expressed Sequence Tags (ESTs), available through a web-based system. The ESTs were being generated through the Merck Gene Index Project, one of many public and private projects focused on collecting EST and full-length cDNA sequences from human and model organism samples. The goal of these projects was to create data resources of transcript sequences for studying gene expression and, later, for finding genes in genomic sequence data. Combined, these projects cost tens of millions of dollars and spanned nearly a decade. They also produced millions of ESTs that are now stored in NCBI’s dbEST database.
And the prediction of GenBank’s growth was close: release 115 of GenBank (December 1999) had 4.6 billion bases. With the most recent release, nine years later, GenBank has grown to over 103 billion bases, and some would say we are just getting started with sequencing DNA.
Today, for a few thousand dollars, a single run of an Illumina, SOLiD, or Helicos instrument can collect more data than all the EST projects combined have ever produced. This raises the question: what would the data look like if dbEST were a Next Generation Sequencing (NGS) experiment?
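To put those two release sizes in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes smooth compound growth between the two releases, which is a simplification, and it treats "over 103 billion" as exactly 103 billion. The function names are my own and purely illustrative.

```python
import math

def annual_growth_rate(start, end, years):
    """Compound annual growth rate implied by two sizes `years` apart."""
    return (end / start) ** (1.0 / years) - 1.0

def doubling_time(rate):
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1.0 + rate)

# Figures quoted in the text; the smooth-growth model is an assumption.
START_BASES = 4.6e9   # GenBank release 115 (December 1999)
END_BASES = 103e9     # most recent release, nine years later
YEARS = 9

rate = annual_growth_rate(START_BASES, END_BASES, YEARS)
print(f"Implied annual growth: {rate:.1%}")
print(f"Implied doubling time: {doubling_time(rate):.1f} years")
```

Under these assumptions the database grows by roughly 40% a year, which works out to doubling about every two years.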
A Brief History of dbEST
Before we compare dbEST to an NGS experiment, we should discuss what dbEST is and how it came to be. In the early days of automated DNA sequencing (ca. 1990) it was realized that cDNA, reverse transcribed from mRNA, could be partially sequenced and that the resulting data could be used to measure which genes are expressed in a cell or tissue. The term EST (Expressed Sequence Tag) was coined to describe the fact that each sequence corresponded to an mRNA molecule and was, in effect, a “tag” for that molecule.
During the early years, EST sequencing was controversial. Many proponents of the genome project feared that collecting ESTs would obviate the need for sequencing the entire genome and that Congress would end funding for the genome project before it was complete. Further controversy arose when NIH decided to patent several of the early brain ESTs. This news created an uproar in the community and led to the famous statement by one Nobel laureate that automated sequencing machines “could be run by monkeys.”
ESTs also led to the founding of dbEST, a valuable resource for quickly assessing the functional aspects of the genome and, later, for identifying and annotating genes within genomic sequences. Today, EST projects continue to be worthwhile endeavors for exploring new organisms before full genome sequencing can be performed.
In the 15+ years since the founding of dbEST, the database has grown from 22,537 entries to approximately 61 million (4/17/2009). The first dbEST report contained ESTs from seven organisms. Today, over 1,700 organisms are represented in dbEST. The species with the highest numbers of ESTs (> 1,000,000) include human, mouse, corn, pig, Arabidopsis, cow, zebrafish, soybean, Xenopus, rice, Ciona, wheat, and rat. More than half of the species, however, have fewer than 10,000 ESTs. Since January of this year, dbEST has grown by more than 2,000,000 entries.
Despite its value, dbEST, like many resources at the NCBI, requires an “expert” level of understanding to be useful. As classical clone-based cDNA sequencing gives way to more cost-effective, higher-throughput methods like NGS, less emphasis will be placed on making this resource useful beyond maintaining the data as an archive that the community can access.
What this means is that when you visit the site, it does not look like much is there. You can get links to the original (closed access) papers and learn how many sequences are present for each organism. Accession numbers or gene names can be used to look up a sequence, and from other pages you can use BLAST to search the resource with a query sequence.
If you want to know more, you have to know how to look for the information and deal with it in the context in which it is presented. For example, I mentioned that dbEST has grown since January. I knew this because I looked at the list of organisms and numbers of sequences then and now, and noticed that more are reported now. However, telling you which organisms have gained sequences, or whether new organisms have been added, would require significant time and effort: either saving the different release reports as they appear or digging through the dbEST FTP site. When we return to the story, we’ll do some "FTP archaeology" and dig through dbEST records to begin characterizing the human ESTs.
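If you had saved two of those summary reports, the comparison itself would be simple. The Python sketch below assumes each report has already been reduced to a mapping from organism name to EST count; the function name, the placeholder organisms, and every count shown are invented for illustration and are not real dbEST figures.

```python
def compare_snapshots(old, new):
    """Compare two {organism: EST count} snapshots.

    Returns the organisms that appear only in the newer snapshot and,
    for organisms present in both, how many entries each one gained.
    """
    added = sorted(set(new) - set(old))
    increased = {
        organism: new[organism] - old[organism]
        for organism in new
        if organism in old and new[organism] > old[organism]
    }
    return added, increased

# Hypothetical snapshots: placeholder names and made-up counts.
january = {"organism_a": 1000, "organism_b": 250}
april = {"organism_a": 1500, "organism_b": 250, "organism_c": 40}

added, increased = compare_snapshots(january, april)
print("New organisms:", added)
print("Organisms with more ESTs:", increased)
```

The hard part in practice is getting two reports into this shape, since only the current summary is shown on the site.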
1. Eckman B.A., Aaronson J.S., Borkowski J.A., Bailey W.J., Elliston K.O., Williamson A.R., Blevins R.A., 1998. The Merck Gene Index browser: an extensible data integration system for gene finding, gene characterization and EST data mining. Bioinformatics 14, 2-13.
2. Boguski M.S., Lowe T.M., Tolstoshev C.M., 1993. dbEST--database for “expressed sequence tags”. Nat Genet 4, 332-333. See also: http://www.ncbi.nlm.nih.gov/dbEST/
4. Adams M.D., Kelley J.M., Gocayne J.D., Dubnick M., Polymeropoulos M.H., Xiao H., Merril C.R., Wu A., Olde B., Moreno R.F., 1991. Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252, 1651-1656.