FinchTalk: January 2010

Sunday, January 31, 2010

Next Generation Sequencing: Deep and Fast!

The Next Generation Sequencing (NGS) field is fast moving. A simple NGS search in PubMed already yields 51 articles for 2010. At this rate there will be over 600 papers this year that use NGS, review NGS progress, or introduce new NGS analysis methods.
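As a rough check on that projection, the arithmetic can be sketched as follows. This is only an illustration: it assumes the 51 hits represent one full month of indexing (the search was run on January 31) and that the monthly rate stays flat. The function name is hypothetical.

```python
def project_annual_count(count_so_far: int, months_elapsed: int) -> int:
    """Linearly extrapolate a year-to-date count to a full 12 months."""
    if months_elapsed <= 0:
        raise ValueError("months_elapsed must be positive")
    return count_so_far * 12 // months_elapsed


# 51 articles found by the end of January (assumed to be one full month).
projected = project_annual_count(51, 1)
print(projected)  # 612
```

A flat rate is a conservative assumption for a field that is growing, so the real total could well be higher.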

The references retrieved from the search are listed below.

1: Marguerat S, Bahler J. RNA-seq: from technology to biology. Cell Mol Life Sci. 2010 Feb;67(4):569-79. Epub 2009 Oct 27. Review. PubMed PMID: 19859660.

2: Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, Howard E, Shendure J, Turner DJ. Target-enrichment strategies for next-generation sequencing. Nat Methods. 2010 Feb;7(2):111-8. PubMed PMID: 20111037.

3: Jex AR, Hall RS, Littlewood DT, Gasser RB. An integrated pipeline for next-generation sequencing and annotation of mitochondrial genomes. Nucleic Acids Res. 2010 Feb;38(2):522-33. Epub 2009 Nov 5. PubMed PMID: 19892826; PubMed Central PMCID: PMC2811008.

4: Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Jan 28. [Epub ahead of print] PubMed PMID: 20110278.

5: Gottlieb B, Beitel LK, Alvarado C, Trifiro MA. Selection and mutation in the "new" genetics: an emerging hypothesis. Hum Genet. 2010 Jan 23. [Epub ahead of print] PubMed PMID: 20099069.

6: Monger WA, Alicai T, Ndunguru J, Kinyua ZM, Potts M, Reeder RH, Miano DW, Adams IP, Boonham N, Glover RH, Smith J. The complete genome sequence of the Tanzanian strain of Cassava brown streak virus and comparison with the Ugandan strain sequence. Arch Virol. 2010 Jan 22. [Epub ahead of print] PubMed PMID: 20094895.

7: Popp C, Dean W, Feng S, Cokus SJ, Andrews S, Pellegrini M, Jacobsen SE, Reik W. Genome-wide erasure of DNA methylation in mouse primordial germ cells is affected by AID deficiency. Nature. 2010 Jan 22. [Epub ahead of print] PubMed PMID: 20098412.

8: Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y, Zhang Z, Zhang Y, Wang W, Li J, Wei F, Li H, Jian M, Li J, Zhang Z, Nielsen R, Li D, Gu W, Yang Z, Xuan Z, Ryder OA, Leung FC, Zhou Y, Cao J, Sun X, Fu Y, Fang X, Guo X, Wang B, Hou R, Shen F, Mu B, Ni P, Lin R, Qian W, Wang G, Yu C, Nie W, Wang J, Wu Z, Liang H, Min J, Wu Q, Cheng S, Ruan J, Wang M, Shi Z, Wen M, Liu B, Ren X, Zheng H, Dong D, Cook K, Shan G, Zhang H, Kosiol C, Xie X, Lu Z, Zheng H, Li Y, Steiner CC, Lam TT, Lin S, Zhang Q, Li G, Tian J, Gong T, Liu H, Zhang D, Fang L, Ye C, Zhang J, Hu W, Xu A, Ren Y, Zhang G, Bruford MW, Li Q, Ma L, Guo Y, An N, Hu Y, Zheng Y, Shi Y, Li Z, Liu Q, Chen Y, Zhao J, Qu N, Zhao S, Tian F, Wang X, Wang H, Xu L, Liu X, Vinar T, Wang Y, Lam TW, Yiu SM, Liu S, Zhang H, Li D, Huang Y, Wang X, Yang G, Jiang Z, Wang J, Qin N, Li L, Li J, Bolund L, Kristiansen K, Wong GK, Olson M, Zhang X, Li S, Yang H, Wang J, Wang J. The sequence and de novo assembly of the giant panda genome. Nature. 2010 Jan 21;463(7279):311-7. Epub 2009 Dec 13. PubMed PMID: 20010809.

9: Falgueras J, Lara AJ, Fernandez-Pozo N, Canton FR, Perez-Trabado G, Claros MG. SeqTrim: a high-throughput pipeline for preprocessing any type of sequence reads. BMC Bioinformatics. 2010 Jan 20;11(1):38. [Epub ahead of print] PubMed PMID: 20089148.

10: Beck AH, Weng Z, Witten DM, Zhu S, Foley JW, Lacroute P, Smith CL, Tibshirani R, van de Rijn M, Sidow A, West RB. 3'-end sequencing for expression quantification (3SEQ) from archival tumor samples. PLoS One. 2010 Jan 19;5(1):e8768. PubMed PMID: 20098735; PubMed Central PMCID: PMC2808244.

11: Webb KM, Rosenthal BM. Deep resequencing of Trichinella spiralis reveals previously un-described single nucleotide polymorphisms and intra-isolate variation within the mitochondrial genome. Infect Genet Evol. 2010 Jan 18. [Epub ahead of print] PubMed PMID: 20083232.

12: Galvez S, Diaz D, Hernandez P, Esteban FJ, Caballero JA, Dorado G. Next-Generation Bioinformatics: Using Many-Core Processor Architecture to Develop a Web Service for Sequence Alignment. Bioinformatics. 2010 Jan 16. [Epub ahead of print] PubMed PMID: 20081221.

13: Gilchrist E, Haughn G. Reverse genetics techniques: engineering loss and gain of gene function in plants. Brief Funct Genomic Proteomic. 2010 Jan 16. [Epub ahead of print] PubMed PMID: 20081218.

14: Lavoie PM, Dube MP. Genetics of bronchopulmonary dysplasia in the age of genomics. Curr Opin Pediatr. 2010 Jan 16. [Epub ahead of print] PubMed PMID: 20087186.

15: Hyten DL, Cannon SB, Song Q, Weeks N, Fickus EW, Shoemaker RC, Specht JE, Farmer AD, May GD, Cregan PB. High-throughput SNP discovery through deep resequencing of a reduced representation library to anchor and orient scaffolds in the soybean whole genome sequence. BMC Genomics. 2010 Jan 15;11(1):38. [Epub ahead of print] PubMed PMID: 20078886.

16: Pleasance ED, Stephens PJ, O'Meara S, McBride DJ, Meynert A, Jones D, Lin ML, Beare D, Lau KW, Greenman C, Varela I, Nik-Zainal S, Davies HR, Ordonez GR, Mudie LJ, Latimer C, Edkins S, Stebbings L, Chen L, Jia M, Leroy C, Marshall J, Menzies A, Butler A, Teague JW, Mangion J, Sun YA, McLaughlin SF, Peckham HE, Tsung EF, Costa GL, Lee CC, Minna JD, Gazdar A, Birney E, Rhodes MD, McKernan KJ, Stratton MR, Futreal PA, Campbell PJ. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature. 2010 Jan 14;463(7278):184-90. Epub 2009 Dec 16. PubMed PMID: 20016488.

17: Myles S, Chia JM, Hurwitz B, Simon C, Zhong GY, Buckler E, Ware D. Rapid genomic characterization of the genus vitis. PLoS One. 2010 Jan 13;5(1):e8219. PubMed PMID: 20084295; PubMed Central PMCID: PMC2805708.

18: Santuari L, Pradervand S, Amiguet-Vercher AM, Thomas J, Dorcey E, Harshman K, Xenarios I, Juenger TE, Hardtke CS. Substantial deletion overlap among divergent Arabidopsis genomes revealed by intersection of short reads and tiling arrays. Genome Biol. 2010 Jan 12;11(1):R4. [Epub ahead of print] PubMed PMID: 20067627.

19: Pool J, Hellmann I, Jensen J, Nielsen R. Population genetics of genome-scale sequence variation. Genome Res. 2010 Jan 12. [Epub ahead of print] PubMed PMID: 20067940.

20: Duncan EL, Brown MA. Mapping genes for osteoporosis-Old dogs and new tricks. Bone. 2010 Jan 11. [Epub ahead of print] PubMed PMID: 20060943.

21: Byrne SL, Durandeau K, Nagy I, Barth S. Identification of ABC transporters from Lolium perenne L. that are regulated by toxic levels of selenium. Planta. 2010 Jan 9. [Epub ahead of print] PubMed PMID: 20063009.

22: Zhao CZ, Xia H, Frazier TP, Yao YY, Bi YP, Li AQ, Li MJ, Li CS, Zhang BH, Wang XJ. Deep sequencing identifies novel and conserved microRNAs in peanut (Arachis hypogaea L.). BMC Plant Biol. 2010 Jan 5;10(1):3. [Epub ahead of print] PubMed PMID: 20047695.

23: Medina M, Sachs JL. Symbiont genomics, our new tangled bank. Genomics. 2010 Jan 4. [Epub ahead of print] PubMed PMID: 20053372.

24: Hittinger CT, Johnston M, Tossberg JT, Rokas A. Leveraging skewed transcript abundance by RNA-Seq to increase the genomic depth of the tree of life. Proc Natl Acad Sci U S A. 2010 Jan 4. [Epub ahead of print] PubMed PMID: 20080632.

25: Volpi L, Roversi G, Colombo EA, Leijsten N, Concolino D, Calabria A, Mencarelli MA, Fimiani M, Macciardi F, Pfundt R, Schoenmakers EF, Larizza L. Targeted next-generation sequencing appoints c16orf57 as clericuzio-type poikiloderma with neutropenia gene. Am J Hum Genet. 2010 Jan;86(1):72-6. Epub 2009 Dec 10. PubMed PMID: 20004881.

26: Stankiewicz P, Lupski JR. Structural variation in the human genome and its role in disease. Annu Rev Med. 2010;61:437-55. PubMed PMID: 20059347.

27: Clement NL, Snell Q, Clement MJ, Hollenhorst PC, Purwar J, Graves BJ, Cairns BR, Johnson WE. The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing. Bioinformatics. 2010 Jan 1;26(1):38-45. Epub 2009 Oct 27. PubMed PMID: 19861355.

28: Arner E, Hayashizaki Y, Daub CO. NGSView: an extensible open source editor for next-generation sequencing data. Bioinformatics. 2010 Jan 1;26(1):125-6. Epub 2009 Oct 24. PubMed PMID: 19855106; PubMed Central PMCID: PMC2796816.

29: Jex AR, Littlewood DT, Gasser RB. Toward next-generation sequencing of mitochondrial genomes--focus on parasitic worms of animals and biotechnological implications. Biotechnol Adv. 2010 Jan-Feb;28(1):151-9. Epub . Review. PubMed PMID: 19913084.

30: Jex AR, Gasser RB. Genetic richness and diversity in Cryptosporidium hominis and C. parvum reveals major knowledge gaps and a need for the application of "next generation" technologies--research review. Biotechnol Adv. 2010 Jan-Feb;28(1):17-26. Epub . Review. PubMed PMID: 19699288.

31: Chou LS, Liu CS, Boese B, Zhang X, Mao R. DNA sequence capture and enrichment by microarray followed by next-generation sequencing for targeted resequencing: neurofibromatosis type 1 gene as a model. Clin Chem. 2010 Jan;56(1):62-72. Epub 2009 Nov 12. PubMed PMID: 19910506.

32: Nagalakshmi U, Waern K, Snyder M. RNA-Seq: a method for comprehensive transcriptome analysis. Curr Protoc Mol Biol. 2010 Jan;Chapter 4:Unit 4.11.1-13. PubMed PMID: 20069539.

33: Fullwood MJ, Han Y, Wei CL, Ruan X, Ruan Y. Chromatin interaction analysis using paired-end tag sequencing. Curr Protoc Mol Biol. 2010 Jan;Chapter 21:Unit 21.15.1-25. PubMed PMID: 20069536.

34: Roukos DH. Novel clinico-genome network modeling for revolutionizing genotype-phenotype-based personalized cancer care. Expert Rev Mol Diagn. 2010 Jan;10(1):33-48. PubMed PMID: 20014921.

35: Liu S, Chen HD, Makarevitch I, Shirmer R, Emrich SJ, Dietrich CR, Barbazuk WB, Springer NM, Schnable PS. High-throughput genetic mapping of mutants via quantitative single nucleotide polymorphism typing. Genetics. 2010 Jan;184(1):19-26. Epub 2009 Nov 2. PubMed PMID: 19884313.

36: Day IN. dbSNP in the detail and copy number complexities. Hum Mutat. 2010 Jan;31(1):2-4. PubMed PMID: 20024941.

37: Hamady M, Lozupone C, Knight R. Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. ISME J. 2010 Jan;4(1):17-27. Epub 2009 Aug 27. PubMed PMID: 19710709; PubMed Central PMCID: PMC2797552.

38: Aparicio SA, Huntsman DG. Does massively parallel DNA resequencing signify the end of histopathology as we know it? J Pathol. 2010 Jan;220(2):307-15. PubMed PMID: 19921711.

39: Bell DW. Our changing view of the genomic landscape of cancer. J Pathol. 2010 Jan;220(2):231-43. PubMed PMID: 19918804.

40: Nobuta K, McCormick K, Nakano M, Meyers BC. Bioinformatics analysis of small RNAs in plants using next generation sequencing technologies. Methods Mol Biol. 2010;592:89-106. PubMed PMID: 19802591.

41: Salmon A, Ainouche ML. Polyploidy and DNA methylation: new tools available. Mol Ecol. 2010 Jan;19(2):213-5. PubMed PMID: 20078770.

42: Gathering clouds and a sequencing storm: why cloud computing could broaden community access to next-generation sequencing. Nat Biotechnol. 2010 Jan;28(1):1. PubMed PMID: 20062015.

43: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010 Jan;11(1):31-46. Epub 2009 Dec 8. Review. PubMed PMID: 19997069.

44: Northcott PA, Rutka JT, Taylor MD. Genomics of medulloblastoma: from Giemsa-banding to next-generation sequencing in 20 years. Neurosurg Focus. 2010 Jan;28(1):E6. PubMed PMID: 20043721.

45: Yang JH, Shao P, Zhou H, Chen YQ, Qu LH. deepBase: a database for deeply annotating and mining deep sequencing data. Nucleic Acids Res. 2010 Jan;38(Database issue):D123-30. Epub 2009 Dec 4. PubMed PMID: 19966272; PubMed Central PMCID: PMC2808990.

46: Shumway M, Cochrane G, Sugawara H. Archiving next generation sequencing data. Nucleic Acids Res. 2010 Jan;38(Database issue):D870-1. Epub 2009 Dec 3. PubMed PMID: 19965774; PubMed Central PMCID: PMC2808927.

47: Brooksbank C, Cameron G, Thornton J. The European Bioinformatics Institute's data resources. Nucleic Acids Res. 2010 Jan;38(Database issue):D17-25. Epub 2009 Nov 24. PubMed PMID: 19934258; PubMed Central PMCID: PMC2808956.

48: Kim P, Yoon S, Kim N, Lee S, Ko M, Lee H, Kang H, Kim J, Lee S. ChimerDB 2.0--a knowledgebase for fusion genes updated. Nucleic Acids Res. 2010 Jan;38(Database issue):D81-5. Epub 2009 Nov 11. PubMed PMID: 19906715; PubMed Central PMCID: PMC2808913.

49: Leinonen R, Akhtar R, Birney E, Bonfield J, Bower L, Corbett M, Cheng Y, Demiralp F, Faruque N, Goodgame N, Gibson R, Hoad G, Hunter C, Jang M, Leonard S, Lin Q, Lopez R, Maguire M, McWilliam H, Plaister S, Radhakrishnan R, Sobhany S, Slater G, Ten Hoopen P, Valentin F, Vaughan R, Zalunin V, Zerbino D, Cochrane G. Improvements to services at the European Nucleotide Archive. Nucleic Acids Res. 2010 Jan;38(Database issue):D39-45. Epub 2009 Nov 11. PubMed PMID: 19906712; PubMed Central PMCID: PMC2808951.

50: Kaminuma E, Mashima J, Kodama Y, Gojobori T, Ogasawara O, Okubo K, Takagi T, Nakamura Y. DDBJ launches a new archive database with analytical tools for next-generation sequence data. Nucleic Acids Res. 2010 Jan;38(Database issue):D33-8. Epub 2009 Oct 22. PubMed PMID: 19850725; PubMed Central PMCID: PMC2808917.

51: Jing H. [Advances in approaches for the quantitative detection of microRNAs.]. Yi Chuan. 2010 Jan;32(1):31-40. Chinese. PubMed PMID: 20085883.

Monday, January 25, 2010

Grant Opportunities for Next Generation DNA Sequencing

As we close the first month of 2010, it is time to get your pencils sharpened and submit proposals for new shared instruments.

The National Center for Research Resources (NCRR) announced that it has $43M to fund equipment purchases in 2011. With this money, NCRR expects to make approximately 125 new awards for instruments that cost at least $100,000 but less than $600,000. NCRR proposals are due March 23, 2010.

In addition to NCRR, the National Science Foundation (NSF), through its Major Research Instrumentation (MRI) program, has $90M to make 150 awards of between $100,000 and $4M for shared instrumentation. MRI proposals are due April 21, 2010.
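For a sense of scale, the average award implied by each program's figures can be worked out with a short sketch. The budgets and award counts come from the announcements above; the assumption that every dollar goes to awards, and the variable and function names, are mine and purely illustrative.

```python
# Budgets and expected award counts quoted in the announcements above.
programs = {
    "NCRR": {"budget": 43_000_000, "awards": 125},
    "NSF MRI": {"budget": 90_000_000, "awards": 150},
}


def average_award(budget: int, awards: int) -> float:
    """Mean award size, assuming the whole budget is spent on awards."""
    return budget / awards


averages = {name: average_award(**figures) for name, figures in programs.items()}

for name, avg in averages.items():
    print(f"{name}: about ${avg:,.0f} per award")
# NCRR: about $344,000 per award
# NSF MRI: about $600,000 per award
```

Both averages fall inside the stated ranges, so a request near the middle of the range is in line with what each program expects to fund.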

Remember, when preparing proposals, a sound informatics plan will make your application stand out. Contact us if you’d like more information.

Monday, January 18, 2010

Systems Biology with HDF5

As many are aware, Geospiza and The HDF Group are collaborating to extend HDF (Hierarchical Data Format) technologies to support the data management needs of high performance computing applications in genomics. As we do this work, others are also adopting HDF5 as a data storage technology to work with different kinds of biological data.

The Association for Computing Machinery (ACM) recently published an article, "Unifying Biological Image Formats with HDF5," that argues for using HDF5 and HDF tools as a common framework for working with image files. This article is worth reading for several reasons.

First, it provides a nice introduction and background to HDF5, its origins, and movement towards becoming an ISO standard. HDF5's technical features are also included in this discussion.

Next, a brief history of the imaging community is covered to share how X-ray crystallographers and electron and optical microscopists had all independently considered HDF5 as a framework for their next-generation image file formats. Through this discussion, the challenges that have been identified within the imaging community are listed.

As in genomics, the amount of data being collected is ever increasing, current formats are inflexible and difficult to adapt to future modalities and dimensionality, and the nonarchival quality of data undermines long-term value. That is, current data typically lack sufficient metadata about their origins and experiments to be useful over the long term.

The article goes on to make the point that current challenges with image data could be addressed if the community adopts an existing format that can support both generic and specialized data formats and meet a set of common requirements related to performance, interoperability, and archiving. Examples of how HDF5 meets these requirements are included. Briefly, HDF5's data caching can be used to overcome computation bottlenecks related to the fact that image sizes are exceeding RAM capacity. Interoperability issues can be addressed through HDF5's ability to store multiple metadata schemas in flexible ways. And, because HDF5 is self describing, data stored in HDF5 can be better preserved.

Finally, a barrier to moving to a new technology is supporting legacy applications that may be costly to replace. Thus, the article closes with a creative proposal for supporting legacy software applications and recommendations for future development. HDF5 files could support legacy software applications if they were able to present the data, stored within the HDF5 file, as the collection of directories and files required by the legacy application. This could be accomplished by developing an abstraction layer that could interact with FUSE (Filesystem in Userspace) and essentially mount the HDF5 file as a virtual file system. Such a scenario is only possible because data are stored in HDF5 in a general way that can be further abstracted and presented in multiple specific ways.
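To make the abstraction-layer idea concrete, here is a minimal sketch of presenting a hierarchical container as directories and files. It is not the article's implementation. A nested dictionary stands in for an HDF5 file (dictionaries play the role of groups, byte strings the role of datasets), no real HDF5 or FUSE calls are made, and the class, method, and file names are all hypothetical.

```python
from typing import Dict, List, Union

# Assumption: dicts model HDF5 groups and bytes values model datasets.
Node = Union[Dict[str, "Node"], bytes]


class VirtualTreeView:
    """Expose a nested container through path-based directory and file calls."""

    def __init__(self, root: Dict[str, Node]):
        self._root = root

    def _resolve(self, path: str) -> Node:
        """Walk the container one path component at a time."""
        node: Node = self._root
        for part in path.strip("/").split("/"):
            if not part:
                continue  # handles "/" and doubled slashes
            if not isinstance(node, dict) or part not in node:
                raise FileNotFoundError(path)
            node = node[part]
        return node

    def listdir(self, path: str = "/") -> List[str]:
        """List a group's members as if they were directory entries."""
        node = self._resolve(path)
        if not isinstance(node, dict):
            raise NotADirectoryError(path)
        return sorted(node)

    def read(self, path: str) -> bytes:
        """Return a dataset's bytes as if reading a file."""
        node = self._resolve(path)
        if isinstance(node, dict):
            raise IsADirectoryError(path)
        return node


# Hypothetical layout that a legacy imaging tool might expect on disk.
container = {
    "experiment1": {
        "image001.tif": b"II*\x00",
        "metadata.xml": b"<meta/>",
    }
}
view = VirtualTreeView(container)
print(view.listdir("/"))             # ['experiment1']
print(view.listdir("/experiment1"))  # ['image001.tif', 'metadata.xml']
```

A real FUSE layer would forward the operating system's directory-listing and read requests to methods like these, so a legacy application would see ordinary files while the data stayed inside a single container.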

While this article focuses on issues related to image formats, there are many parallels that the genomics and Next Generation Sequencing communities should pay attention to. If you are a bioinformatics software developer or run bioinformatics projects, you should put this paper on your must-read list.

Wednesday, January 13, 2010

2010 sequencing starts in style

Next Generation Sequencing (NGS) is a hot topic. As we kick off 2010, many themes continue. Data throughput is increasing, sequencing costs are decreasing, and NGS still requires extensive informatics support.

Throughput up, costs down

As sequencing throughput increases, the costs for collecting sequencing data decrease. Illumina is setting the pace for 2010 by announcing its latest sequencing instrument, the HiSeq2000. Illumina’s press release, news reports, and the blogosphere enthusiastically report on the instrument’s fivefold increase in data throughput and ability to sequence an entire human genome in about one week for about $10,000.

What about the informatics?

This month’s review in Nature Reviews Genetics (NRG) and editorial in Nature Biotechnology (NBT) both claim that the most significant NGS challenge continues to be dealing with the data. As pointed out in the NBT editorial, it is quite possible that the community will produce more sequence data this year than has been cumulatively produced in the past 10 years. The HiSeq, developments that will be announced by Applied Biosystems in February, and the coming single molecule sequencers support this. The editorial further makes the point that genome centers have the computing infrastructure to deal with the data, but the larger community of researchers, who could benefit from these technologies, does not. A similar observation was made at the end of the NRG review, which pointed out that costs associated with downstream handling and processing of the data will possibly equal or exceed data collection costs.

The significance of the informatics challenge is that wide adoption of NGS technologies assumes that we have usable solutions for working with the data. These solutions go beyond simply getting a computer cluster with a sequencing instrument. To be useful, that cluster needs to reside in an adequately air-conditioned room and be operated by people who know how to work with cluster hardware and software and can also optimize networks to manage the flow of data. Other individuals are needed who can write programs and scripts to process the data, work with multiple database technologies, and develop scalable user interfaces to visualize and navigate through the results and compare information between multiple samples and experiments.

The conversation about the informatics problem began with the introduction of NGS technologies. In 2008, Nature Methods (July) and NBT (October) published editorials speaking to the coming challenges. Later in 2009, Science published a new article about data intensive science. Previous FinchTalks have discussed the articles and their significance, and the theme has remained the same: both the access to computing technologies and the skills needed to use the data are unavailable to the large numbers of researchers who need to use these technologies to remain competitive.

There is a solution

One solution to the informatics challenge created by NGS, and other data intensive technologies, is to make use of the immense Internet-based computing infrastructure that has been created by companies like Amazon, Google, Yahoo, and others. Also called Cloud Computing, Internet-based services remove many of the hardware and infrastructure barriers for utilizing high performance computing and storage technology. This message was delivered by the 2010 NBT kickoff editorial and accompanying news feature, along with the next important message that software solutions also need to be adapted to cloud environments. Here the editorial, like many other descriptions of NGS informatics needs, falls short in that it focuses only on alignment programs. Simply adapting alignment algorithms using technologies like Hadoop to employ Cloud-based high performance computing clusters is not a sufficient solution.

Aligning billions of reads to reference data quickly and accurately is clearly important. However, it is just the first step of a complex analysis process. The subsequent steps of analyzing the billions of alignments to filter artifacts, identify true and new variation between sequences, discover alternative splice forms in transcripts, and compare data between samples are even more challenging.
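A toy sketch can show what some of those downstream steps involve once alignment is done. Everything here is an assumption made for illustration: the record layout, the thresholds, and the sample names are hypothetical, and real pipelines use far more sophisticated statistics than a simple support count.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Observation(NamedTuple):
    """One aligned base; a toy stand-in for a real alignment record."""
    sample: str
    position: int
    ref_base: str
    read_base: str
    mapq: int  # mapping quality reported by the aligner


Site = Tuple[int, str]  # (position, alternate base)


def candidate_variants(
    observations: Iterable[Observation],
    min_mapq: int = 20,    # assumed quality cutoff
    min_support: int = 2,  # assumed minimum number of supporting reads
) -> Dict[str, Set[Site]]:
    """Filter weak alignments, then keep mismatches seen in enough reads."""
    support: Dict[str, Counter] = defaultdict(Counter)
    for obs in observations:
        if obs.mapq < min_mapq:
            continue  # step 1: drop likely mapping artifacts
        if obs.read_base != obs.ref_base:
            support[obs.sample][(obs.position, obs.read_base)] += 1
    # step 2: report only sites with enough independent support
    return {
        sample: {site for site, n in counts.items() if n >= min_support}
        for sample, counts in support.items()
    }


observations = [
    Observation("tumor", 101, "A", "G", 60),
    Observation("tumor", 101, "A", "G", 55),
    Observation("tumor", 250, "C", "T", 5),   # low quality, filtered out
    Observation("tumor", 250, "C", "T", 40),  # only one good read, not enough
    Observation("normal", 101, "A", "A", 60),
    Observation("normal", 300, "T", "C", 50),
    Observation("normal", 300, "T", "C", 45),
]

variants = candidate_variants(observations)
# step 3: compare samples, here sites seen in "tumor" but not in "normal"
tumor_only = variants.get("tumor", set()) - variants.get("normal", set())
print(tumor_only)  # {(101, 'G')}
```

Even this small example needs filtering, aggregation, and cross-sample comparison, none of which is provided by the aligner itself.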

Fortunately, Geospiza understands the problem well. As our tagline, From Samples to Results™, suggests, our lab and analysis systems focus on solving a complete set of problems that need to be addressed in order to do good science with NGS and other genetic analysis technologies.

Perhaps this is why we were the only software provider discussed in the NBT news feature, “Up in a cloud.”