Thursday, July 31, 2008

Questions from our mailbag: How do I cite FinchTV?

One of the questions that appears in our mailbox from time to time concerns citing FinchTV or other Geospiza products. A quick search with Google Scholar for "FinchTV" finds 42 examples where FinchTV was cited in research publications. Most of the citations seem to follow the same conventions.

We recommend citing FinchTV as you would any other experimental software tool, instrument, or reagent. The citation should include the version of the program, the company, the location, and the web site. Other Geospiza products (FinchLab, Finch Suite, and iFinch) may be cited in a similar manner.

In our case, a citation would most likely read:

FinchTV 1.4.0 (Geospiza, Inc.; Seattle, WA, USA; http://www.geospiza.com)

If you're not sure which version of FinchTV you're using, open the About menu. The version number will appear on the page.

It would also be a good idea to check with the journal where you plan to submit the article. Most journals have a set of instructions for authors where they provide example citations.

Wednesday, July 30, 2008

BioHDF at BOSC

The scale of Next Gen sequencing is only going to increase; hence, we need to fundamentally change the way we work with Next Gen data. New software systems with scalable data models, APIs, software tools, and viewers are needed to support the very large datasets used by the applications that analyze Next Gen DNA sequence data.

That was the theme of a talk I presented at BOSC (the Bioinformatics Open Source Conference), which preceded ISMB (Intelligent Systems for Molecular Biology) in Toronto, Canada, on July 19th. You can get the slides from the BOSC site. At the same time, we posted an entry on Genographia, a next-generation genomics community web site devoted to Next Gen sequencing discussions and idea sharing. The key points are summarized below.

Motivation

The BioHDF project is motivated by the fact that the next and future generations of data collection technologies, like DNA sequencing, are creating ever-increasing amounts of data. Getting meaningful information from these data requires that multiple programs be used in complex processes. Current practices for working with these data create many kinds of challenges, ranging from managing large numbers of files and formats to having the computational power and bandwidth to make calculations and move data around. These practices have a high cost in terms of storage, CPU, and bandwidth efficiency. In addition, they require significant human effort to understand idiosyncratic program behavior and output formats.

Is there a better way?

Many would agree that if we could reduce the number of file formats, avoid data duplication, and improve how we access and process data, we could develop better performing and more interoperable applications. Doing so requires improved ways of storing data and making it accessible to programs. For a number of years we have thought about how these goals might be accomplished and looked to other data-intensive fields to see how others have solved these problems. Our search ended when we found HDF (Hierarchical Data Format), a standard file format and library used in the physical and earth sciences.

BioHDF

HDF5 can be used in many kinds of bioinformatics applications. For specialized areas, like DNA sequencing, domain-specific extensions will be needed. BioHDF is about developing those extensions, through community support, to create a file format and accompanying library of software functions that are needed to build the scalable software applications of the future. More will follow; if you are interested, contact me: todd at geospiza.com.
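To make the idea concrete, here is a minimal sketch of what keeping sequencing data in HDF5 can look like, written with the general-purpose h5py library rather than BioHDF itself. The group and dataset names are illustrative assumptions, not the BioHDF schema.

import h5py
import numpy as np

# Minimal sketch: keep reads, qualities, and run metadata together in one
# HDF5 file instead of scattering them across FASTA/QUAL/text files.
# Group and dataset names here are illustrative, not the BioHDF schema.
reads = ["ACGTACGTAC", "TTGCAGGTCA"]
quals = np.array([[30, 31, 28, 33, 30, 29, 27, 35, 32, 30],
                  [20, 22, 25, 30, 31, 28, 26, 24, 33, 35]], dtype=np.uint8)

with h5py.File("run_example.h5", "w") as f:
    grp = f.create_group("sequencing/run_example")
    str_dt = h5py.special_dtype(vlen=str)        # variable-length strings
    grp.create_dataset("reads", data=reads, dtype=str_dt)
    grp.create_dataset("qualities", data=quals)  # per-base quality scores
    grp.attrs["instrument"] = "example"          # run-level metadata as attributes
    grp.attrs["read_length"] = 10

# Any downstream program can open the same file and slice out just what it needs.
with h5py.File("run_example.h5", "r") as f:
    print(f["sequencing/run_example/reads"][0])
    print(f["sequencing/run_example/qualities"][0, :5])

One file holds the reads, the qualities, and the run metadata, and each program reads only the slices it needs; that is the kind of consolidation BioHDF aims to standardize for sequencing data.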

Monday, July 14, 2008

Maq Attack

Maq (Mapping and Assembly with Quality) is an algorithm, developed at the Sanger Institute, for assembling Next Gen reads onto a reference sequence. Since Maq is widely used for working with Next Generation DNA sequence data, we chose to include support for it in our upcoming release of FinchLab. In this post, we will discuss integrating secondary analysis algorithms like Maq with the primary analysis and workflows in FinchLab.

Improving laboratory processes through immediate feedback

The cost to run Next Generation DNA sequencing instruments and the volume of data produced make it important for labs to be able to monitor their processes in real time. In the last post, I discussed how labs can get performance data and accomplish scientific goals during the three stages of data analysis. To quickly review: primary data analysis involves converting image data to sequence data. Secondary data analysis involves aligning the sequences from the primary data analysis to reference data to create the data sets that are used to develop scientific information; an example of a secondary analysis step would be assembling reads into contigs when new genomes are sequenced. Unlike the first two stages, where much of the data is used to detect errors and measure laboratory performance, the last stage is focused on the science. In tertiary data analysis, genomes are annotated and data sets are compared. Thus the tertiary analyses are often the most important in terms of gaining new insights, but the data used in this phase must be vetted first: it must be high quality and free from systematic errors.

The companies producing Next Gen systems recognize the need to automate primary and secondary analysis. Consequently, they provide some basic algorithms along with the Next Gen instruments. Although these tools can help a lab get started, many labs have found that significant software development is needed on top of the starting tools if they are to fully automate their operation, translate output files into meaningful summaries, and give users easy access to the data. The starter kits from the instrument vendors can also be difficult to adapt when performing other kinds of experiments. Working with Next Gen systems typically means that you will have to deal with a lot of disconnected software, a lack of user interfaces, and diverse new choices of algorithms when it comes to getting your work done.

FinchLab and Maq in an integrated system

The Geospiza FinchLab integrates analytical algorithms such as Maq into a complete system that encompasses all the steps in genetic analysis. Our Samples to Results platform provides flexible data entry interfaces to track sample metadata. The laboratory information management system is user-configurable, so any kind of genetic analysis procedure can be run and tracked; most importantly, it provides tight linkage between samples, lab work, and the resulting data. This system makes it easy to transition high quality primary results to secondary data analysis.

One of the challenges with Next Gen sequencing has been choosing an algorithm for secondary analysis. Secondary data analysis needs to be adaptable to different technology platforms and algorithms for specialized sequencing applications. FinchLab meets this need because it can accommodate multiple algorithms for secondary and tertiary analysis. One of these algorithms is Maq. Maq is attractive because it can be used in diverse applications where reads are aligned to a reference sequence. Among these are transcriptomics (Tag Profiling, EST analysis, small RNA discovery), promoter mapping (ChIP-Seq, DNase hypersensitivity), methylation analysis, and variation analysis (SNP, CNV). Maq offers a rich set of output files, so it can be used to quickly provide an overview of your data and help you verify that your experiment is on track before you invest serious time in tertiary work. Finally, Maq is actively developed and improved, and it is open source, so it is easy to access and use regardless of affiliation.
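For readers who want a feel for what a Maq run involves outside of FinchLab, here is a minimal sketch of the standard command-line steps, driven from Python. File names are placeholders, and the exact subcommands and options may differ between Maq versions, so treat this as an outline rather than a recipe.

import subprocess

def run(cmd):
    # Helper: run a Maq step and fail loudly if it returns an error.
    print("running:", " ".join(cmd))
    subprocess.check_call(cmd)

# Convert the reference and the reads into Maq's binary formats.
run(["maq", "fasta2bfa", "reference.fasta", "reference.bfa"])
run(["maq", "fastq2bfq", "reads.fastq", "reads.bfq"])

# Align the reads to the reference; the .map file holds the alignments.
run(["maq", "map", "alignment.map", "reference.bfa", "reads.bfq"])

# Text view of the alignments for quick inspection or downstream parsing.
with open("alignment.txt", "w") as out:
    subprocess.check_call(["maq", "mapview", "alignment.map"], stdout=out)

# Build the consensus and call SNPs for variation analysis.
run(["maq", "assemble", "consensus.cns", "reference.bfa", "alignment.map"])
with open("snps.txt", "w") as out:
    subprocess.check_call(["maq", "cns2snp", "consensus.cns"], stdout=out)

Each step produces files that feed the next one, which is exactly the kind of bookkeeping FinchLab and RAS take care of automatically.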

Maq and other algorithms are integrated into FinchLab through the FinchLab Remote Analysis Server (RAS). RAS is a lightweight job tracking system that can be configured to run any kind of program in different computing environments. RAS communicates with FinchLab to get the data and return the results. Data analyses are run in FinchLab by selecting the sequence file(s), clicking a link to go to a page where the analysis method(s) and reference data sets are selected, and then clicking a button to start the work. RAS tracks the details of data processing and sends information back to FinchLab so that you can always see what is happening through the web interface.
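As a purely hypothetical illustration of the pattern RAS follows (the endpoint paths, payloads, and helper names below are invented for illustration and are not the actual FinchLab/RAS interfaces), a remote job essentially pulls its inputs, runs the configured program, and reports its state back to the web application:

import subprocess
import urllib.request

# Hypothetical sketch only: URLs and names are illustrative, not the RAS API.
FINCHLAB_URL = "https://finchlab.example.org"

def run_remote_job(job_id, command, input_url):
    # 1. Pull the input data that the web application staged for this job.
    urllib.request.urlretrieve(input_url, "input.dat")

    # 2. Run the analysis program, capturing a log for troubleshooting.
    with open("job_%s.log" % job_id, "w") as log:
        status = subprocess.call(command, stdout=log, stderr=log)

    # 3. Report success or failure so the web interface can show progress.
    state = "complete" if status == 0 else "failed"
    urllib.request.urlopen("%s/ras/jobs/%s?state=%s" % (FINCHLAB_URL, job_id, state))
    return state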

A basic FinchLab system includes the RAS and pipelines for running Maq in two ways. The first is Tag Profiling and Expression Analysis; in this operation, Maq output files are converted to gene lists with links to drill down into the data and to NCBI references. The second option is to use Maq in a general analysis procedure where all the output files are made available. In the coming months, new tools will convert more of these files into output that can be added to genome browsers and other tertiary analysis systems.
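As a rough sketch of the kind of conversion such a pipeline performs (assuming the text layout produced by `maq mapview`, with the reference sequence name in the second column; verify this against your Maq version), one could tally aligned reads per reference sequence like this:

from collections import defaultdict

# Rough sketch: count aligned reads per reference sequence from the text
# output of `maq mapview`. The column layout is an assumption taken from the
# Maq documentation, not from the FinchLab pipeline itself.
counts = defaultdict(int)
with open("alignment.txt") as mapview:
    for line in mapview:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        counts[fields[1]] += 1   # fields[1] = reference name (e.g. a transcript tag)

# Emit a simple tag/gene list, highest read count first.
for reference, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print("%s\t%d" % (reference, n))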

A final strength of RAS is that it produces different kinds of log files to track potential errors. These files are extremely valuable in troubleshooting and fixing problems. Since Next Gen technology is new and still in constant flux, you can be certain that unexpected issues will arise. Keeping the research on track is easier when informative RAS logging and reports help to diagnose and resolve issues quickly. Not only can FinchLab help with Next Gen assays and help solve those unexpected Next Gen problems, but multiple Next Gen algorithms can also be integrated into FinchLab to complete the story.