
Thursday, February 28, 2013

ABRF 2013

The annual Association of Biomolecular Resource Facilities (ABRF) conference begins this weekend (March 2 - 5).  We [PerkinElmer] will be busy at the conference as participants and as a vendor supporting this great organization and our many customers. From client presentations to our own work, we will share our latest and greatest.

Highlights:

Saturday: 3/2 "Breaking the Data Analysis Bottleneck: Solutions That Work for RNA and Exome Sequencing." Rebecca Laborde, Mayo Clinic, will present on how the teams she works with use GeneSifter for their NGS data analysis. This is part of Satellite Workshop 1: Applications of NGS.

Monday: 3/4 "Oyster Transcriptome Analysis by Next Gen Sequencing." Natalia Reyero, Genetics and Developmental Biology Center, NHLBI, will give a presentation based on her award-nominated poster (below) during the Genomics Research Group (RG5) session. GeneSifter had a role in the data analysis.

Saturday and Monday Posters:

#7 "Identifying Mutations in Transcriptionally Active Regions on Genomes Using Next Generation Sequencing." Eric Olson, PerkinElmer, presents ways in which RNA-seq can be used to define transcripts to identify functional mutations in organisms that have sparsely annotated reference genomes. 

#11 "What Does It Take to Identify the Signal from the Noise in Molecular Profiling of Tumors?" Eric Olson, PerkinElmer, presents ways to use RNA sequencing and bioinformatic approaches to filter the vast numbers of variants observed in DNA sequence data obtained from tumors down to a manageable number that are most likely to be the drivers of tumor growth.

Award Nominee
#119 "Elucidating the Effects of the Deepwater Horizon Oil Spill on the Atlantic Oyster Using Global Transcriptome Analysis." Natalia Reyero, Genetics and Developmental Biology Center, NHLBI. If you are interested in learning about the aftermath of the Gulf of Mexico oil spill, you will find Natalia's work interesting.

And that's not all. The booth will be hopping. We will have meet-the-speaker opportunities on Sunday and Tuesday, as well as many demos of our GeneSifter Analysis and LIMS products.  PerkinElmer Informatics staff will show new features in PerkinElmer's Electronic Laboratory Notebook and other products, and Caliper and Chemagen reps will be on hand to talk about the great things we do for sample prep.

Check us out at Booth 522 to get schedules and see what's new.

Saturday, February 9, 2013

Genomics Genealogy Evolves

The ways massively parallel DNA sequencing can be used to measure biological systems are limited only by imagination. In science, imagination is an abundant resource.

The November 2012 edition of Nature Biotechnology (NBT) focused on advances in DNA sequencing. It included a review by Jay Shendure and Erez Lieberman Aiden entitled “The Expanding Scope of DNA Sequencing” [1], in which the authors provided a great overview of current and future sequencing-based assay methods with an interesting technical twist. It also provided an opportunity to update a previous FinchTalk.

As DNA sequencing moved from determining the order of nucleotide bases in single genes to the factory-style efforts of the first genomes, it was limited to measuring ensembles of molecules derived from single clones or PCR amplicons as composite sequences. Massively parallel sequencing changed the game because each molecule in a sample is sequenced independently. This discontinuous advance resulted in a massive increase in throughput that created a brief, yet significant, deviation from the price-performance curve predicted by Moore’s law. It also created a level of resolution at which data can be collected from populations of sequences and their variation measured quantitatively, turning DNA sequencing into a powerful assay platform. While this was quickly recognized [2], reducing ideas to practice would take a few more years.

Sequencing applications fall into three main branches: De Novo, Functional Genomics, and Genetics (figure below). The De Novo, or Exploratory, branch contains three subbranches: new genomes, meta-genomes, and meta-transcriptomes. Genetics, or variation, assays form another main branch of the tree. Genomic sequences are compared within and between populations, individuals, or tissues and cells, with the goal of predicting a phenotype from differences between sequences. Genetic assays can focus on single nucleotide variations, copy number changes, or structural differences. Determining inherited epigenetic modifications is another form of genetic assay.

Understanding the relationship between genotype and phenotype, however, requires that we understand phenotype in sufficient detail. For this to happen, traditional analog measurements such as height, weight, blood pressure, and disease descriptions need to be replaced with quantitative measurements at the DNA, RNA, protein, metabolism, and other levels. Within each set of “omes” we need to understand molecular interactions and how environmental factors such as diet, chemicals, and microorganisms affect these interactions, positively or negatively, and through modification of the epigenome. Hence, the Functional Genomics branch is the fastest growing.

New assays since 2010 are highlighted in color and underlined text.  See [1] for descriptions.
Functional Genomics experiments can be classified into five groups: Regulation, Epi-genomics, Expression, Deep Protein Mutagenesis, and Gene Disruption. Each group can be further divided into specific assay groups (DGE, RNA-Seq, small RNA, etc.) that can be further subdivided into specialized procedures (e.g., RNA-Seq with strandedness preserved). When experiments are refined and made reproducible, they become assays with sequence-based readouts.

In the paper, Shendure and Aiden describe 24 different assays. Citing an analogy to language, where "Wilhelm von Humboldt described language as a system that makes ‘infinite use of finite means’: despite a relatively small number of words and combinatorial rules, it is possible to express an infinite range of ideas," the authors presented assay evolution as an assemblage of a small number of experimental designs. This model is not limited to language. In biochemistry, a small number of protein domains and effector molecules are combined, and slightly modified, in different ways to create a diverse array of enzymes, receptors, transcription factors, and signaling cascades.

Subway map from [1]*. 
Shendure and Aiden go on to show how the technical domains can be combined to form new kinds of assays using a subway framework, where one enters via a general approach (comparison, perturbation, or variation) and reaches the final sequencing destination. Stations along the way are specific techniques organized by experimental motifs, including cell extraction, nucleic acid extraction, indirect targeting, exploiting proximity, biochemical transformation, and direct DNA or RNA targeting.

The review focused on the bench and made only brief reference to informatics issues as part of the "rate limiters" of next-generation sequencing experiments. It is important to note that each assay will have its own data analysis methodology. That may seem daunting. However, like the assays, the specialized informatics pipelines and other analyses can also be developed from a common set of building blocks. At Geospiza we are very familiar with these building blocks and how they can be assembled to analyze the data from many kinds of assays. As a result, the GeneSifter system is the most comprehensive in terms of its capabilities to support a large matrix of assays, analytical procedures, and species. If you are considering adding next-generation sequencing to your research, or your current informatics is limiting your ability to publish, check out GeneSifter.

1. Shendure, J., and Aiden, E. (2012). The expanding scope of DNA sequencing Nature Biotechnology, 30 (11), 1084-1094 DOI: 10.1038/nbt.2421

2. Kahvejian A, Quackenbush J, and Thompson JF (2008). What would you do if you could sequence everything? Nature biotechnology, 26 (10), 1125-33 PMID: 18846086

* Rights obtained from Rightslink number 3084971224414

Sunday, January 27, 2013

Sneak Peek: Identifying Mutations in Expressed Regions of Genomes Using NGS

Join us Wednesday, January 30th at 1 PM (EST), 10 AM (PST) to learn how to use NGS to identify mutations in expressed regions of genomes.

Abstract:

The pace at which genome references are being generated for plant and animal species is rapidly increasing with Next Generation Sequencing technologies. While this is a major step forward for researchers studying species that previously did not have sequenced genomes, it is only the beginning of the process of defining the biology underlying the genome. As long as a reference is available, DNA variants can be readily identified on a genome-wide scale, often producing lists of hundreds of thousands or even millions of variants. Frequently, the variants that occur in expressed genes are of the most interest; however, if the annotation defining where genes exist within a genome is unavailable or poorly defined, identifying which mutations might affect protein coding may not be possible. To address this challenge, we will describe a method whereby RNA-Seq can be used to identify transcriptionally active regions, creating transcript annotation for un-annotated genomes or enhancing the existing annotation of any organism. This annotation can then be used in conjunction with whole genome sequencing to annotate variants according to whether they fall within transcriptionally active regions, thus facilitating the identification of mutations in a larger repertoire of expressed regions of a genome.
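
The webinar will cover the full method; as a rough illustration of only the final annotation step, the sketch below checks whether variant positions fall inside transcriptionally active regions. The regions, variants, and coordinates are invented, and a real pipeline would read BED and VCF files rather than hard-coded tuples.

from bisect import bisect_right
from collections import defaultdict

def build_index(regions):
    """Index transcriptionally active regions (chrom, start, end) per chromosome."""
    index = defaultdict(list)
    for chrom, start, end in regions:
        index[chrom].append((start, end))
    for chrom in index:
        index[chrom].sort()
    return index

def in_active_region(index, chrom, pos):
    """Return True if a variant position falls inside any indexed region."""
    intervals = index.get(chrom, [])
    starts = [s for s, _ in intervals]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

# Toy regions derived from RNA-Seq read pile-ups (0-based, half-open) and
# toy variants from whole genome sequencing.
active_regions = [("chr1", 1000, 2500), ("chr1", 5000, 6200), ("chr2", 300, 900)]
variants = [("chr1", 1500), ("chr1", 4000), ("chr2", 450)]

idx = build_index(active_regions)
for chrom, pos in variants:
    label = "expressed" if in_active_region(idx, chrom, pos) else "not expressed"
    print(f"{chrom}:{pos}\t{label}")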

Eric Olson, Ph.D., and  Hugh Arnold, Ph.D., from Geospiza will present. 

Tuesday, December 4, 2012

Commonly Rare

Rare is the new common. The final month of the year is always a good time to review progress and think about what's next.  In genetics, massively parallel next generation sequencing (NGS) technologies have been a dominating theme, and for good reason.

Unlike the previous high-throughput genetic analysis technologies (Sanger sequencing and microarrays), NGS allows us to explore genomes in far deeper ways and measure functional elements and gene expression in global ways.

What have we learned?

Distribution of rare and common variants. From [1] 
The ENCODE project has produced a picture in which a much greater fraction of the genome may be involved in some functional role than previously understood [1]. However, a larger theme has been observing rare variation and trying to understand its impact on human health and disease. Because the enzymes that replicate DNA and correct errors are not perfect, each time a genome is copied a small number of mutations are introduced, on average between 35 and 80. Since sperm are continuously produced, fathers contribute more mutations than mothers, and the number of new mutations increases with the father's age [2]. While the number per child, with respect to their father's contributed three-billion-base genome, is tiny, rare diseases and intellectual disorders can result.

A consequence is that the exponentially growing human population has accumulated a very large number of rare genetic variants [3]. Many of these variants can be predicted to affect phenotype, and many more may modify phenotypes in as yet unknown ways [4,5]. We are also learning that variants generally fall into two categories: they are either common to all populations or confined to specific populations (figure). More importantly, for a given gene the number of rare variants can vastly outnumber the number of previously known common variants.

Another consequence of the high abundance of rare variation is how it affects the resources used to measure variation and map disease to genotypes. For example, microarrays, which have been the primary tool of genome-wide association studies, utilize probes developed from a human reference genome sequence. When rare variants are factored in, many probes have issues ranging from "hidden" variation within a probe to a probe simply not being able to measure a variant that is present. Linkage block size is also affected [6]. What this means is that the best arrays going forward will be tuned to specific populations. It also means we need to devote more energy to developing refined reference resources, because the current tools do not adequately account for human diversity [6,7].

What's next?

Rare genetic variation has been understood for some time. What's new is understanding just how extensive these variants are in the human population, a result of the population recently expanding rapidly under very little selective pressure. Hence, linking variation to health and disease is the next big challenge and the cornerstone of personalized medicine, or, as some prefer, precision medicine. Conquering this challenge will require detailed descriptions of phenotypes, in many cases at the molecular level. As the vast majority of variants, benign or pathogenic, lie outside of coding regions, we will need to deeply understand how the functional elements initially defined by ENCODE are affected by rare variation. We will also need to layer in epigenetic modifications.

For the next several years the picture will be complex.

References:

[1] 1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491 (7422), 56-65 PMID: 23128226

[2] Kong, A., et. al. (2012). Rate of de novo mutations and the importance of father’s age to disease risk Nature, 488 (7412), 471-475 DOI: 10.1038/nature11396

[3] Keinan, A., and Clark, A. (2012). Recent Explosive Human Population Growth Has Resulted in an Excess of Rare Genetic Variants Science, 336 (6082), 740-743 DOI: 10.1126/science.1217283

[4] Tennessen, J., et. al. (2012). Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes Science, 337 (6090), 64-69 DOI: 10.1126/science.1219240

[5] Nelson, M., et. al. (2012). An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People Science, 337 (6090), 100-104 DOI: 10.1126/science.1217876

[6] Rosenfeld JA, Mason CE, and Smith TM (2012). Limitations of the human reference genome for personalized genomics. PloS one, 7 (7) PMID: 22811759

[7] Smith TM., and Porter SG. (2012) Genomic Inequality. The Scientist.



Sunday, April 22, 2012

Sneak Peek: A Practical Approach to Detecting Nucleotide Variants in NGS Data


Join us Thursday, May 3, 2012 9:00 am (Pacific Time) for a webinar on analyzing DNA sequencing data with hundreds of thousands to millions of nucleotide variants.

Description:
This webinar discusses DNA variant detection using Next Generation Sequencing for targeted and exome resequencing applications, as well as whole transcriptome sequencing. The presentation includes an overview of each application and its specific data analysis needs and challenges, with a particular emphasis on variant detection methods and approaches for individual samples as well as multi-sample comparisons. For in-depth comparisons of variant detection methods, Geospiza’s cloud-based GeneSifter® Analysis Edition software will be used to assess sample data from NCBI’s GEO and SRA.

For more information, please visit the registration page.


Tuesday, November 8, 2011

BioData at #SC11

Next week, Nov 12-18, Supercomputing (SC11) comes to Seattle. On Wed, Nov 15 at 12:15-1:15 pm, @finchtalk (me) will host a Birds-of-a-Feather session on "Technologies for Managing BioData" in room TCC305.

I'll kick off the session by sharing stories from Geospiza's work experiences and the work of others. If you have a story to share, please bring it. The session will provide an open platform. We plan to cover relational databases, HDF5 technologies, and NoSQL. If you want to join in because you are interested in learning, the abstract below will give you an idea of what will be discussed.

Abstract:

DNA sequencing and related technologies are producing tremendous volumes of data. The raw data from these instruments need to be reduced, through alignment or assembly, into forms that can be further processed to yield scientifically or clinically actionable information. The entire data workflow process requires multiple programs and information resources. Standard formats and software tools that meet high performance computing requirements are lacking, but technical approaches are emerging. In this BoF, options such as BAM, BioHDF, VCF, and other formats, and corresponding tools, will be reviewed for their utility in meeting a broad set of requirements. The goal of the BoF is to look beyond DNA sequencing and discuss the requirements for data management technologies that can integrate sequence data with data collected from other platforms such as quantitative PCR, mass spectrometry, and imaging systems. We will also explore the technical requirements for working with data from large numbers of samples.

Thursday, October 13, 2011

Personalities of Personal Genomes


"People say they want their genetic information, but they don’t." "The speaker's views of data return are frankly repugnant." These were some of the [paraphrased] comments and tweets expressed during Cold Spring Harbor's fourth annual conference entitled "Personal Genomes" held Sep 30 - Oct 2, 2011. The focus of which was to explore the latest technologies and approaches for sequencing genomes, exomes, and transcriptomes in the context of how genome science is, and will be, impacting clinical care. 

The future may be closer than we think

In previous years, the concept of personal genome sequencing as a way to influence medical treatment was a vision. Last year, the reality of the vision was evident through a limited number of examples. This year, several new examples were presented along with the establishment of institutional programs for genomics-based medicine. The driver is the continuing decrease in data collection costs combined with corresponding access to increasing amounts of data. According to Richard Gibbs (Baylor College of Medicine), we will have close to 5,000 genomes completely sequenced by the end of this year, and by the end of 2012, 30,000 complete genome sequences are expected.

The growth of genome sequencing is now significant enough that leading institutions are beginning to establish guidelines for genomics-based medicine. Hence, an ethics panel discussion was held during the conference. The conversation about how DNA sequence data may be used has been an integral part of the discussion since the beginning of the Genome Project. Indeed, James Watson shared his lament at having to fund ethics research and directly asked the panel if they have done any good. There was a general consensus, from the panel and from audience members who have had their genomes sequenced, that ethics funding has helped by establishing genetic counseling and education practices.

However, as pointed out by some audience members, this ethics panel, like many others, focused too heavily on the risks to individuals and society of having their genomic data. In my view, the discussion would have been more interesting and balanced if the panel had included individuals who are working outside of institutions with new approaches for understanding health. Organizations like 23andMe, PatientsLikeMe, or the Genetic Alliance bring a very different and valuable perspective to the conversation.

Ethics was only a fraction of the conference. The remaining talks were organized into six sessions that covered personal cancer genomics, medically actionable genomics, personal genomes, rare diseases, and clinical implementations of personal genomics. The key message from these presentations and posters was that, while genomics-based medical approaches have demonstrated success, much more research needs to be done before such approaches become mainstream.

For example, in the case of cancer genomics, whole genome sequences from tumor and normal cells can give a picture of point mutations and structural rearrangements, but these data need to be accompanied by exome sequences to get the high read depth needed to accurately detect the low levels of rare mutations that may be dysregulating cell growth or conferring resistance to treatment. Yet the resulting profiles of variants are still inadequate to fully understand the functional consequences of the mutations. For this, transcriptome profiling is needed, and that is just the start.
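
To see why read depth matters so much for low-level mutations, consider a simple binomial back-of-the-envelope calculation. This is only a sketch: the depths, allele fractions, and read threshold below are hypothetical, and sequencing error is ignored.

from math import comb

def detection_probability(depth, af, min_reads=5):
    """P(at least min_reads variant-supporting reads) with reads ~ Binomial(depth, af)."""
    p_fewer = sum(comb(depth, k) * af**k * (1 - af) ** (depth - k) for k in range(min_reads))
    return 1.0 - p_fewer

# A mutation present in 5-25% of the sampled DNA is easy to miss at typical
# whole-genome depth but reliably seen at exome-level depth.
for depth in (30, 100, 500, 1000):
    for af in (0.05, 0.10, 0.25):
        p = detection_probability(depth, af)
        print(f"depth={depth:5d}  allele fraction={af:.2f}  P(detect) = {p:.3f}")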

Once the data are collected, they need to be processed in different ways, filtered, and compared within and between samples. Information from many specialized databases will be used in conjunction with statistical analyses to develop insights that can be validated through additional assays and measurements. Finally, a lab seeking to do this work, and return results to patients, will also need to be certified, minimally to CLIA standards. For many groups this is a significant undertaking, and good partners with experience and strong capabilities, like PerkinElmer, will be needed.

Further Reading

Nature Coverage, Oct 6 issue:
Genomes on prescription
Other news and information:


Friday, June 10, 2011

Sneak Peek: NGS Resequencing Applications: Part I – Detecting DNA Variants

Join us next Wednesday, June 15, for a webinar on resequencing applications.

Description
This webinar will focus on DNA variant detection using Next Generation Sequencing for targeted and exome resequencing applications, as well as whole transcriptome sequencing. The presentation will include an overview of each application and its specific data analysis needs and challenges. Topics covered will include secondary analysis (alignments, reference choices, variant detection), with a particular emphasis on DNA variant detection as well as multi-sample comparisons. For in-depth comparisons of variant detection methods, Geospiza’s cloud-based GeneSifter Analysis Edition software will be used to assess sample data from NCBI’s GEO and SRA. The webinar will also include a short presentation on how these tools can be deployed for individual researchers as well as through Geospiza’s Partner Program for NGS sequencing service providers.

Details:
Date and time: Wednesday, June 15, 2011 10:00 am
Pacific Daylight Time (San Francisco, GMT-07:00)
Wednesday, June 15, 2011 1:00 pm 
Eastern Daylight Time (New York, GMT-04:00)
Wednesday, June 15, 2011 6:00 pm
GMT Summer Time (London, GMT+01:00)
Duration: 1 hour

Tuesday, June 7, 2011

DOE's 2011 Sequencing, Finishing, Analysis in the Future Meeting

Cactus at Bandelier National Monument
Last week, June 1-3, the Department of Energy held its annual Sequencing, Finishing, Analysis in the Future (SFAF) meeting in Santa Fe, New Mexico.  SFAF is also sponsored by the Joint Genome Institute and Los Alamos National Laboratory, and was attended by individuals from the major genome centers, commercial organizations, and smaller labs.

In addition to standard presentations and panel discussions from the genome centers and sequencing vendors (Life Technologies, Illumina, Roche 454, and Pacific Biosciences), and commercial tech talks, this year's meeting included a workshop on hybrid sequence assembly (mixing Illumina and 454 data, or Illumina and PacBio data). I also presented recent work on how 1000 Genomes and Complete Genomics data are changing our thinking about genetics (abstract below).

John McPherson from the Ontario Institute for Cancer Research (OICR, a Geospiza client) gave the kickoff keynote. His talk focused on challenges in cancer sequencing, one of them being that DNA sequencing costs are now dominated by instrument maintenance, sample acquisition, preparation, and informatics, which are never included in the $1000 genome conversation. OICR is now producing 17 trillion bases per month, and as they, and others, learn about cancer's complexity, the idea of finding single biochemical targets for magic-bullet treatments is becoming less likely.

McPherson also discussed how OICR is getting involved in clinical cancer sequencing. Because cancer is a genetic disease, measuring somatic mutations and copy number variations will be best for developing prognostic biomarkers. However, measuring such biomarkers in patients in order to calibrate treatments requires a fast turnaround time between tissue biopsy, sequence data collection, and analysis. Hence, McPherson sees Ion Torrent and PacBio as the best platforms for future assays. McPherson closed his presentation by stating that data integration is the grand challenge.  We're on it!

The remaining talks explored several aspects of DNA sequencing, ranging from high-throughput single-cell sample preparation, to sequence alignment and de novo sequence assembly, to education and interesting biology. I especially liked Dan Distel's (New England Biolabs) presentation on the wood-eating microbiome of shipworms. I learned that shipworms are actually little clams that use their shells as drills to harvest the wood. Understanding how the bacteria eat wood is important because we may be able to harness this ability for future energy production.

Finally, there was my presentation for which I've included the abstract.

What's a referenceable reference?

The goal behind investing time and money into finishing genomes to high levels of completeness and accuracy is that they will serve as reference sequences for future research. Reference data are used as a standard to measure sequence variation and genomic structure, and to study gene expression in microarray and DNA sequencing assays. The depth and quality of information that can be gained from such analyses is a direct function of the quality of the reference sequence and the level of annotation. However, finishing genomes is expensive, arduous work. Moreover, in light of what we are learning about genome and species complexity, it is worth asking whether a single reference sequence is the best standard of comparison in genomics studies.

The human genome reference, for example, is well characterized, annotated, and represents a considerable investment. Despite these efforts, it is well understood that many gaps exist in even the most recent version (hg19, build 37) [1], and many groups still use the previous version (hg18, build 36). Additionally, data emerging from the 1000 Genomes Project, Complete Genomics, and others have demonstrated that the variation between individual genomes is far greater than previously thought. This extreme variability has implications for genotyping microarrays, deep sequencing analysis, and other methods that rely on a single reference genome. Hence, we have analyzed several commonly used genomics tools that are based on the concept of a standard reference sequence and have found that their underlying assumptions are incorrect. In light of these results, the time has come to question the utility and universality of single genome reference sequences and evaluate how best to understand and interpret genomics data in ways that take a high level of variability into account.

Todd Smith(1), Jeffrey Rosenfeld(2), Christopher Mason(3). (1) Geospiza Inc. Seattle, WA 98119, USA (2) Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY 10024, USA (3) Weill Cornell Medical College, New York, NY 10021, USA

Kidd JM, Sampas N, Antonacci F, Graves T, Fulton R, Hayden HS, Alkan C, Malig M, Ventura M, Giannuzzi G, Kallicki J, Anderson P, Tsalenko A, Yamada NA, Tsang P, Kaul R, Wilson RK, Bruhn L, & Eichler EE (2010). Characterization of missing human genome sequences and copy-number polymorphic insertions. Nature methods, 7 (5), 365-71 PMID: 20440878

You can obtain abstracts for all of the presentations at the SFAF website.

Friday, February 11, 2011

Variant Analysis and Sequencing Labs

Yesterday and two weeks ago, Geospiza released two important news items.  The first announced PerkinElmer's selection of the GeneSifter® Lab and Analysis systems to support their new DNA sequencing service. The second was an announcement of our new SBIR award to improve variant detection software.

Why are these important?

The PerkinElmer news is another validation of the fact that software systems need to integrate laboratory operations and scientific data analysis in ways that go deeper than sending computing jobs to a server (links below).  To remain competitive, service labs can no longer satisfy customer needs by simply delivering data. They must deliver a unit of information that is consistent with the experiment, or assay, being conducted by their clients.  PerkinElmer recognizes this fact, along with our many other customers who participate in our partner program and work with our LIMS (GSLE) and Analysis (GSAE) systems to support their clients doing Sanger sequencing, Next Gen sequencing, or microarray analysis.

The SBIR news communicates our path toward making the units of information that go with whole genome sequencing, targeted sequencing, exome sequencing, allele analysis, and transcriptome sequencing as rich as possible. Through the project, we will add new applications to our analysis capabilities that solve three fundamental problems related to data quality analysis, information integration, and the user experience.

  • Through advanced data quality analysis we can address the question: if we see a variant, can we determine whether the difference is random noise, a systematic error, or biological signal (see the sketch after this list)?
  • Information integration will help scientists and clinical researchers quickly add biological context to their data.
  • New visualization and annotation interfaces will help individuals explore the high-dimensional datasets resulting from the hundreds and thousands of samples needed to develop scientific and clinical insights.
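
As one illustration of the noise-versus-signal question (not the method being developed under the SBIR award), the non-reference read count at a single position can be compared against a platform's expected base-calling error rate with a binomial model. The error rate, read counts, and significance threshold below are hypothetical.

from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed from the complement."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))

ERROR_RATE = 0.01  # hypothetical per-base error rate for the platform

# Observed (non-reference reads, total reads) at three positions.
observations = [(3, 200), (12, 200), (95, 200)]

for alt, depth in observations:
    # How likely is this many non-reference reads from sequencing error alone?
    p_error_alone = prob_at_least(alt, depth, ERROR_RATE)
    call = "likely real variant" if p_error_alone < 1e-3 else "consistent with noise"
    print(f"{alt:3d}/{depth} non-reference reads  P(error alone) = {p_error_alone:.2e}  -> {call}")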


Further Reading

Geospiza wins $1.2 M grant to add new DNA variant application to GeneSifter software.

Wednesday, December 15, 2010

Genomics and Public Health: Epidemiology in Haiti

While there is debate and discussion about how genomics will be used in public health at a personalized medicine level, it is clear that rapid high-throughput DNA sequencing has immediate utility in epidemiological applications that seek to understand the origins of disease outbreaks.

In the most recent application, published this month, researchers at Pacific Biosciences (PacBio) used their PacBio RS sequencing system to identify the origins of the cholera outbreak in Haiti.  According to the article, cholera has been present in Latin America since 1991, but it had not been epidemic in Haiti for at least 100 years. When the recent outbreak began in October of this year, it was important to determine the origins of the disease, especially since it had been concluded that Haiti had a low cholera risk following the earthquake. Understanding the origins of a disease can help define virulence and resistance mechanisms to guide therapeutic approaches.

Sequencing organisms to discover their origins in outbreaks is not new. What is new is the speed at which this can be done.  For example, it took two months for the SARS virus to be sequenced after the epidemic started. In the recent work, the sequencing was completed in four days.  And it was not just one isolate that was sequenced, but two, with genomes 40 times larger.

When the Haiti sequences were compared to the sequences of 23 other V. cholerae strains, the data indicated that the Haiti strain matched strains from South Asia more closely than the endemic strains from Latin America. This finding tells us that the strain was likely introduced, perhaps by aid workers.  Additional sequence analysis of the cholera toxin genes also confirmed that the strain causing the epidemic produces more severe disease. From a public health perspective this is important because the less virulent, easier-to-treat, endemic strains can be displaced by more aggressive strains. The good news is that the new strain is sensitive to tetracycline, a first-line antibiotic.

The above work clearly demonstrates how powerful DNA sequencing is for strain identification. The authors favored single-molecule sequencing on PacBio because its cycle time is shorter than that of second-generation technologies like Illumina, SOLiD, and 454, and its long read lengths better handle repeats. While these points may be argued by the respective instrument vendors, it is clear that we are entering an era where we can go very quickly from isolating infectious agents to classifying them at very high resolution in unprecedented ways.  DNA sequencing will have a significant role in diagnosing infectious agents.

Further reading:

Scientists Trace Origin of Recent Cholera Epidemic in Haiti, HHMI News
The Origin of the Haitian Cholera Outbreak Strain, NEJM 2010

Wednesday, November 3, 2010

Samples to Knowledge

Today Geospiza and Ingenuity announced a collaboration to integrate our respective GeneSifter Analysis Edition (GSAE) and Ingenuity Pathway Analysis (IPA) software systems. 

Why is this important?

Geospiza has always been committed to providing our customers the most complete software systems for genetic analysis. Our LIMS [GeneSifter Laboratory Edition (GSLE)] and GSAE work together to form a comprehensive samples-to-results platform. From core labs, to individual research groups, to large-scale sequencing centers, GSLE is used for collecting sample information, tracking sample processing, and organizing the resulting DNA sequences, microarray files, and other data. Advanced quality reports keep projects on track and within budget.

For many years, GSAE has provided a robust and scalable way to scientifically analyze the data collected from many samples. Complex datasets are reduced and normalized to produce quantitative values that can be compared between samples and within groups of samples. Additionally, GSAE has integrated open-source resources like Gene Ontology and KEGG pathways to explore the biology associated with lists of differentially expressed genes. In the case of Next Generation Sequencing, GSAE has had the most comprehensive and integrated support for the entire data analysis workflow, from basic quality assessment to sequence alignment and comparative analysis.

With Ingenuity we will be able to take data-driven biology exploration to a whole new level.  The IPA system is a leading platform for discovering pathways and finding the relevant literature associated with genes and lists of genes that show differential expression in microarray analysis. Ingenuity's approach focuses on combining software curation with expert review to create a state-of-the-art system that gets scientists to actionable information more quickly than conventional methods.  

Through this collaboration two leading companies will be working together to extend their support for NGS applications. GeneSifter's pathway analysis capabilities will increase and IPA's support will extend to NGS. Our customers will benefit by having access to the most advanced tools for turning vast amounts of data into biologically meaningful results to derive new knowledge.

Samples to Results™ becomes Samples to Knowledge™

Thursday, October 28, 2010

Bloginar: Making cancer transcriptome sequencing assays practical for the research and clinical scientist

A few weeks back, we (Geospiza and Mayo Clinic) presented a research poster at BioMed Central’s Beyond the Genome conference. The objective was to present GeneSifter’s analysis capabilities and discuss the practical issues scientists face when using Next Generation DNA Sequencing (NGS) technologies to conduct clinically oriented research related to human health and disease.

Abstract
NGS technologies are increasing in their appeal for studying cancer. Fully characterizing the more than 10,000 types and subtypes of cancer to develop biomarkers that can be used to clinically define tumors and target specific treatments requires large studies that examine specific tumors in 1000s of patients. This goal will fail without significantly reducing both data production and analysis costs so that the vast majority of cancer biologists and clinicians can conduct NGS assays and analyze their data in routine ways.

While sequencing costs are now inexpensive enough for small groups and individuals, beyond genome centers, to conduct the needed studies, the current data analysis methods need to move from large bioinformatics team approaches to automated methods that employ established tools in scalable and adaptable systems to provide standard reports and make results available for interactive exploration by biologists and clinicians. Mature software systems and cloud computing strategies can achieve this goal.

Poster Layout
Excluding the title, the poster has five major sections. The first section includes the abstract (above) and study parameters. In this work, we examined RNA from 24 head and neck cancer biopsies: tumor and normal cells from 12 individuals.

The remaining sections (2-5) provide a background on NGS challenges, applications, and high-level data analysis workflows; the analysis pipeline used in the work; the comparative analyses that need to be conducted; and practical considerations for groups seeking to do similar work. Much of section 2 has been covered in previous blogs and research papers.

Section 3: Secondary Analysis Explores Single Samples
NGS challenges are best known for the amount of data produced by the instruments. While this challenge should not be undervalued, it is over-discussed. A far greater challenge lies in the complexity of data analysis. Once the first step (primary analysis, or basecalling) is complete, the resulting millions of reads must be aligned to several collections of reference sequences. For human RNA samples, these include the human genome, splice junction databases, and others, to measure biological processes and filter out reads arising from artifacts of sample preparation. Aligned data are further processed to create tables that annotate individual reads and compute quantitative values describing how the sample’s reads align to (or cover) regions of the genome or span exon boundaries. If the assay measures sequence variation, alignments must be further processed to create variant tables.

Secondary analysis produces a collection of data in forms that can be immediately examined to understand overall sample quality and characteristics. High-level summaries indicate how many reads align to things we are interested in and not interested in. In GeneSifter, these summaries are linked to reports that show additional detail. Gene List reports, for example, show how the sample reads align within a gene’s boundaries. Pictures in these reports are linked to GeneSifter's Gene Viewer reports, which provide even greater detail about the data with respect to each read’s alignment orientation and observed variation.

An important point about secondary analysis, however, is that it focuses on single-sample analyses. As more samples are added to a project, the data from each sample must be processed through an assay-specific pipeline. This point is often missed in the NGS data analysis discussion. Moreover, systems supporting this work must not only automate hundreds of secondary analysis steps, they must also provide tools to organize the input and output data in project-based ways for comparative analysis.
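
As a toy illustration of the single-sample reduction step, the sketch below counts aligned reads per gene from simplified inputs. Production pipelines work from BAM alignments and full gene annotation rather than the hard-coded intervals and read positions used here.

from collections import defaultdict

# Toy gene model: gene -> (chrom, start, end). Real pipelines use full GTF/GFF annotation.
genes = {
    "GENE_A": ("chr1", 1000, 5000),
    "GENE_B": ("chr1", 8000, 12000),
    "GENE_C": ("chr2", 500, 3000),
}

# Toy aligned reads as (chrom, leftmost position); real data come from BAM alignments.
aligned_reads = [("chr1", 1200), ("chr1", 1300), ("chr1", 9000),
                 ("chr2", 700), ("chr2", 2900), ("chr1", 20000)]

counts = defaultdict(int)
unassigned = 0
for chrom, pos in aligned_reads:
    for gene, (g_chrom, start, end) in genes.items():
        if chrom == g_chrom and start <= pos < end:
            counts[gene] += 1
            break
    else:
        unassigned += 1  # reads outside annotated genes (intergenic, artifacts, etc.)

for gene in genes:
    print(f"{gene}\t{counts[gene]}")
print(f"unassigned\t{unassigned}")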

Section 4: Tertiary Analysis in GeneSifter Compares Data Between Samples
The science happens in NGS when data are compared between samples in statistically rigorous ways. RNA sequencing makes it possible to compare gene expression, exon expression, and sequence variation between samples to identify differentially expressed genes, their isoforms, and whether certain alleles are differentially expressed. Additional insights are gained when gene lists can be examined in pathways and by ontologies. GeneSifter performs these activities in a user-friendly web environment.

The poster's examples show how gene expression can be globally analyzed for all 24 samples, how a splicing index can distinguish gene isoforms occurring in tumor but not normal cells, and how sequence variation can be viewed across all samples. Principal component analysis shows that genes in tumor cells are differentially expressed relative to normal cells. Genes highly expressed in tumor cells include those related to the cell cycle and other pathways associated with unregulated cell growth. While these observations are not novel, they do confirm our expectations about the samples, and being able to make such observations with just a few clicks prevents working on costly, misleading observations. For genes showing differential exon expression, GeneSifter provides ways to identify those genes and navigate to the alignment details. Similarly, reports that show differential variation between samples can be filtered by multiple criteria and link to additional annotation details and read alignments.
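
A minimal sketch of one such between-sample comparison: counts-per-million normalization and a t-test for a single gene across tumor and normal groups. The counts are invented, SciPy is assumed to be available, and GeneSifter's actual statistical methods are not reproduced here.

from statistics import mean
from math import log2
from scipy.stats import ttest_ind

# Invented raw counts for one gene across 6 tumor and 6 normal libraries,
# plus the total mapped reads per library for normalization.
tumor_counts = [850, 920, 780, 1010, 890, 940]
tumor_totals = [12e6, 15e6, 11e6, 14e6, 13e6, 12.5e6]
normal_counts = [210, 190, 250, 230, 205, 220]
normal_totals = [12.5e6, 13e6, 11.5e6, 14.5e6, 12e6, 13.5e6]

def cpm(counts, totals):
    """Counts per million mapped reads, a simple library-size normalization."""
    return [c / t * 1e6 for c, t in zip(counts, totals)]

tumor_cpm = cpm(tumor_counts, tumor_totals)
normal_cpm = cpm(normal_counts, normal_totals)

fold_change = log2(mean(tumor_cpm) / mean(normal_cpm))
t_stat, p_value = ttest_ind(tumor_cpm, normal_cpm)

print(f"log2 fold change (tumor/normal): {fold_change:.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")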

Section 5: Practical Considerations
Complete NGS data analysis systems seamlessly integrate secondary and tertiary analysis. Presently, no other systems are as complete as GeneSifter. There are several reasons why this is the case. First, a significant amount of software must be produced and tested to create such a system. From complex data processing automation, to advanced data queries, to user interfaces that provide interactive visualizations and easy data access, to security, software systems must employ advanced technologies and take years to develop with experienced teams. Second, meeting NGS data processing requirements demands that computer systems be designed with distributable architectures that can support cloud environments in local and hosted configurations. Finally, scientific data systems must support both predefined and ad hoc query capabilities. The scale of NGS applications means that non-traditional approaches must be used to develop data persistence layers that can support a variety of data access methods and, for bioinformatics, this is a new problem.

Because Geospiza has been doing this kind of work for over a decade and could see the coming challenges, we’ve focused our research and development in the right ways to deliver a feature rich product that truly enables researchers to do high quality science with NGS.

Enjoy the poster.

Wednesday, September 29, 2010

A Genomics Genealogy

Deep sequencing technologies have radically changed how we study biology. Deciding what technology and software to use can be daunting. Choices become easier when the relationships between different DNA sequencing applications are understood.


A brief history 

DNA sequencing grew from our desire to understand how the instructions for the biochemistry of life are encoded in an organism’s DNA. If we know the precise ordering and organization of an organism’s DNA sequence, we can presumably unlock a code that reveals these instructions. Accomplishing this goal required the creation of a new field, molecular biology, and new technologies to sequence genes.

The first sequencing methods were arduous. They combined nuclease digestion with thin-layer chromatography to measure di- and trinucleotides that could be puzzled together. Later, Maxam and Gilbert replaced enzymatic DNA degradation with a chemical fragmentation method that enabled the reading of ordered bases from 32P-labeled fragments separated by electrophoresis.

The Sanger method, which used dideoxynucleotide triphosphates to create ensembles of DNA molecules terminated at each base, soon replaced Maxam-Gilbert sequencing. The next innovation was to color-code DNA with fluorescent dyes so that molecules could be interrogated with a laser and camera coupled to a computer. This innovation automated “high-throughput” DNA sequencing systems, initially with polyacrylamide gels and later with capillary electrophoresis, and made it possible to sequence the human and other genomes. It also created the first transcriptome analysis method, Expressed Sequence Tag (EST) sequencing.

Despite 20 years of advances, however, the high-throughput sequencing methods were not high-enough-throughput to realistically interrogate DNA and RNA molecules in creative ways. Big questions (genomes, ESTs, meta-genomes) required large, factory-like approaches to automate sample preparation and collect sequences because a fundamental problem had yet to be solved: each sequence was obtained from an individual purified DNA clone or PCR product.

Real high-throughput is massively parallel throughput 

The next-generation DNA sequencing (NGS) technologies free researchers from the need to clone or purify every molecule. They all share the common innovation that DNA sequencing is performed in a massively parallel format. That is, a library, or ensemble of millions of DNA molecules, is simultaneously sequenced. Data collection costs are dramatically decreased through miniaturization and by eliminating the need for warehouses of colony pickers, prep robots, sequencing instruments, and large teams of people.

The new problem is dealing with the data that are produced and with the increasing costs of computation. As NGS opens new possibilities to measure DNA and RNA in novel ways, each application requires a specific laboratory procedure that must be coupled to a specific analysis methodology.

Sequencing genealogy is defined by the questions 

In an evolutionary model, the history of cloning, restriction site mapping, and Sanger sequencing form the trunk of the genomics application tree (top figure) from which branches develop as new applications emerge.

NGS has driven the evolution of three main sequencing branches: De Novo, Functional Genomics, and Variation Assays. The De Novo, or Exploratory, branch contains three subbranches: new genomes (projects that seek to determine the complete genome sequence of an organism), meta-genomes (projects in which DNA fragments are sequenced from environmental samples), and meta-transcriptomes (projects in which cDNA fragments are sequenced from environmental samples).


The Functional Genomics branch is growing fast. In these experiments, different collections of RNA or DNA molecules from an organism, tissue, or cells are isolated and sequenced to measure gene expression and how it is regulated. Three subbranches describe the different kinds of functional genomics: Expression, Regulation, and EpiGenomics. Each of these subbranches can be further divided into specific assay groups (DGE, RNA-Seq, small RNA, etc.) that can be further subdivided into specialized procedures (e.g., RNA-Seq with strandedness preserved) that are defined by laboratory protocols, kits, and instruments. When the experiments are refined and made reproducible, they become assays.

Variation Assays form the third main branch of the tree. Genomic sequences are compared within and between populations to link genotype and phenotype. In special cases, like cancer and immunology research, variation assays are used to observe changes within an organism’s somatic genomes over time. Today, variation, or resequencing, assays measure nucleotide changes and small insertions and deletions in whole genomes and exomes. If linked sequence strategies (mate-pairs, paired-ends) are used, larger structural changes, including copy number variations, can also be measured.

Why is this important?

As a software provider with both deep lab and analysis experience, we [Geospiza] are often asked which instrument platform is the best or how our software stacks up against other available options. The answer, of course, depends on what you want to do. De Novo applications benefit from the long reads offered by platforms like 454. Many of the assay-based applications demand ultra-deep sequencing with very high numbers of sequences (reads), as provided by the short-read platforms (Illumina, SOLiD). New single-molecule sequencing platforms like PacBio's are targeting a wide range of applications but have so far been best demonstrated for long-read uses and novel methylation assays.

From an informatics perspective, the exploratory and assay-based branches have distinct software requirements. Exploratory applications require that reads be assembled into contigs that must be further ordered into scaffolds to get to the complete sequence. In meta-genomics or meta-transcriptomics applications, data are assembled to obtain gene sequences. These projects are further complicated by orthologous and paralogous sequences and by highly expressed genes that over-represent certain sequences. In these situations, specialized hardware or complex data-reduction strategies are needed to make assembly practical. Once data are assembled, they are functionally annotated in a second computational phase using tools like BLAST.

Assay-based data analysis also has two distinct phases, but they are significantly different from those of De Novo sequencing. The first phase involves aligning (or mapping) reads to reference data sources and then reducing the aligned data into quantitative values. At least one reference is required, and the better it is annotated, the more informative the initial results will be. Alignment differs from assembly in that reads are separately compared to a reference rather than amongst themselves. Alignment processing capacity can be easily scaled with multiple inexpensive computers, whereas assembly processing cannot.
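
The reason alignment scales so easily is that each read is compared to the reference independently, so a read file can be split into chunks and each chunk aligned as a separate job. The sketch below illustrates the idea; the aligner command ("my_aligner"), reference file, and input file are placeholders, not any particular vendor's pipeline.

import subprocess
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

def split_fastq(path, reads_per_chunk=1_000_000):
    """Write reads_per_chunk 4-line FASTQ records per chunk and yield chunk file names."""
    with open(path) as fh:
        chunk_id = 0
        while True:
            lines = list(islice(fh, reads_per_chunk * 4))
            if not lines:
                break
            chunk_path = f"chunk_{chunk_id}.fastq"
            with open(chunk_path, "w") as out:
                out.writelines(lines)
            yield chunk_path
            chunk_id += 1

def align(chunk_path):
    # Placeholder aligner invocation; substitute a real aligner, reference, and options.
    cmd = ["my_aligner", "--ref", "genome.fa", "--reads", chunk_path,
           "--out", chunk_path.replace(".fastq", ".sam")]
    subprocess.run(cmd, check=True)
    return chunk_path

if __name__ == "__main__":
    chunks = list(split_fastq("sample.fastq"))
    # Each chunk aligns independently, so the work spreads across cores or machines.
    with ProcessPoolExecutor() as pool:
        for done in pool.map(align, chunks):
            print(f"aligned {done}")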

The second phase of assay-based sequencing is to produce a discrete output as defined by a diagnostic application, or to compare the quantitative values computed from the alignments of several samples obtained from different individuals and (or) treatments relative to controls. This phase requires statistical tools to normalize data, filter false positives and negatives, and measure differences. Assay-based applications become more informative when large numbers of samples and replicates are included in a study.

Connecting the dots 

While the sequencing applications can be grouped and summarized in different ways, they are also interrelated. For example, De Novo projects are open-ended and exploratory, but their end product, a well-annotated reference sequence, is the foundation for Functional Genomics and Variation applications. Variation analysis is only useful if we can assign function to specific genotypes. Functional assignments come, in part, from previous experiments and genomic annotations, but are increasingly being produced by sequencing assays, so the new challenge is integrating the data obtained from different assays into coherent datasets that can link many attributes to a set of genotypes.

NGS clearly opens new possibilities for studying and characterizing biological systems. Different applications require different sequencing platforms, laboratory procedures, and software systems that can organize analysis tools and automate data processing. On this last point, as you evaluate your projects and your options for success, you need to identify informatics groups that have deep experience, available solutions, and strong capabilities to meet the next challenges. Geospiza is one such group.

Further Reading

DNA Sequencing History

Gilbert W, Maxam A (1973) The nucleotide sequence of the lac operator. Proc Natl Acad Sci U S A 70:3581

Maxam AM, Gilbert W (1977) A new method for sequencing DNA. Proc Natl Acad Sci U S A 74:560

Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74:5463-7

Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SB, Hood LE (1986) Fluorescence detection in automated DNA sequence analysis. Nature 321:674-9

Adams MD, Soares MB, Kerlavage AR, Fields C, Venter JC (1993) Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4:373-80

International Human Genome Sequencing Consortium, 2001. “Initial sequencing and analysis of the human genome.” Nature 409, 860-921.
Venter J.C., Adams M.D., Myers E.W., et. al. 2001. “The sequence of the human genome.” Science 291, 1304-1351.

FinchTalks

Saturday, September 11, 2010

The Interface Needs an Interface

A recent publication on Galaxy proposes that it is the missing graphical interface for genomics. Let’s find out.

The tag line of Michael Schatz’s article in Genome Biology states, “The Galaxy package empowers regular users to perform rich DNA sequence analysis through a much-needed and user-friendly graphical web interface.” I would take this description, and Schatz’s later comment that “the ambitious goal of Galaxy is to empower regular users to carry out their own computational analysis without having to be an expert in computational biology or computer science,” to mean that someone, like a biologist, who does not have much bioinformatics or computer experience could use the system to analyze data from a microarray or next gen sequencing experiment.

The Galaxy package is a software framework for running bioinformatics programs and assembling those programs into complex pipelines referred to as workflows. It employs a web interface and ships with a collection of tools, and examples, to get a biologist quickly up and running. Given that Galaxy targets bioinformatics, it is reasonable to assume that its regular users are biologists. So the appropriate question is: how much does a biologist have to know about computers to use the system?

To test this question I decided to install the package. I have a Mac, and as a biologist, whether I’m on a Mac or PC, I expect that, if I’m given a download option, the software will be easy to download and can be installed with a double-click installer program. Galaxy does not have this. Instead, it uses a command line tool (oh, I need to use Terminal) that requires Mercurial (hg). Hmm, what’s Mercurial? Mercurial is a version control system that supports distributed development projects. This is not quite what I expected, but I’ll give it a try. I go to the hg (someone has a chemistry sense of humor) site and without too much trouble find a Mac OS X package, which uses a double-click installer program. I’m in luck - of course I’ll ignore the note that you might have to add export LC_ALL=en_US.UTF-8 and export LANG=en_US.UTF-8 to your ~/.profile file - hg installs and works.

Now back to my terminal. I ignore the python version check and path setup commands, and type hg clone http://www.bx.psu.edu/hg/galaxy galaxy_dist; things happen. I follow the rest of the instructions - cd galaxy_dist; sh setup.sh - and finally I start Galaxy with the sh run.sh command. I go to my web browser, type http://localhost:8080, and Galaxy is running! Kudos to the Galaxy team for making a typically complicated process relatively simple. I’m also glad that I had none of the possible documented problems. However, to get this far, I had to tap into my Unix experience.

With Galaxy running, I can now see if Schatz’s claims stand up. What should I do? The left-hand menu gives me a huge number of choices. There are 31 categories that organize input/output functions, file manipulation tools, graphing tools, statistical analysis tools, analysis tools, NGS tools, and SNP tools - perhaps 200 choices of things to do. I’ll start with something simple, like displaying the quality values in an Illumina NGS file. To do this, I click on “upload file” under the get data menu. Wow! There are 56 choices of file formats - and 17 have explanations. Fortunately there is an auto-detect. I leave that option, go to the choose file button to select an NGS file on my hard drive, and load it in. I’ll ignore the comment that files greater than 2GB should be uploaded by an http/ftp URL, because I don’t know what they are talking about. Instead I’ll make a small test file with a few thousand reads. I’ll also ignore the URL/text box, the choice to convert spaces to tabs, and the genome menu that seems to have hundreds of genomes loaded, as these options have nothing to do with a fastq file. I’ll assume “execute” means “save” and click it.

After clicking execute, some activity appears in the right-hand menu indicating that my file is being uploaded. After a few minutes, my NGS file is in the system. To look at quality information, I select the “NGS: QC and manipulation” menu to find a tool. There are 18 options for tools to split files, join files, convert files, and convert quality data in files; this stuff is complicated. Since all I want to do is start with creating some summary statistics, I find and select "FASTQ summary statistics." This opens a page in the main window where I can select the file that I uploaded and click the execute button to generate a big 20-column table that contains one row per base position in the reads. The columns contain information about the frequency of bases and statistical values derived from the quality values in the file. These data are displayed in a text table that is hard to read, so the next step is to view the data graphically in histograms and box plots.

Graphing tools are listed under a different menu, "Graph/Display Data." I like box plots, so I'll select that choice. In the main window I select my summary stats file, create a title for the plot, set the plot's dimensions (in pixels), define the x and y axis titles, and select the columns from the big table that contain the appropriate data. I click the execute button to create files containing the graphs. Oops, I get an error message. It says "/bin/sh: gnuplot: command not found." I have to install gnuplot. To get gnuplot going I have to download the source, compile the package, and install it. To do that I will need developer tools installed, along with gnuplot's other dependencies for image drawing. This is getting to be more work than I bargained for ...
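
Incidentally, the plot itself is also straightforward if Python and matplotlib happen to be installed. A minimal sketch, reusing the hypothetical per_base_quality() helper from the previous snippet (again, an illustration, not a substitute for Galaxy's gnuplot-based tool):

    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    columns = per_base_quality("test_reads.fastq")  # hypothetical test file
    plt.boxplot(columns)
    plt.xlabel("Base position")
    plt.ylabel("Quality value (Phred)")
    plt.title("Per-base quality")
    plt.savefig("per_base_quality_boxplot.png")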

When Schatz said "regular user," he must have meant a Unix-savvy biologist who understands bioinformatics terminology, file formats, and other conventions, and who can install software from source code.

Alternatively, I can upload my data into GeneSifter, select the QC analysis pipeline, navigate to the file summary page, and click the view results link. After all, GeneSifter was designed by biologists for biologists.

Wednesday, July 14, 2010

Increasing the Scale of Deep Sequencing Data Analysis with BioHDF

Last month, at the Department of Energy's Sequencing, Finishing and Analysis in the Future meeting, I presented Geospiza's product development work and how BioHDF is contributing to scalable infrastructures. The abstract, a link to the presentation video, and acknowledgments are posted below.

Abstract

Next Generation DNA Sequencing (NGS) technologies are powerful tools for rapidly sequencing genomes and studying functional genomics. Presently, the value of NGS technology has been largely demonstrated on individual sample analyses. The full potential of NGS will be realized when it can be used in multisample experiments that involve different measurements and include replicates and controls to make valid statistical comparisons. Arguably, improvements in current technology, and soon-to-be-available "third generation" systems, will make it possible to simultaneously measure 100s to 1000s of individual samples in single experiments to study transcription, alternative splicing, and how sequences vary between individuals and within expressed genes. However, several bioinformatics systems challenges must be overcome to effectively manage both the volumes of data being produced and the complexity of processing the numerous datasets that will be generated.

Future bioinformatics applications need to be developed on common standard infrastructures that can reduce overall data storage, increase data processing performance, integrate information from multiple sources, and are self-describing. HDF technologies meet all of these requirements, have a long history, and are widely used in data-intensive science communities. They consist of general data file formats, software libraries, and tools for manipulating the data. Compared to emerging standards such as the SAM/BAM formats, HDF5-based systems demonstrate improved I/O performance and improved methods to reduce data storage. HDF5 is also more extensible and can support multiple data indexes and store multiple data types. For these reasons, HDF5 and its BioHDF implementation are well qualified as standards for implementing data models in binary formats to support the next generation of bioinformatics applications. Through this presentation we will demonstrate BioHDF's latest features in NGS applications that target transcription analysis and resequencing.
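
To make the abstract a little more concrete, here is a toy illustration of the core idea - several related data types living in one chunked, compressed, self-describing HDF5 file - written with the h5py library. The layout, dataset names, and values below are invented for illustration and are not BioHDF's actual data model:

    import numpy as np
    import h5py

    n_reads, read_len = 1000, 36
    qualities = np.random.randint(2, 41, size=(n_reads, read_len), dtype=np.uint8)
    positions = np.sort(np.random.randint(0, 3000000, size=n_reads)).astype(np.int64)

    with h5py.File("toy_ngs_store.h5", "w") as f:
        aln = f.create_group("alignments")
        # Chunked, gzip-compressed datasets reduce storage and allow fast slab reads.
        aln.create_dataset("position", data=positions, compression="gzip", chunks=True)
        aln.create_dataset("quality", data=qualities, compression="gzip", chunks=True)
        # Attributes travel with the data, making the file self-describing.
        aln.attrs["reference"] = "chr1 (toy example)"
        aln.attrs["quality_encoding"] = "Phred+33"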

SciVee Video


Acknowledgments

Contributing Authors: Todd Smith (1), Christopher E Mason (2), Paul Zumbo (2), Mike Folk (3), Dana Robinson (3), Mark Welsh (1), Eric Smith (1), N. Eric Olson (1)

1. Geospiza, Inc., 100 West Harrison N. Tower 330, Seattle, WA 98119
2. Department of Physiology and Biophysics, Weill Cornell Medical College, 1305 York Ave., New York, NY 10021
3. The HDF Group, 1901 S. First St., Champaign, IL 61820

Funding: NIH: STTR HG003792

Tuesday, June 29, 2010

GeneSifter Lab Edition 3.15

Last week we released GeneSifter Laboratory Edition (GSLE) 3.15. From NGS quality control data to improved microarray support, Sanger sequencing support, core lab branding, and much more, there is a host of features and improvements for everyone that continues to make GSLE the leading LIMS for genetic analysis.

The three big features are QC analysis of Next Generation Sequencing (NGS) data, QC analysis of microarrays, and core lab branding support.
  • To better troubleshoot runs and view data quality for individual samples in a multiplex, the data within fastq, fasta, or csfasta (and quality) files are used to generate quality report graphics (figure below). These include the overall base (color) composition, average per-base quality values (QVs), box-and-whisker plots showing the median, lower and upper quartiles, and minimum and maximum QVs at each base position, and an error analysis indicating the number of QVs below 10, 20, and 30. A link is also provided to conveniently view the sequence data in pages, so that GBs of data do not stream into your browser.
  • For microarray labs, quality information from CHP and CEL files and probe intensity data from CEL files are displayed. Please contact support@geospiza.com to activate the Affymetrix Settings and configure the CDF file path and power tools.
  • For labs that use their own ordering systems, a GSLE data view page has been created that can be embedded in a core lab website. To support end user access, a new user role, Data Viewer, has been created to limit access to only view folders and data within the user's lab group. Please contact support@geospiza.com to activate the feature.  
  • The ability to create HTML tables in the Welcome Message for the Home page has been restored to provide additional message formatting capabilities.

Laboratory Operations
Several features and improvements introduced in 3.15 help prioritize steps, update items, improve ease of use, and enhance data handling.

Instrument Runs
  • A time/date stamp has been added to the Instrument Run Details page to simplify observing when runs were completed. 
  • Partial Sanger (CE) runs can be manually completed (like NGS runs) instead of requiring that all reactions be complete or that the remaining ones be failed.
  • The NGS directory view of result files now provides deletion actions (by privileged users) so labs can more easily manage disk usage. 
Sample Handling
  • Barcodes can be recycled or reused for templates that are archived to better support labs using multiple lots of 2D barcode tubes. However, the template barcode field remains unique for active templates. 
  • Run Ready Order Forms allow a default tag for the Plate Label to populate the auto-generated Instrument Run Name to make Sanger run set up quicker. 
  • The Upload Location Map action has been moved to the side menu bar under Lab Setup to ease navigation. 
  • The Template Workflows “Transition to the Next Workflow” action is now in plainer English: “Enter Next Workflow.”
  • All Sanger chromatogram download options are easier to see and now include the option to download .phd formatted files. 
  • The DNA template location field can be used to search for a reaction in a plate when creating a reaction plate.
  • To redo a Sanger reaction with a different chemistry, the chemistry can now be changed when either Requeuing for Reacting is chosen, or Edit Reactions from within a Reaction Set is selected.
Orders and Invoices 
More efficient views and navigation have been implemented for Orders and Invoices.
  • When Orders are completed, the total number of samples and the number of results can be compared on the Update Order Status page to help identify repeated reactions. 
  • A left-hand navigation link has been added for core lab customers to review both Submitted and Complete Invoices. The link is only active when invoicing is turned on in the settings. 
System Management 
Several new system settings now enable GSLE to be more adaptable at customer sites.
  • The top header bar time zone display can be disabled or configured for a unique time zone to support labs with customers in different time zones. 
  • The User account profile can be configured to require certain fields. In addition, if Lab Group is not required, then Lab Groups are created automatically. 
  • Projects within GSLE can be inactivated by all user roles to hide data not being used. 
Application Programming Interface
Several additions to the self-documenting Application Programming Interface (API) have been made.
  • An upload option for Charge Codes within the Invoice feature was added.
  • Form API response objects are now more consistent.
  • API keys for user accounts can be generated in bulk.
  • Primers can be identified by either label or ID.
  • Events have been added. Events provide a mechanism to call scripts or send emails (beyond the current defaults) when system objects undergo workflow changes.  
Presently, APIs can only be activated on local, on-site installations.

Thursday, April 22, 2010

Bloginar: RNA Deep Sequencing: Beyond Proof of Concept

RNA-Seq is a powerful method for measuring gene expression because you can use the deep sequence data to measure transcript abundance and also determine how transcripts are spliced and whether alleles of genes are expressed differentially.  

At this year’s ABRF (Association of Biomolecular Resource Facilities) conference, we presented a poster, using data from a published study, to demonstrate how GeneSifter Analysis Edition (GSAE) can be used in next generation DNA sequencing (NGS) assays that seek to compare gene expression and alternative splicing between different tissues, conditions, or species.

The following map guides the presentation. The poster has a title and four main sections, which cover background information, an introduction to the published work and data, ways to observe alternative splicing and global gene expression differences between samples, and ways to observe sex-specific gene expression differences. The last section also identifies a mistake made by the authors.


Section 1. The first section begins with the abstract and lists five specific challenges created by NGS: 1) high-end computing infrastructures are needed to work with NGS data, 2) NGS data analysis involves complex multistep processes, 3) NGS data need to be compared to many reference databases, 4) the resulting datasets of alignments must be visualized in different ways, and 5) scientific knowledge is gained when several aligned datasets are compared.

Next, we are reminded that NGS data are analyzed in three phases: primary analysis, secondary analysis, and tertiary analysis. Primary analysis is the step that converts images to reads consisting of basecalls (or colors, or flowgrams) and quality values. In secondary analysis, reads are aligned to reference data (mapped) or amongst themselves (assembled). Secondary analysis produces tables of alignments that must be compared to one another, in tertiary analysis, to gain scientific insights.

Finally, GSAE is introduced as a platform for scalable data analysis. GSAE’s key features and advantages are listed along with several screen shots to show the many ways in which analyzed data can be presented to gain scientific insights.  

Section 2 introduces the RNA-Seq data used for the presentation. These data, from a study that set out to measure sex and lineage specific alternative splicing in primates [1], were obtained from the Gene Expression Omnibus (GEO) database at NCBI, transferred into GSAE, and processed through GSAE’s RNA-Seq analysis pipelines.  We chose this study because it models a proper expression analysis using replicated samples to compare different cases.

All steps of the process, from loading the data to processing the files and viewing results were executed through GSAE’s web-based interfaces. The four general steps of the process are outlined in the box labeled “Steps.” 

The section ends with screen shots from GSAE showing how the primary data can be viewed and a list of the reports showing different alignment results for each sample in the list. The reports are accessed from a “Navigation Panel” that contains links to Alignment Summaries, a Filter Report, and a Searchable Sortable Gene List (shown), and several other reports (not shown). 

The Alignment Summary provides information about the numbers of reads mapping to the different reference data sources used in the analysis, to understand sample quality and study biology. For example, in RNA-Seq, it is important to measure and filter reads matching ribosomal RNA (rRNA) because the amount of rRNA present indicates how well cleanup procedures worked. Similarly, the number of reads matching adaptors indicates how well the library was prepared. Biological, or discovery-based, filters include reads matching novel exon junctions and intergenic regions of the genome.
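
As an aside, the bookkeeping behind such a summary is easy to picture: tally reads by the category of reference they match and report percentages. The sketch below assumes a hypothetical two-column file of read IDs and categories; the file name, format, and category names are invented for illustration:

    from collections import Counter

    # Hypothetical input: read_id <tab> category (rRNA, adapter, exon_junction,
    # intergenic, gene, unmapped). The format is illustrative only.
    counts = Counter()
    with open("alignment_categories.tsv") as handle:
        for line in handle:
            read_id, category = line.rstrip("\n").split("\t")
            counts[category] += 1

    total = sum(counts.values())
    for category, n in counts.most_common():
        print(f"{category}\t{n}\t{100.0 * n / total:.1f}%")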

Other reports like the Filter Report and Gene List provide additional detail. The Filter Report clusters alignments and plots base coverage (read density) across genomic regions. Some regions, like mitochondrial DNA, and rRNA genes, or transcripts, are annotated. Others are not. These regions can be used to identify areas of novel transcription activity.

The Gene List provides the most detail and gives a comprehensive overview of the number of reads matching a gene, the number of normalized reads, and the counts of novel splices, single nucleotide variants (SNVs), and small insertions and deletions (indels). Read densities are plotted as small graphs to reveal each gene’s exon/intron structure. Additional columns provide the gene’s name and chromosome, and contain links to further details in Entrez. The graphs are linked to the Integrated Gene Viewer to explore the data further. Finally, the Gene List is an interactive report that can be searched, sorted, and filtered in different ways, so you can easily view the details of your gene or chromosome of interest.
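
For readers new to "normalized reads": a common normalization is RPKM (reads per kilobase of transcript per million mapped reads), which corrects raw counts for gene length and sequencing depth. A minimal sketch with made-up numbers follows; it illustrates the general idea rather than GSAE's exact normalization:

    def rpkm(read_count, gene_length_bp, total_mapped_reads):
        """Reads per kilobase of transcript per million mapped reads."""
        return read_count * 1.0e9 / (gene_length_bp * total_mapped_reads)

    # Toy numbers for illustration only
    total_mapped = 20000000
    for gene, (reads, length) in {"ASS1": (5400, 2600), "XIST": (120, 19000)}.items():
        print(gene, round(rpkm(reads, length, total_mapped), 2))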

Section 3 shows how GSAE can be used to measure global gene expression and examine the details for a gene that is differentially expressed between samples. In the case of RNA-Seq, or exon arrays, relative exon levels can be measured to observe genes that are spliced differently between samples. The presented example focuses on the argininosuccinate synthetase 1 (ASS1) gene and compares the expression levels of its transcripts and exons between the six replicated human and primate male samples.

The Gene Summary report shows that ASS1 is down-regulated 1.38-fold in the human samples. More interestingly, the exon usage plot shows that this gene is differentially spliced between the species. Read alignment data supporting this observation are viewed by clicking the “View Exon Data” link below the Exon Usage heat map. This link brings up the Integrated Gene Viewer (IGV) for all six samples. In addition to showing read densities across the gene, IGV also shows the number of reads that span exon junctions as loops with heights proportional to the number of reads mapping to a given junction. In these plots we see that the human samples are missing the second exon, whereas the primate samples show two forms of the transcript. IGV also includes the Entrez annotations and known isoforms for the gene, and the positions of known SNPs from dbSNP. And IGV is interactive; controls at the top of the report and regions within the gene map windows are used to navigate to new locations and zoom in or out of the data presented. When multiple genes are compared, the data are updated for all genes simultaneously.

Section 3 closes with a heat map representing global gene expression for the samples being compared. Expression data are clustered using a 2-way ANOVA with a 5% false discovery rate (FDR) filter. The top half of the hierarchical cluster groups genes that are down-regulated in humans and up-regulated in primates, and the bottom half groups genes that are expressed in the opposite fashion. The differentially expressed genes can also be viewed in Pathway Reports, which show how many genes are up- or down-regulated in a particular Gene Ontology (GO) pathway. Links in these reports navigate to lists of the individual genes or their KEGG pathways. When a KEGG pathway is displayed, the genes that are differentially expressed are highlighted.
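
For context on the 5% FDR filter: the Benjamini-Hochberg procedure is the standard way to control the false discovery rate when thousands of gene-level p-values are tested at once. A minimal, self-contained sketch is below; the p-values are placeholders, and GSAE's exact implementation may differ:

    import numpy as np

    def benjamini_hochberg(pvalues, alpha=0.05):
        """Return a boolean mask of p-values that pass the BH FDR threshold."""
        p = np.asarray(pvalues)
        m = len(p)
        order = np.argsort(p)
        thresholds = alpha * (np.arange(1, m + 1) / m)
        passed_sorted = p[order] <= thresholds
        # Reject all hypotheses up to the largest passing rank.
        k = np.max(np.nonzero(passed_sorted)[0]) + 1 if passed_sorted.any() else 0
        mask = np.zeros(m, dtype=bool)
        mask[order[:k]] = True
        return mask

    pvals = [0.0004, 0.012, 0.03, 0.2, 0.5]  # placeholder p-values
    print(benjamini_hochberg(pvals, alpha=0.05))
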
Section 4 focuses on the sex-specific differences in gene expression between the human and primate samples. In this example, 12 samples are being compared: three replicates each for the male and female samples of two species. When these data were reanalyzed in GSAE, we noted an obvious mistake. By examining Y chromosome gene expression, it was clear that one of the human male samples (M2-2) lacked expression of these genes. Similarly, when the X (inactive)-specific transcript (XIST) was examined, M2-2 showed high expression, like the female samples. The simplest explanations for these observations are that either M2-2 is a female sample, or a dataset was copied and mislabeled in GEO. However, given that the 12 datasets all show subtle differences from one another, they are likely all distinct, making the first explanation the more probable one.
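
The check that exposed the problem is simple enough to sketch. Given a table of normalized expression values, a labeled male sample with near-zero Y-chromosome gene expression and high XIST expression stands out immediately. The gene list, cutoffs, and numbers below are invented for illustration and are not the values from the study:

    # Illustrative only: gene list, cutoffs, and expression values are made up.
    Y_GENES = ["RPS4Y1", "DDX3Y", "KDM5D", "UTY"]

    expression = {  # normalized expression per sample (toy numbers)
        "M2-1": {"RPS4Y1": 85, "DDX3Y": 40, "KDM5D": 22, "UTY": 18, "XIST": 1},
        "M2-2": {"RPS4Y1": 0.2, "DDX3Y": 0.1, "KDM5D": 0.3, "UTY": 0.2, "XIST": 95},
    }

    for sample, labeled_sex in [("M2-1", "male"), ("M2-2", "male")]:
        values = expression[sample]
        y_mean = sum(values[g] for g in Y_GENES) / len(Y_GENES)
        inferred = "male" if y_mean > 1 and values["XIST"] < 10 else "female"
        if inferred != labeled_sex:
            print(f"{sample}: labeled {labeled_sex}, expression looks {inferred}")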

The poster closes with a sidebar showing how GSAE can be used to measure global sequence variation, and with the take home points for the presentation. The most significant is that if the authors of the paper had used a system like GSAE, they could have quickly spotted the problems in their data that we saw and avoided the mistake.

To see how you can use GSAE for your data, sign up for a trial.

1. Blekhman R, Marioni JC, Zumbo P, Stephens M, Gilad Y. Sex-specific and lineage-specific alternative splicing in primates. Genome Res. Published online December 15, 2009.