Thursday, April 22, 2010

Bloginar: RNA Deep Sequencing: Beyond Proof of Concept

RNA-Seq is a powerful method for measuring gene expression because you can use the deep sequence data to measure transcript abundance and also determine how transcripts are spliced and whether alleles of genes are expressed differentially.  

At this year’s ABRF (Association of Biomolecular Resource Facilities) conference, we presented a poster, using data from a published study, to demonstrate how GeneSifter Analysis Edition (GSAE) can be used in next generation DNA sequencing (NGS) assays that seek to compare gene expression and alternative splicing between different tissues, conditions, or species.

The following map guides the presentation. The poster has a title and four main sections, which cover background information, introduction to the published work and data, ways to observe alternative splicing and global gene expression differences between samples, and ways to observe sex specific gene expression differences. The last section also identifies a mistake made by the authors.  


Section 1. The first section begins with the abstract and lists five specific challenges created by NGS: 1) high end computing infrastructures are needed to work with NGS data, 2) NGS data analysis involves complex multistep processes, 3) NGS data need to be compared to many reference databases, 4) the resulting datasets of alignments must be visualized in different ways, and 5) scientific knowledge is gained when several aligned datasets are compared. 

Next, we are reminded that NGS data are analyzed in three phases: primary analysis, secondary analysis, and tertiary analysis. Primary analysis is the step that converts images to reads consisting of basecalls (or colors, or flowgrams) and quality values. In secondary analysis, reads are aligned to reference data (mapped) or amongst themselves (assembled). Secondary analysis produces tables of alignments that must be compared to one another, in tertiary analysis, to gain scientific insights.

Finally, GSAE is introduced as a platform for scalable data analysis. GSAE’s key features and advantages are listed along with several screen shots to show the many ways in which analyzed data can be presented to gain scientific insights.  

Section 2 introduces the RNA-Seq data used for the presentation. These data, from a study that set out to measure sex and lineage specific alternative splicing in primates [1], were obtained from the Gene Expression Omnibus (GEO) database at NCBI, transferred into GSAE, and processed through GSAE’s RNA-Seq analysis pipelines.  We chose this study because it models a proper expression analysis using replicated samples to compare different cases.

All steps of the process, from loading the data to processing the files and viewing results, were executed through GSAE’s web-based interfaces. The four general steps of the process are outlined in the box labeled “Steps.”

The section ends with screen shots from GSAE showing how the primary data can be viewed and a list of the reports showing different alignment results for each sample in the list. The reports are accessed from a “Navigation Panel” that contains links to Alignment Summaries, a Filter Report, and a Searchable Sortable Gene List (shown), and several other reports (not shown). 

The Alignment Summary provides information about the numbers of reads mapping to different reference data sources that are used in the analysis to understand sample quality and study biology. For example, in RNA-Seq, it is important to measure and filter reads matching ribosomal RNA (rRNA) because the amount of rRNA present indicates how well clean up procedures work. Similarly, the number of reads matching adaptors indicates how well the library was prepared. Biological, or discovery based, filters include reads matching novel exon junctions and intergenic regions of the genome.
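
As a rough illustration of the kind of arithmetic behind an alignment summary, the sketch below turns per-category read counts into percentages. The category names and counts are invented for this example and are not taken from the poster or from GSAE itself.

```python
# Hypothetical read counts per reference category for one sample.
# All numbers are invented for illustration.
counts = {
    "genome_exons": 8_200_000,
    "novel_junctions": 150_000,
    "intergenic": 400_000,
    "rRNA": 900_000,
    "adaptor": 50_000,
    "unmapped": 300_000,
}


def alignment_summary(counts):
    """Return the percentage of all reads that fall in each category."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


summary = alignment_summary(counts)

# Print categories from most to least abundant.
for name, pct in sorted(summary.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s}{pct:6.2f}%")
```

A high rRNA percentage here would point to a clean up problem, and a high adaptor percentage to a library preparation problem, as described above.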

Other reports like the Filter Report and Gene List provide additional detail. The Filter Report clusters alignments and plots base coverage (read density) across genomic regions. Some regions, like mitochondrial DNA, and rRNA genes, or transcripts, are annotated. Others are not. These regions can be used to identify areas of novel transcription activity.

The Gene List provides the most detail and gives a comprehensive overview of the number of reads matching a gene, numbers of normalized reads, and the counts of novel splices, single nucleotide variants (SNVs), and small insertions and deletions (indels). Read densities are plotted as small graphs to reveal each gene’s exon/intron structure. Additional columns provide the gene’s name and chromosome, and contain links to further details in Entrez. The graphs are linked to the Integrated Gene Viewer to explore the data further. Finally, the Gene List is an interactive report that can be searched, sorted, and filtered in different ways, so you can easily view the details of your gene or chromosome of interest.
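
The poster does not say how the normalized read counts are computed. One common choice for RNA-Seq is RPKM (reads per kilobase of transcript per million mapped reads), and the sketch below assumes that method purely for illustration. The function name and the example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Assumed normalization for illustration only; GSAE may use a
    different method.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp transcript,
# in a sample with 10 million mapped reads.
example_value = rpkm(500, 2_000, 10_000_000)
print(example_value)
```

Normalizing by both gene length and sequencing depth is what lets read counts be compared between genes and between samples.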

Section 3 shows how GSAE can be used to measure global gene expression and examine the details for a gene that is differentially expressed between samples. In the case of RNA-Seq, or exon arrays, relative exon levels can be measured to observe genes that are spliced differently between samples. The presented example focuses on the argininosuccinate synthetase 1 (ASS1) gene and compares the expression levels of its transcripts and exons between the six replicated human and primate male samples.

The Gene Summary report shows that ASS1 is down regulated 1.38-fold in the human samples. More interestingly, the exon usage plot shows that this gene is differentially spliced between the species. Read alignment data supporting this observation are viewed by clicking the “View Exon Data” link below the Exon Usage heat map. This link brings up the Integrated Gene Viewer (IGV) for all six samples. In addition to showing read densities across the gene, IGV also shows the reads that span exon junctions as loops with heights proportional to the number of reads mapping to a given junction. In these plots we see that the human samples are missing the second exon whereas the primate samples show two forms of the transcript. IGV also includes the Entrez annotations and known isoforms for the gene and the positions of known SNPs from dbSNP. IGV is also interactive; controls at the top of the report and regions within the gene map windows are used to navigate to new locations and zoom in or out of the data presented. When multiple genes are compared, the data are updated for all genes simultaneously.
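
To make the two measurements in this example concrete, here is a minimal sketch of a fold change between replicate groups and of relative exon usage within a gene. The expression values and exon read counts are invented (the expression values were chosen so the result matches the 1.38-fold figure), and the poster does not describe how GSAE computes either quantity.

```python
from statistics import mean


def fold_change(reference, test):
    """Compare group means; return a direction and a magnitude >= 1."""
    ref_mean, test_mean = mean(reference), mean(test)
    if test_mean >= ref_mean:
        return "up", test_mean / ref_mean
    return "down", ref_mean / test_mean


def exon_usage(exon_reads):
    """Fraction of a gene's exon reads that fall on each exon."""
    total = sum(exon_reads)
    if total == 0:
        return [0.0 for _ in exon_reads]
    return [n / total for n in exon_reads]


# Invented normalized expression values, three replicates per group.
primate_expr = [13.8, 14.0, 13.6]
human_expr = [10.0, 11.0, 9.0]
direction, magnitude = fold_change(primate_expr, human_expr)

# Invented per-exon read counts for a four-exon gene.
human_usage = exon_usage([120, 0, 130, 150])    # no reads on exon 2
primate_usage = exon_usage([110, 60, 125, 145])
```

A gene can show a modest overall fold change while one exon's share of the reads drops to zero, which is the pattern that suggests differential splicing.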

Section 3 closes with a heat map representing global gene expression for the samples being compared. Expression data are clustered using a 2-way ANOVA with a 5% false discovery rate (FDR) filter. The top half of the hierarchical cluster groups genes that are down regulated in humans and up regulated in primates, and the bottom half groups genes that are expressed in the opposite fashion. The differentially expressed genes can also be viewed in Pathway Reports, which show how many genes are up or down regulated in a particular Gene Ontology (GO) pathway. Links in these reports navigate to lists of the individual genes or their KEGG pathways. When a KEGG pathway is displayed, the genes that are differentially expressed are highlighted.
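
The poster does not name the method behind the 5% FDR filter. The sketch below assumes the widely used Benjamini-Hochberg procedure, applied to a list of per-gene p-values such as those an ANOVA would produce. The p-values are invented.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return one boolean per p-value: True if it passes the FDR cut.

    Assumes the Benjamini-Hochberg step-up procedure; the source does
    not say which FDR method GSAE applies.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])

    # Find the largest rank k whose p-value satisfies p <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # Every test ranked at or below the cutoff is kept.
    passed = [False] * m
    for idx in order[:cutoff_rank]:
        passed[idx] = True
    return passed


# Invented p-values for five genes, in gene order (not sorted).
gene_pvalues = [0.039, 0.001, 0.216, 0.008, 0.06]
significant = benjamini_hochberg(gene_pvalues, fdr=0.05)
```

Note that the gene with p = 0.039 would pass an uncorrected 0.05 threshold but is removed once multiple testing is taken into account.
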

Section 4 focuses on the sex specific differences in gene expression between the human and primate samples. In this example, 12 samples are being compared: three replicates for the male and female samples of two species. When these data were reanalyzed in GSAE, we were able to note an obvious mistake. By examining Y chromosome gene expression, it was clear that one of the human male samples (M2-2) was lacking expression of these genes. Similarly, when the X (inactive)-specific transcript (XIST) was examined, M2-2 showed high expression like the female samples. The simplest explanations for these observations are that either M2-2 is a female, or a dataset was copied and mislabeled in GEO. However, because the 12 datasets all show subtle differences from one another, a duplicated dataset is unlikely, which makes the first explanation the more probable one.
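
The same check can be sketched in a few lines of code: compare each sample's labeled sex against what its Y-linked gene and XIST expression suggest. Everything in this sketch is an assumption made for illustration, including the sample names, the expression values, the choice of Y-linked genes, and the threshold. It is not the procedure GSAE uses.

```python
# Two Y-linked genes used as examples; the choice is illustrative.
Y_GENES = ("RPS4Y1", "DDX3Y")


def infer_sex(expr, min_expr=1.0):
    """Guess sex from normalized expression of Y-linked genes and XIST."""
    y_signal = sum(expr.get(gene, 0.0) for gene in Y_GENES)
    xist = expr.get("XIST", 0.0)
    if y_signal >= min_expr and xist < min_expr:
        return "male"
    if xist >= min_expr and y_signal < min_expr:
        return "female"
    return "ambiguous"


# Invented samples: (labeled sex, normalized expression values).
samples = {
    "male_A": ("male", {"XIST": 0.1, "RPS4Y1": 35.0, "DDX3Y": 12.0}),
    # Labeled male but behaves like the M2-2 sample described above.
    "male_B": ("male", {"XIST": 48.0, "RPS4Y1": 0.0, "DDX3Y": 0.2}),
    "female_A": ("female", {"XIST": 52.0, "RPS4Y1": 0.0, "DDX3Y": 0.0}),
}

mismatches = [
    name for name, (label, expr) in samples.items()
    if infer_sex(expr) != label
]
print(mismatches)
```

Running a check like this before any differential expression analysis would flag a mislabeled sample early.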

The poster closes with a sidebar showing how GSAE can be used to measure global sequence variation, followed by the take home points for the presentation. The most significant is that if the authors of the paper had used a system like GSAE, they could have quickly observed the problems in their data that we saw and prevented a mistake.

To see how you can use GSAE for your data, sign up for a trial.

1. Sex-specific and lineage-specific alternative splicing in primates. Blekhman R, Marioni JC, Zumbo P, Stephens M, Gilad Y. Genome Res. Published online December 15, 2009.
