FinchTalk: small RNA


Wednesday, April 6, 2011

Sneak Peek: RNA-Sequencing Applications in Cancer Research: From fastq to differential gene expression, splicing and mutational analysis

Join us next Tuesday, April 12 at 10:00 am PST for a webinar focused on RNA-Seq applications in breast cancer research.

The field of cancer genomics is advancing quickly. News reports from the annual American Association for Cancer Research meeting indicate that whole genome sequencing studies, such as the 50 breast cancer genomes (WashU), are providing more clues about the genes that may be affected in cancer. Meanwhile, the ACLU/Myriad Genetics legal action over genetic testing for breast cancer mutations and disease predisposition continues to move towards the Supreme Court.

Breast cancer, like many other cancers, is complex. Sequencing genomes is one way to interrogate cancer biology. However, the genome sequence data in isolation does not tell the complete story. The RNA, representing expressed genes, their isoforms, and non-coding RNA molecules, needs to be measured too. In this webinar, Eric Olson, Geospiza's VP of product development and principal designer of GeneSifter Analysis Edition, will explore the RNA world of breast cancer and present how you can explore existing data to develop new insights.

Abstract
Next Generation Sequencing applications allow biomedical researchers to examine the expression of tens of thousands of genes at once, giving researchers the opportunity to examine expression across entire genomes. RNA Sequencing applications such as Tag Profiling, Small RNA and Whole Transcriptome Analysis can identify and characterize both known and novel transcripts, splice junctions and non-coding RNAs. These sequencing-based applications also allow for the examination of nucleotide variants. Next Generation Sequencing and these RNA applications allow researchers to examine the cancer transcriptome at an unprecedented level. This presentation will provide an overview of the gene expression data analysis process for these applications with an emphasis on identification of differentially expressed genes, identification of novel transcripts and characterization of alternative splicing as well as variant analysis and small RNA expression. Using data drawn from the GEO data repository and the Short Read Archive, NGS Tag Profiling, Small RNA and NGS Whole Transcriptome Analysis data will be examined in Breast Cancer.

You can register at the webex site, or view the slides after the presentation.

Wednesday, September 29, 2010

A Genomics Genealogy

Deep sequencing technologies have radically changed how we study biology. Deciding what technology and software to use can be daunting. Choices become easier when the relationships between different DNA sequencing applications are understood.

A brief history

DNA sequencing grew from our desire to understand how the instructions for the biochemistry of life are encoded in an organism’s DNA. If we know the precise ordering and organization of an organism’s DNA sequence, we can presumably unlock a code that reveals these instructions. Accomplishing this goal required the creation of a new field, molecular biology, and new technologies to sequence genes.

The first sequencing methods were arduous. They combined nuclease digestion with thin layer chromatography to measure di- and trinucleotides that could be puzzled together. Later, Maxam and Gilbert replaced enzymatic DNA degradation with a chemical fragmentation method that enabled the reading of ordered bases from ³²P labeled fragments separated by electrophoresis.

The Sanger method, which used dideoxynucleotide triphosphates to create ensembles of DNA molecules terminated at each base, soon replaced Maxam-Gilbert sequencing. The next innovation was to color code DNA with fluorescent dyes so that molecules could be interrogated with a laser and camera coupled to a computer. This innovation led to automated “high-throughput” DNA sequencing systems, initially with polyacrylamide gels and later with capillary electrophoresis, and made it possible to sequence the human and other genomes. It also created the first transcriptome analysis method, Expressed Sequence Tag (EST) sequencing.

Despite 20 years of advances, however, the high-throughput sequencing methods were not high-enough-throughput to realistically interrogate DNA and RNA molecules in creative ways. Big questions (genomes, ESTs, meta-genomes) required large factory-like approaches to automate sample preparation and collect sequences because a fundamental problem had yet to be solved. Specifically, each sequence was obtained from an individual purified DNA clone or PCR product.

Real high-throughput is massively parallel throughput

The next-generation DNA sequencing (NGS) technologies free researchers from the need to clone or purify every molecule. They all share the common innovation that DNA sequencing is performed in a massively parallel format. That is, a library, or ensemble of millions of DNA molecules, is sequenced simultaneously. Data collection costs are dramatically decreased through miniaturization and by eliminating the need for warehouses of colony pickers, prep robots, sequencing instruments, and large teams of people.

The new problem is dealing with the data that are produced and increasing computation costs. As NGS opens new possibilities to measure DNA and RNA in novel ways, each application requires a specific laboratory procedure that must be coupled to a specific analysis methodology.

Sequencing genealogy is defined by the questions

In an evolutionary model, the history of cloning, restriction site mapping, and Sanger sequencing forms the trunk of the genomics application tree (top figure) from which branches develop as new applications emerge.

NGS has driven the evolution of three main sequencing branches: De Novo, Functional Genomics, and Variation Assays. The De Novo, or Exploratory, sequencing branch contains three subbranches: new genomes (projects that seek to determine a complete genome sequence of an organism), meta-genomes (projects in which DNA fragments are sequenced from environmental samples), and meta-transcriptomes (projects where cDNA fragments are sequenced from environmental samples).

The Functional Genomics branch is growing fast. In these experiments, different collections of RNA or DNA molecules from an organism, tissue, or cells are isolated and sequenced to measure gene expression and how it is regulated. Three subbranches describe the different kinds of functional genomics: Expression, Regulation, and EpiGenomics. Each of these subbranches can be further divided into specific assay groups (DGE, RNA-Seq, small RNA, etc.) that can be subdivided even further into specialized procedures (such as RNA-Seq with strandedness preserved) defined by laboratory protocols, kits, and instruments. When the experiments are refined and made reproducible, they become assays.

Variation Assays form the third main branch of the tree. Genomic sequences are compared within and between populations to link genotype and phenotype. In special cases like cancer and immunology research, variation assays are used to observe changes within an organism’s somatic genomes over time. Today, variation, or resequencing, assays measure nucleotide substitutions and small insertions and deletions in whole genomes and exomes. If linked sequence strategies (mate-pairs, paired-ends) are used, larger structural changes including copy number variations can also be measured.

Why is this important?

As a software provider with both deep lab and analysis experience, we [Geospiza] are often asked which instrument platform is the best or how our software stacks up against other available options. The answer, of course, depends on what you want to do. De Novo applications benefit from long reads offered by platforms like 454. Many of the assay-based applications demand ultra-deep sequencing with very high numbers of sequences (reads) as provided by the short-read platforms (Illumina, SOLiD). New single molecule sequencing platforms like PacBio's are targeting a wide range of applications but have best been demonstrated, thus far, for long-read uses and novel methylation assays.

From an informatics perspective, the exploratory and assay-based branches have distinct software requirements. Exploratory applications require that reads be assembled into contigs that must be further ordered into scaffolds to get to the complete sequence. In meta-genomics or meta-transcriptomics applications, data are assembled to obtain gene sequences. These projects are further complicated by orthologous and paralogous sequences and by highly expressed genes that overrepresent certain sequences. In these situations, specialized hardware or complex data reduction strategies are needed to make assembly practical. Once data are assembled, they are functionally annotated in a second computational phase using tools like BLAST.

Assay-based data analysis also has two distinct phases, but they are significantly different from De Novo sequencing. The first phase involves aligning (or mapping) reads to reference data sources and then reducing the aligned data into quantitative values. At least one reference is required and the better it is annotated the more informative the initial results will be. Alignment differs from assembly in that reads are separately compared to a reference rather than amongst themselves. Alignment processing capacity can be easily scaled with multiple inexpensive computers whereas assembly processing cannot.
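The reduction step in that first phase can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in any particular product: it assumes reads have already been aligned and are available as (chromosome, position) tuples, and that the annotated reference is a simple table of gene intervals. The names `count_reads_per_gene`, `genes`, and `alignments` are hypothetical.

```python
from collections import Counter

def count_reads_per_gene(alignments, genes):
    """Reduce aligned reads to one count per annotated gene.

    alignments: iterable of (chrom, pos) tuples, one per aligned read
                (an assumed input format for this sketch).
    genes: dict mapping gene name -> (chrom, start, end), inclusive.
    """
    # Start every gene at zero so unexpressed genes still appear in the output
    counts = Counter({name: 0 for name in genes})
    for chrom, pos in alignments:
        # Each read is compared to the reference annotation on its own,
        # never to the other reads
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return dict(counts)

# Hypothetical annotation and alignments
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 500, 900)}
alignments = [("chr1", 150), ("chr1", 180), ("chr1", 600), ("chr2", 150)]
counts = count_reads_per_gene(alignments, genes)
```

Because every read is handled independently, the list of alignments can be split across machines and the per-gene counts summed afterwards, which is why this phase scales with inexpensive computers in a way that assembly does not.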

The second phase of Assay-based sequencing is to produce a discrete output as defined by a diagnostic application, or compare the quantitative values computed from the alignments from several samples, obtained from different individuals and (or) treatments relative to controls. This phase requires statistical tools to normalize data, filter false positives and negatives, and measure differences. Assay-based applications become more informative when large numbers of samples and replicates are included in a study.
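As a rough sketch of the comparison step, the Python below normalizes raw counts to reads per million and computes a log2 fold change between one treated sample and one control. It is deliberately simplified: reads per million and a pseudocount are assumptions chosen for illustration, the function names are hypothetical, and a real study would use replicates and a statistical test rather than a single ratio.

```python
import math

def reads_per_million(counts):
    """Scale raw counts so samples sequenced to different depths are comparable."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}

def log2_fold_change(treated, control, pseudocount=1.0):
    """Compare normalized values gene by gene.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for genes with no reads in the control.
    """
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
    }

# Hypothetical per-gene counts from two samples with different total depth
control_counts = {"geneA": 100, "geneB": 300, "geneC": 600}
treated_counts = {"geneA": 400, "geneB": 600, "geneC": 1000}

control_rpm = reads_per_million(control_counts)
treated_rpm = reads_per_million(treated_counts)
fold_changes = log2_fold_change(treated_rpm, control_rpm)
```

In this made-up example the raw count for geneB doubles, but so does the sequencing depth, so after normalization its fold change is zero. That is the reason normalization comes before any comparison.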

Connecting the dots

While the sequencing applications can be grouped and summarized in different ways, they are also interrelated. For example, De Novo projects are open-ended and exploratory, but their end product, a well-annotated reference sequence, is the foundation for Functional Genomics and Variation applications. Variation analysis is only useful if we can assign function to specific genotypes. Functional assignments come, in part, from previous experiments and genomic annotations, but are increasingly being produced by sequencing assays, so the new challenge is integrating the data obtained from different assays into coherent datasets that can link many attributes to a set of genotypes.

NGS clearly opens new possibilities for studying and characterizing biological systems. Different applications require different sequencing platforms, laboratory procedures, and software systems that can organize analysis tools and automate data processing. On this last point, as researchers evaluate their projects and their options for success, they need to identify informatics groups that have deep experience, available solutions, and strong capabilities to meet the next challenges. Geospiza is one such group.

Further Reading

DNA Sequencing History

Gilbert W, Maxam A (1973) The nucleotide sequence of the lac operator. Proc Natl Acad Sci U S A 70:3581

Maxam AM, Gilbert W (1977) A new method for sequencing DNA. Proc Natl Acad Sci U S A 74:560

Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74:5463-7

Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SB, Hood LE (1986) Fluorescence detection in automated DNA sequence analysis. Nature 321:674-9

Adams MD, Soares MB, Kerlavage AR, Fields C, Venter JC (1993) Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4:373-80

International Human Genome Sequencing Consortium, 2001. “Initial sequencing and analysis of the human genome.” Nature 409, 860-921.

Venter J.C., Adams M.D., Myers E.W., et al. 2001. “The sequence of the human genome.” Science 291, 1304-1351.

FinchTalks

From Reads to Datasets Why Next Gen is Not Like Sanger
Expeditiously Exponential: Genome Standards in a New Era

Next Gen DNA Sequencing Is Not Sequencing DNA
Color Space, Flow Space, Sequence Space or Outer Space: Part II, Uncertainty in Next Gen Data

Wednesday, February 3, 2010

Sneak Peek: Data Analysis Methods for Whole Transcriptome Sequencing Applications – Challenges and Solutions

RNA sequencing is one of the most popular Next Generation Sequencing (NGS) applications. Next Thursday, February 11 at 10:00 A.M. PDT (1:00 P.M. EDT), we kick off our 2010 webinar series with a presentation designed to help you understand whole transcriptome data analysis and what can be learned in these experiments. In addition, we will show off some of our latest tools and interfaces that can be used to discover new RNAs, new splice forms of transcripts, and alleles of expressed genes.

Summary

RNA sequencing applications such as Whole Transcriptome Analysis, Tag Profiling and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 500 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample (gene level expression summaries, exon usage, splice junctions, single nucleotide variants, insertions and deletions), these applications are also ideal for the identification of novel RNAs as well as novel splicing events.

This presentation will provide an overview of Whole Transcriptome data analysis workflows with emphasis on calculating gene and exon level expression values as well as identifying splice junctions and variants from short read data. Comparisons of multiple groups to identify differential gene expression as well as differential splicing will also be discussed. Using data drawn from the GEO data repository and Short Read Archive (SRA), analysis examples will be presented for both Illumina’s GA and Lifetech’s SOLiD instruments.

Register Today!

Tuesday, May 12, 2009

Small RNAs Get Smaller

Tiny RNAs recently joined the growing list of non-coding RNA (ncRNA) molecules [1]. Their exact function is not understood, but they are possibly a new class of ncRNA and appear to be most associated with transcription of highly expressed genes in humans, chickens, and Drosophila, and possibly other organisms.

This was the conclusion of work published in the May issue of Nature Genetics. Remember when all we had to worry about was the central dogma? DNA was transcribed into RNA and RNA was translated into protein. Life was so simple.

Not really. Even as the first genetic code was being elucidated [2], the possibility of uncovering the second code, the one that translates nucleic acid sequences into protein sequences, was being contemplated [3]. Translating RNA into protein required other kinds of RNA that became known as ribosomal (rRNA) and transfer RNA (tRNA). The RNA between DNA and protein became messenger (m)RNA. In the late seventies, introns were discovered [4,5] and soon to follow were small nuclear (sn)RNAs and “snurps” (small nuclear ribonucleoproteins). The snRNAs were further characterized as small nucleolar (snoRNA) and Cajal body-specific (scaRNA) RNAs, and a class of new molecules was investigated for its involvement in mRNA splicing.

As the mechanisms for splicing were being worked out, researchers were able to prove that RNA could also be an enzyme [6]. In this case, the intron is the enzyme responsible for splicing itself out to create the mature mRNA. At the same time, another group discovered that the catalytic unit of RNase P, an enzyme involved in converting precursor tRNAs into active tRNAs, is also RNA [7]. Indeed, later work revealed that rRNA in the large ribosomal subunit catalyzes the peptidyl transferase reaction to join amino acids together to build proteins [8]. Not only does the central dogma require a multitude of RNA molecules to transcribe DNA into RNA and translate RNA into protein, but the RNA molecules are responsible for carrying the information needed to make proteins and supplying the enzymatic activity to do the work!

What else does RNA do?

More than we can imagine. Starting with the discovery that double stranded RNA (dsRNA) could inhibit gene expression by turning on RNA interference (RNAi) pathways [9], new RNAs were identified, micro (miRNA) and small interfering (siRNA), as essential to the RNAi pathway. miRNA and siRNA were the early members of what would become a large and growing class of RNAs now referred to as non-coding RNAs (ncRNAs).

The ncRNAs represent a next frontier in RNA research and understanding gene expression. Some ncRNAs are large, like lincRNAs (large intervening non-coding RNAs) [10], but most are small, between 18 and 31 nt. Within the small ncRNA group are piwi-interacting (piRNA), repeat associated small interfering (rasiRNA), small temporal (stRNA), and now transcription initiation (tiRNA) RNA. I like tiny RNA.

Tiny, or tiRNAs, were discovered by Next Generation Sequencing (NGS) studies. RNA libraries were prepared from specific size fractions of capped messages. The resulting libraries were sequenced on the Roche FLX Genome Sequencing system and the data were aligned to human genome build 36.1 and compared to transcription start sites (TSS) defined by RefGene (NCBI). The authors reasoned that previous deep-sequencing studies missed these RNA molecules because they tend to be disregarded as low-abundance, spurious, or degradation products. However, because they can be cloned, they must have a 5’ phosphate and, when aligned to genomic sequences, the NGS reads cluster in a non-random fashion around TSSs.
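To make the idea of reads clustering around TSSs concrete, here is a small Python sketch that measures how far each aligned read start falls from the nearest annotated TSS. It is not the method used in the paper: the coordinates are invented, strand is ignored, a single chromosome is assumed, and the 50 nt window and the function name `distance_to_nearest_tss` are arbitrary choices for illustration.

```python
from bisect import bisect_left

def distance_to_nearest_tss(read_pos, sorted_tss):
    """Return the signed distance (read_pos - TSS) to the closest TSS.

    sorted_tss must be a non-empty, ascending list of TSS coordinates
    on the same chromosome as the read.
    """
    i = bisect_left(sorted_tss, read_pos)
    # Only the TSS just before and just after the read can be the nearest
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda tss: abs(read_pos - tss))
    return read_pos - nearest

tss_positions = sorted([1000, 5000, 9000])     # hypothetical TSS coordinates
read_starts = [1012, 1018, 4990, 5020, 7200]   # hypothetical aligned read starts

distances = [distance_to_nearest_tss(p, tss_positions) for p in read_starts]
# Reads within an (arbitrary) 50 nt window of a TSS
near_tss = [d for d in distances if abs(d) <= 50]
```

If reads were scattered at random, only a small fraction would land inside such a window. A strong excess of short distances is the kind of non-random signal described above.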

GeneSifter enables small RNA research

NGS makes it possible to explore the RNA world in new ways by designing experiments to capture small RNA molecules and sequence them in a massively parallel, high throughput format. However, both the experiments and data analysis are technically challenging. Fortunately, GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE) can help. In GSLE you can use the software to track RNA preparation steps and record data at different points of the process. GSAE is accompanied by data analysis pipelines designed to filter artifacts and identify known small RNAs. Post-alignment clustering reports, based on coverage in a genome, can be used to further refine results and discover new RNA species as well. Moreover, you can convert the clustering reports into lists of expression values for these RNAs and compare their expression between different samples, tissues, or experimental conditions.
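The general idea behind coverage-based clustering can be sketched as follows. This is not GeneSifter's implementation, which is not described here; it is a generic interval-merging approach in Python, with a hypothetical function name, invented coordinates, and an arbitrary minimum of two reads per cluster.

```python
def cluster_alignments(intervals, min_reads=2):
    """Merge overlapping aligned reads on one chromosome into clusters.

    intervals: list of (start, end) tuples, inclusive coordinates.
    Returns (start, end, read_count) for clusters with at least min_reads reads.
    """
    clusters = []
    for start, end in sorted(intervals):
        if clusters and start <= clusters[-1][1]:
            # Read overlaps the current cluster: extend it and count the read
            prev_start, prev_end, n = clusters[-1]
            clusters[-1] = (prev_start, max(prev_end, end), n + 1)
        else:
            # No overlap: this read starts a new cluster
            clusters.append((start, end, 1))
    # Drop clusters supported by too few reads
    return [c for c in clusters if c[2] >= min_reads]

# Hypothetical small RNA alignments
reads = [(100, 121), (105, 126), (110, 131), (400, 421), (900, 921), (905, 926)]
clusters = cluster_alignments(reads)
```

The read count attached to each cluster is one simple way to turn a clustering report into an expression value that can then be compared between samples.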

References
1. Taft R.J., Glazov E.A., Cloonan N., et al., 2009. Tiny RNAs associated with transcription start sites in animals. Nat Genet 41, 572-578.

2. Watson J.D., Crick F.H.C., 1953. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171, 737-738.

3. Crick F.H., Barnett L., Brenner S., Watts-Tobin R.J., 1961. General nature of the genetic code for proteins. Nature 192, 1227-1232.

4. Chow L.T., Roberts J.M., Lewis J.B., Broker T.R., 1977. A map of cytoplasmic RNA transcripts from lytic adenovirus type 2, determined by electron microscopy of RNA:DNA hybrids. Cell 11, 819-836.

5. Berk A.J., Sharp P.A., 1977. Sizing and mapping of early adenovirus mRNAs by gel electrophoresis of S1 endonuclease-digested hybrids. Cell 12, 721-732.

6. Zaug A.J., Cech T.R., 1982. The intervening sequence excised from the ribosomal RNA precursor of Tetrahymena contains a 5'-terminal guanosine residue not encoded by the DNA. Nucleic Acids Res 10, 2823-2838.

7. Guerrier-Takada C., Gardiner K., Marsh T., Pace N., Altman S., 1983. The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell 35, 849-857.

8. Nissen P., Hansen J., Ban N., Moore P.B., Steitz T.A., 2000. The structural basis of ribosome activity in peptide bond synthesis. Science 289, 920-930.

9. Fire A., Xu S., Montgomery M.K., Kostas S.A., Driver S.E., Mello C.C., 1998. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391, 806-811.

10. Guttman M., Amit I., Garber M., French C., Lin M.F., Feldser D., Huarte M., Zuk O., Carey B.W., Cassady J.P., Cabili M.N., Jaenisch R., Mikkelsen T.S., Jacks T., Hacohen N., Bernstein B.E., Kellis M., Regev A., Rinn J.L., Lander E.S., 2009. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458, 223-227.

Further Reading
ncRNA - http://nar.oxfordjournals.org/cgi/reprint/35/suppl_1/D178
snoRNA - http://en.wikipedia.org/wiki/SnoRNA
siRNA - http://en.wikipedia.org/wiki/SiRNA
miRNA - http://en.wikipedia.org/wiki/MicroRNA
piRNA - http://en.wikipedia.org/wiki/Piwi-interacting_RNA
rasiRNA - http://en.wikipedia.org/wiki/RasiRNA
stRNA - http://jcs.biologists.org/cgi/content/full/116/23/4689
Ribozymes - http://en.wikipedia.org/wiki/Ribozyme

Databases
miRBASE - http://microrna.sanger.ac.uk/sequences/
RNAdb - http://research.imb.uq.edu.au/rnadb/default.aspx