Multi-sample isoform quantification from RNA-seq.
A. E. Byrnes (1,2), J. B. Maller (1,2), A. R. Sanders (3,4), J. Nemesh (2), T. Sullivan (2), H. H. Göring (5), J. Duan (3,4), W. Moy (3), E. I. Drigalenko (5), P. V. Gejman (3,4), B. M. Neale (1,2)
1) Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA; 2) Broad Institute of MIT and Harvard, Cambridge, MA; 3) Department of Psychiatry and Behavioral Sciences, NorthShore University Health System, Evanston, IL; 4) Department of Psychiatry and Behavioral Neuroscience, University of Chicago, Chicago, IL; 5) Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX.
Alternative splicing is critical for the regulation and diversity of the majority of human genes (Wang et al., 2008). The ability of a single gene to give rise to several diverse transcripts, and subsequently proteins, has been implicated in a wide range of processes and disorders, from brain function to cancer proliferation (David et al., 2010; Blencowe, 2005). Recent advances in RNA-seq technologies, together with their falling cost, present an unprecedented opportunity to investigate splicing on a transcriptome-wide scale across many samples. However, the relatively short read lengths do not provide perfect information as to which transcript isoforms are present. Here we present a systematic comparison of several existing methods for transcript assembly and quantification from RNA-seq data, including Cufflinks (Roberts et al., 2011), RSEM (Li and Dewey, 2011) and PSGInfer (LeGault and Dewey, 2013). We also propose a two-step, multi-sample method for the discovery and quantification of transcript isoforms (both known and novel) from paired-end RNA-seq data that makes use of a reference genome and any available annotation. In the first step, our method aims to maximize information about splicing behavior by combining all aligned RNA-seq samples to construct a graph representing all possible transcripts, similar to the approaches taken by PSGInfer and Cufflinks but applied to all samples pooled together. In building the graph, we weight each possible junction between exons by the number of junction reads observed across all samples. We represent each isoform as a possible path through the graph and then use the weight of each edge as the initial probability in the following step. In the second step, after constructing all likely isoforms from the data, we use the expectation-maximization algorithm to estimate the relative abundance of these isoforms and of any known isoforms, similar to the methods applied in RSEM and eXpress.
This second step allows us to characterize the specific isoforms present in any individual and to quantify the expression of each isoform separately for each sample. We will discuss the details of this method as well as the relative performance of all the above methods on real and simulated data. Our results have clear implications for future analyses of alternative splicing.
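The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it (`enumerate_paths`, `em_abundance`, the toy exon labels and read counts) is hypothetical. Two simplifying assumptions are made that the abstract does not specify: the initial probability of a path is taken to be the product of its junction weights, each normalized over the outgoing junctions of the same exon, and the EM step ignores effective transcript length and fragment-length distributions, which tools such as RSEM model explicitly.

```python
from collections import defaultdict


def enumerate_paths(edges, source, sink):
    """Enumerate source-to-sink paths in an acyclic splice graph.

    edges maps (exon_u, exon_v) to the junction read count pooled over all
    samples. Returns a list of (path, prior) pairs. The prior is an assumed
    choice: the product of junction weights normalized per source exon.
    """
    out = defaultdict(list)
    for (u, v), weight in edges.items():
        out[u].append((v, weight))

    paths = []

    def walk(node, path, prob):
        if node == sink:
            paths.append((tuple(path), prob))
            return
        total = sum(w for _, w in out[node])
        for nxt, w in out[node]:
            if w > 0:  # skip junctions with no supporting reads
                walk(nxt, path + [nxt], prob * w / total)

    walk(source, [source], 1.0)
    return paths


def em_abundance(compat, init, n_iter=200):
    """Estimate relative isoform abundances for one sample with EM.

    compat[r] is the set of isoform indices that read r is compatible with.
    init holds the starting abundances (for example, the path priors).
    Simplification: no effective-length or fragment-length modelling.
    """
    k = len(init)
    start = sum(init)
    theta = [x / start for x in init]
    for _ in range(n_iter):
        counts = [0.0] * k
        # E-step: split each read among compatible isoforms in proportion to theta.
        for iso_set in compat:
            denom = sum(theta[i] for i in iso_set)
            if denom == 0:
                continue
            for i in iso_set:
                counts[i] += theta[i] / denom
        total = sum(counts)
        if total == 0:
            break
        # M-step: renormalize the expected read counts.
        theta = [c / total for c in counts]
    return theta


# Toy gene with exons A, B, C, D; exons B and C are mutually exclusive.
toy_edges = {("A", "B"): 30, ("A", "C"): 10, ("B", "D"): 30, ("C", "D"): 10}
toy_paths = sorted(enumerate_paths(toy_edges, "A", "D"))
toy_priors = [prior for _, prior in toy_paths]

# Toy reads for one sample: 6 unique to isoform 0, 2 unique to isoform 1, 4 shared.
toy_reads = [{0}] * 6 + [{1}] * 2 + [{0, 1}] * 4
toy_theta = em_abundance(toy_reads, toy_priors)
```

In this sketch the graph is built once from the pooled junction counts, while `em_abundance` would be run once per sample on that sample's reads, mirroring the split between multi-sample discovery and per-sample quantification.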
You may contact the first author (during and after the meeting) at