Statistical model for the joint estimation of mRNA isoforms and individual-specific expression from RNA-seq data. F. Mordelet1, B. E. Engelhardt1,2,3 1) Institute for Genome Science and Policy, Duke University, Durham, NC; 2) Department of Biostatistics & Bioinformatics, Duke University, Durham, NC; 3) Department of Statistical Science, Duke University, Durham, NC.
There are around 21,000 transcribed genes in the human genome that encode somewhere between 250,000 and one million proteins in human cells. A gene in its pre-mRNA form consists of a chain of introns and exons. Only exons are retained in the mature mRNA: introns are removed, or "spliced out", and often one or several exons are spliced out as well, in a variety of different ways. Through this alternative splicing mechanism, the same gene may produce many different protein sequences, often with distinct biological functions. These different products assembled from the same set of exons are called isoforms. Beyond the biological complexity it adds, alternative splicing plays a role in many complex human traits and diseases, including cancer and HIV. In order to detect differentially spliced genes and genetic variants that regulate isoform transcription, it is crucial to correctly estimate individual-specific expression levels of each isoform. To achieve this, we propose a Bayesian nonparametric statistical framework to model RNA-sequencing data. This model, called a Hierarchical Dirichlet Process, views sequencing reads from an individual as random observations from a multinomial distribution associated with an unobserved isoform. Our model has the novel properties that i) it does not require the number of isoforms to be specified a priori but estimates it from the data; ii) it allows sharing of information across individuals, which is useful since isoforms may be shared across individuals; iii) it does not assume a uniform rate of transcription across isoform sites, whereas current models use a Poisson-based distribution with uniform rate assumptions. This last property matters because RNA-seq studies have shown the existence of sequence-specific and position-specific read count biases, which cause the baseline distribution of reads to be considerably non-uniform along the isoform sequence.
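To make the generative view above concrete, here is a minimal simulation sketch. It is not our implementation: it assumes a finite truncation of the Hierarchical Dirichlet Process (stick-breaking weights cut off at `truncation` candidate isoforms), and every name and default (`simulate`, `gamma`, `alpha`, the Dirichlet parameter 0.5 for the read profiles) is hypothetical and chosen only for illustration. Shared global weights tie the individuals together, and each isoform carries its own non-uniform multinomial profile over read positions.

```python
import random

def stick_breaking(concentration, truncation, rng):
    """Truncated stick-breaking weights; they sum to 1 by construction."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        frac = rng.betavariate(1.0, concentration)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)  # the last stick takes the leftover mass
    return weights

def dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def simulate(n_individuals=3, n_reads=200, n_positions=6,
             gamma=1.0, alpha=5.0, truncation=8, seed=0):
    """Simulate reads per individual (all parameter values are illustrative)."""
    rng = random.Random(seed)
    # Global isoform weights shared by all individuals (top level).
    beta = stick_breaking(gamma, truncation, rng)
    # One non-uniform multinomial read profile per candidate isoform.
    profiles = [dirichlet([0.5] * n_positions, rng) for _ in range(truncation)]
    data = []
    for _ in range(n_individuals):
        # Individual-specific isoform levels, centred on the shared weights.
        # The small constant keeps every gamma shape parameter positive.
        pi = dirichlet([alpha * b + 1e-6 for b in beta], rng)
        reads = []
        for _ in range(n_reads):
            k = rng.choices(range(truncation), weights=pi)[0]             # latent isoform
            pos = rng.choices(range(n_positions), weights=profiles[k])[0]  # read position
            reads.append((k, pos))
        data.append((pi, reads))
    return beta, profiles, data
```

In this sketch the isoform label `k` of each read is recorded only because the data are simulated; in real RNA-seq data it is unobserved and has to be inferred.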
We use fast Markov chain Monte Carlo (MCMC) methods to jointly estimate both the unobserved read assignments to isoforms and individual-specific isoform levels. This allows us to efficiently map sequenced reads to isoforms, learning at the same time the set of isoforms that are expressed for a given gene and how much each isoform is expressed in each individual. These fitted models are then used to identify the set of isoforms across individuals for a particular cell type, test for differential expression of isoforms, and identify genetic variants associated with individual-specific isoform levels.
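The alternation between read assignments and isoform levels can be illustrated with a deliberately simplified Gibbs sampler. This is a sketch under strong assumptions, not the sampler used in this work: it handles a single individual, takes the candidate isoform profiles as known and fixed, and uses a finite Dirichlet prior in place of the full nonparametric hierarchy. The function name `gibbs_isoform_levels` and all of its parameters are hypothetical.

```python
import random

def gibbs_isoform_levels(positions, profiles, prior=1.0,
                         n_iter=200, burn_in=50, seed=0):
    """Posterior-mean isoform levels for one individual.

    positions: observed read positions (integer indices).
    profiles:  profiles[k][pos] is the probability that isoform k yields a
               read at pos; each observed position must have positive
               probability under at least one isoform.
    """
    rng = random.Random(seed)
    n_iso = len(profiles)
    pi = [1.0 / n_iso] * n_iso          # start from uniform isoform levels
    mean_pi, kept = [0.0] * n_iso, 0
    for it in range(n_iter):
        # Step 1: sample the unobserved isoform of each read given the levels.
        counts = [0] * n_iso
        for pos in positions:
            w = [pi[k] * profiles[k][pos] for k in range(n_iso)]
            k = rng.choices(range(n_iso), weights=w)[0]
            counts[k] += 1
        # Step 2: sample the levels given the assignments (Dirichlet update).
        draws = [rng.gammavariate(prior + c, 1.0) for c in counts]
        total = sum(draws)
        pi = [d / total for d in draws]
        # Average the sampled levels after burn-in.
        if it >= burn_in:
            kept += 1
            mean_pi = [m + p for m, p in zip(mean_pi, pi)]
    return [m / kept for m in mean_pi]

# Toy usage: two hypothetical isoforms with opposite positional biases.
toy_profiles = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
toy_positions = [0] * 90 + [3] * 10
levels = gibbs_isoform_levels(toy_positions, toy_profiles)
```

With most toy reads falling where the first isoform is most likely to produce them, the estimated level of the first isoform should dominate. A full treatment would additionally sample the profiles, share weights across individuals, and let the number of isoforms grow with the data.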
You may contact the first author (during and after the meeting) at