mRNA and small RNA sequencing of 465 HapMap cell lines: the feasibility of multicenter RNA-seq studies.

P. A. C. Hoen1, M. R. Friedlander2, J. Almlof3, M. Sammeth2,4, I. Pulyakhina1, S. Y. Anvar1,5, J. F. J. Laros1,5, O. Karlberg3, J. T. den Dunnen1,5, G. J. B. van Ommen1, I. G. Gut4, R. Guigo2, X. Estivill2, A. C. Syvanen3, E. T. Dermitzakis6,7,8, T. Lappalainen6,7,8, GEUVADIS consortium

1) Department of Human Genetics, Leiden University Medical Center, Leiden, Netherlands; 2) Centre for Genomic Regulation (CRG), Barcelona, Spain; 3) Department of Medical Sciences, Molecular Medicine and Science for Life Laboratory, Uppsala University, Uppsala, Sweden; 4) Centro Nacional de Analisis Genomico (CNAG), Barcelona, Spain; 5) Leiden Genome Technology Center, Leiden University Medical Center, Leiden, the Netherlands; 6) Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland; 7) Institute for Genetics and Genomics in Geneva (iG3), University of Geneva, Geneva, Switzerland; 8) Swiss Institute of Bioinformatics, Geneva, Switzerland.

RNA sequencing is an increasingly popular technology for genome-wide analysis of transcript structure and abundance. However, the sources of technical and inter-laboratory variation have not been assessed systematically. To address this, seven centers of the GEUVADIS consortium sequenced mRNAs and small RNAs of 465 HapMap lymphoblastoid cell lines (LCLs) for which the full genome sequence was available from the 1000 Genomes consortium. Five samples were sequenced in every center and 168 samples were sequenced in two centers. Biological variation between individual LCLs is limited. Nevertheless, the five samples that were sequenced in each laboratory clustered by sample and not by laboratory. The clustering by sample was much stronger for exon quantifications than for transcript quantifications. On further investigation, laboratory differences manifested mainly in the average GC percentage, the width of the distribution of GC percentages, and the insert sizes. A similar analysis was performed for small RNA sequencing. Again, the replicates sequenced in all laboratories grouped by sample rather than by laboratory. Clustering divided the samples into those dominated by miRNA and those dominated by rRNA. The proportions of miRNA and rRNA reads were more similar within samples than within laboratories. The miRNA content clearly varied between RNA extraction batches. Therefore, differences in relative miRNA/rRNA content were likely introduced during RNA isolation, before the samples were distributed across the laboratories. The heterogeneity in small RNA content did not bias the relative quantification of individual miRNAs. In conclusion, distributed RNA sequencing appears to be feasible. It is particularly attractive for large population-based and cross-biobank studies, where sequencing costs and sample logistics may require combining data from individual studies and laboratories.
The combined sequencing data from this project significantly extended our understanding of the genetic basis of transcriptome variation and generated an unprecedented resource of genomic variants affecting expression (eQTLs), splicing, and transcription start site and polyadenylation site usage.
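
The replicate comparison described above (do profiles of the same sample from different laboratories resemble each other more than profiles of different samples from the same laboratory?) can be illustrated with a minimal sketch. This is not the consortium's analysis pipeline: the function names, the use of Pearson correlation as the similarity measure, and the toy expression values are all assumptions made for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_correlation_by_grouping(profiles):
    """Compare similarity within samples and within laboratories.

    profiles maps (sample_id, lab_id) -> list of expression values.
    Returns a pair: the mean correlation over pairs sharing a sample but
    not a laboratory, and the mean correlation over pairs sharing a
    laboratory but not a sample.
    """
    same_sample, same_lab = [], []
    for (key1, values1), (key2, values2) in combinations(profiles.items(), 2):
        r = pearson(values1, values2)
        if key1[0] == key2[0] and key1[1] != key2[1]:
            same_sample.append(r)
        elif key1[1] == key2[1] and key1[0] != key2[0]:
            same_lab.append(r)
    return (sum(same_sample) / len(same_sample),
            sum(same_lab) / len(same_lab))


# Hypothetical toy data: two samples, each measured in two laboratories.
# Each laboratory adds a small, laboratory-specific offset per feature.
base = {
    "S1": [10.0, 2.0, 7.0, 1.0, 5.0],
    "S2": [1.0, 9.0, 3.0, 8.0, 2.0],
}
lab_shift = {
    "LabA": [0.3, -0.2, 0.1, 0.0, -0.1],
    "LabB": [-0.2, 0.1, 0.3, -0.3, 0.2],
}
profiles = {
    (sample, lab): [b + d for b, d in zip(base[sample], lab_shift[lab])]
    for sample in base
    for lab in lab_shift
}

r_sample, r_lab = mean_correlation_by_grouping(profiles)
print(f"mean r, same sample across labs: {r_sample:.3f}")
print(f"mean r, same lab across samples: {r_lab:.3f}")
```

If the within-sample correlation exceeds the within-laboratory correlation, the replicates group by sample, which is the pattern the abstract reports. A real analysis would work on normalized exon or transcript quantifications and would more likely use hierarchical clustering than a two-number summary.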
