Integrated analysis of protein-coding variation in over 90,000 individuals from exome sequencing data.
D. G. MacArthur1,2, M. Lek1,2,3,4, E. Banks2, R. Poplin2, T. Fennell2, K. Samocha1,2, B. Thomas1,2, K. Karczewski1,2, S. Purcell1,2,5, P. Sullivan6, S. Kathiresan1,2, M. I. McCarthy7, M. Boehnke8, S. Gabriel2, D. M. Altshuler1,2, G. Getz1,2, M. J. Daly1,2, Exome Aggregation Consortium
1) Massachusetts General Hospital, Boston, MA, USA; 2) Broad Institute of Harvard and MIT, Cambridge, MA, USA; 3) University of Sydney, Sydney, NSW, Australia; 4) Institute for Neuroscience and Muscle Research, Sydney, NSW, Australia; 5) Mt Sinai School of Medicine, New York, NY, USA; 6) University of North Carolina, Chapel Hill, NC, USA; 7) University of Oxford, Oxford, UK; 8) University of Michigan, Ann Arbor, MI, USA.
The discovery of genetic variation has been empowered by the growing availability of DNA sequencing data from large studies of common and rare diseases, but these data are typically processed inconsistently and are largely inaccessible to most genetics researchers. We have developed an efficient and scalable pipeline for the joint analysis of exome sequencing data from tens of thousands of samples and have applied it to a collection of over 90,000 individuals sequenced in diverse population genetic and disease studies. Using extensive independent validation data, we demonstrate that our joint variant calling approach improves the accuracy, sensitivity and consistency of rare variant detection. Our results provide an unprecedented view of the spectrum of human functional genetic variation, extending down to extremely low population frequencies. We observe >8 million single nucleotide polymorphisms (SNPs), including over 3.5 million rare (<1%) missense variants and >15,000 previously reported severe disease-causing mutations. We show that the frequency spectrum of rare variants can be used to assess the accuracy of functional annotation approaches and to identify likely misannotated disease mutations. We describe the distribution of >150,000 predicted loss-of-function variants across human genes and the functional assessment of over 1,000 of these with independent RNA sequencing data. We also demonstrate the benefits of large joint-called reference panels for identifying gene regions subject to strong functional constraint and for the discovery of rare causal variants in both complex and Mendelian diseases. Finally, we announce the public release of observed variants, population frequencies and gene-level summary statistics for a subset of over 55,000 reference exomes. These summary results are publicly available via an intuitive browser.
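As a rough illustration of how a frequency spectrum might be used to compare annotation classes, the sketch below computes the proportion of singletons (variants seen exactly once) per class. Classes under stronger purifying selection are generally expected to be skewed toward rarer alleles. This is not the authors' pipeline: the function name `singleton_proportion`, the input layout, and the toy allele counts are all assumptions made for the example.

```python
from collections import defaultdict


def singleton_proportion(variants):
    """Return {annotation: fraction of observed variants with allele count 1}.

    `variants` is an iterable of (annotation, allele_count) pairs.
    This input layout is hypothetical, chosen only for illustration.
    """
    totals = defaultdict(int)
    singletons = defaultdict(int)
    for annotation, allele_count in variants:
        if allele_count < 1:
            continue  # skip sites with no observed alternate allele
        totals[annotation] += 1
        if allele_count == 1:
            singletons[annotation] += 1
    return {a: singletons[a] / totals[a] for a in totals}


# Toy, made-up allele counts (not real data).
toy_variants = [
    ("synonymous", 1), ("synonymous", 12), ("synonymous", 40), ("synonymous", 3),
    ("missense", 1), ("missense", 1), ("missense", 7), ("missense", 2),
    ("loss_of_function", 1), ("loss_of_function", 1),
    ("loss_of_function", 1), ("loss_of_function", 2),
]

result = singleton_proportion(toy_variants)
for annotation, proportion in result.items():
    print(f"{annotation}: {proportion:.2f}")
```

Under this kind of summary, a set of variants reported as severe disease-causing whose spectrum resembles that of synonymous variants would be a candidate for closer review, though any real analysis would also need to account for mutation rate and sample composition.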
You may contact the first author (during and after the meeting) at