Integrated analysis of protein-coding variation in over 50,000 individuals.
M. Lek1,2, D. G. MacArthur1,2, A. Levy Moonshine2, M. Rivas3, S. Purcell1,4, P. Sullivan5, S. Kathiresan1,2, M. I. McCarthy3, M. Boehnke6, S. Gabriel2, D. M. Altshuler2, G. Getz1,2, M. J. Daly1,2, M. A. DePristo2, Exome Aggregation Consortium
1) Massachusetts General Hospital, Boston, MA; 2) Broad Institute of Harvard and MIT, Cambridge, MA; 3) University of Oxford, Oxford, UK; 4) Mt Sinai School of Medicine, New York, NY; 5) University of North Carolina, Chapel Hill, NC; 6) University of Michigan, Ann Arbor, MI.

   The increasing availability of DNA sequencing data has empowered variant discovery in studies of both common and rare diseases. However, for these data to provide maximum utility, it will be critical to generate consistent variant calls across tens of thousands of samples.
   We have assembled and jointly analyzed exome sequencing data from over 55,000 individuals sequenced as part of a variety of population genetic and disease-specific studies, an approach enabled by the development of new compressed file formats and variant-calling algorithms. We demonstrate that joint calling substantially improves the accuracy, sensitivity and consistency of variant detection. In particular, we highlight the benefits of creating very large joint-called sets of cases and controls for detecting rare causal variants in both complex and Mendelian diseases.
   Our results provide a view of the spectrum of human functional genetic variation extending down to extremely low population frequencies. We describe the frequency and genomic distribution of human protein-coding genetic variation, and show that the frequency spectrum of rare variants can be used to assess the accuracy of functional annotation approaches and to identify genes more likely to harbor severe disease-causing mutations. We also report the distribution of predicted loss-of-function (LoF) variants across human genes, their validation with independent RNA sequencing data, and their application in candidate gene prioritization for severe disease.
   Finally, we present new genotyping arrays containing the majority of protein-coding LoF variants and reported disease-causing mutations that occur at an appreciable frequency in our cohort. These arrays empower cost-effective association studies of rare, likely functional genetic variation and direct estimates of the penetrance of reported disease mutations in large, phenotyped cohorts.

You may contact the first author (during and after the meeting) at