Identification of a Set of Highly Constrained Genes from Exome Sequencing Data.
K. E. Samocha (1,2), E. B. Robinson (1), B. M. Neale (1,2), M. J. Daly (1,2)
1) Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, MA; 2) Program in Medical and Population Genetics and Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA.
With a rising number of studies identifying de novo mutations, it is important to be able to evaluate the findings, especially when prioritizing genes for further study. Since a mutation in a gene under evolutionary constraint may be more likely to contribute to disease, we sought to identify a set of such genes based on a large collection of exome sequence data.
We developed a sequence-context-based model of de novo mutations to create per-gene probabilities of mutation. We observed a high correlation (0.94) between the probability of a synonymous mutation in a gene and the number of rare synonymous variants identified in that same gene, first using the NHLBI's Exome Sequencing Project data (evs.gs.washington.edu), then with 25,000 exomes analyzed simultaneously (see abstract by MacArthur et al.). We predicted the number of variants that we would expect to see in the dataset and, in order to quantify deviations, created a Z score of the chi-squared difference between observation and expectation for both synonymous and missense variation. While the distribution of these Z scores for the synonymous variants was normal, there was a marked shift in the missense distribution towards fewer variants than predicted.
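The scoring step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fit_scale` and `signed_z` are hypothetical, the mapping from mutation probability to expected count is assumed to be a single proportionality factor fit on synonymous data, and the sign convention (positive Z when fewer variants are observed than expected) is also an assumption.

```python
import math


def fit_scale(syn_probs, syn_counts):
    """Least-squares slope through the origin relating per-gene synonymous
    mutation probability to the observed rare synonymous variant count.
    Assumption: a single proportionality factor is adequate."""
    numerator = sum(p * c for p, c in zip(syn_probs, syn_counts))
    denominator = sum(p * p for p in syn_probs)
    return numerator / denominator


def signed_z(observed, expected):
    """Square root of the chi-squared term (obs - exp)^2 / exp, signed so
    that a deficit of variants (observed < expected) gives a positive Z.
    The sign convention is an assumption made for this sketch."""
    chi_squared = (observed - expected) ** 2 / expected
    return math.copysign(math.sqrt(chi_squared), expected - observed)


# Hypothetical toy numbers, for illustration only.
scale = fit_scale([1e-5, 2e-5], [10, 20])
expected_missense = scale * 5e-5          # expected count for one gene
z_missense = signed_z(20, expected_missense)
```

Under this convention, a gene with far fewer missense variants than its mutation probability predicts receives a large positive Z and would rank as constrained.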
We identified a list of excessively constrained genes representing roughly 5% of all genes. This set of genes showed enrichment for entries in the Online Mendelian Inheritance in Man (OMIM) database. Roughly half of the top 41 constrained genes, for which deviation from the expected number of missense variants was significant at p < 10^-6, have entries in OMIM with dominant or de novo inheritance patterns. By contrast, a set of genes for which the missense variants were very close to expectation (n = 235, -0.05 < Z < 0.05) had only 9 de novo or dominant inheritance entries in OMIM, which was significantly different from the number in the top 41 constrained genes (p < 10^-16).
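One way to compare the two gene sets is a one-sided hypergeometric (Fisher-type) test on the counts of OMIM entries. The abstract does not say which test was used, so the sketch below is an assumption, and the function name `hypergeom_tail` is hypothetical. The count of 20 for the constrained set is a placeholder standing in for "roughly half of 41"; the counts 41, 235, and 9 come from the text.

```python
from math import comb


def hypergeom_tail(k, population, successes, draws):
    """P(X >= k) when `draws` genes are sampled without replacement from
    `population` genes, of which `successes` have an OMIM entry."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / total


# Counts from the text, except omim_constrained (placeholder assumption).
n_constrained, n_neutral = 41, 235
omim_constrained, omim_neutral = 20, 9

p_value = hypergeom_tail(
    omim_constrained,
    n_constrained + n_neutral,
    omim_constrained + omim_neutral,
    n_constrained,
)
```

Because the placeholder count and the choice of test are both assumptions, the resulting p-value should not be expected to reproduce the figure reported in the abstract.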
This list of constrained genes showed significantly more overlap with genes containing a de novo loss-of-function mutation in both autism and intellectual disability (p < 0.0001 for both), but not with genes containing de novo loss-of-function mutations in controls (p = 0.66), indicating that this approach can effectively prioritize genes in which mutations can strongly predispose to disease.