Improved exome prioritization of disease genes through cross species phenotype comparison.

D. Smedley1, S. Köhler2,3, A. Oellrich1, K. Wang4, C. Mungall5, S. E. Lewis5, S. Bauer2,3, D. Seelow6, P. Krawitz2,3, C. Gilissen7, M. Haendel8, P. Robinson2,3,9, Sanger Mouse Genetics Project

1) Wellcome Trust Sanger Institute, Cambridge, Cambridgeshire, United Kingdom; 2) Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, Berlin, Germany; 3) Max Planck Institute for Molecular Genetics, Berlin, Germany; 4) Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, CA; 5) Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA; 6) Department of Neuropaediatrics, Charité-Universitätsmedizin Berlin, Berlin, Germany; 7) Department of Human Genetics, Nijmegen Centre for Molecular Life Sciences, Nijmegen, The Netherlands; 8) University Library and Department of Medical Informatics and Epidemiology, Oregon Health & Sciences University, Portland, OR; 9) Berlin Brandenburg Center for Regenerative Therapies, Charité-Universitätsmedizin Berlin, Berlin, Germany.

   Whole-exome sequencing has successfully identified over 100 new disease-gene associations in the last few years. However, many cases remain unsolved after exome sequencing. This is often due to the sheer number of candidate variants that remain after common filtering strategies, such as removing low-quality variants, common variants, and those deemed non-pathogenic. Each of our genomes carries a background of ~100 genuine loss-of-function variants, with ~20 genes completely inactivated, which makes identifying the causative mutation problematic when these strategies are used alone. In some situations, further filtering may be possible through the use of multiple affected individuals, linkage data, identity-by-descent inference, identification of de novo heterozygous mutations from trio analysis, or prior knowledge of affected pathways. Where these strategies are not possible or have proved unsuccessful, we propose an additional approach that exploits the wealth of genotype-to-phenotype data already available from model organism studies to assess the potential impact of these exome variants. We have developed an algorithm, PHenotypic Interpretation of Variants in Exomes (PHIVE), which integrates the calculation of phenotype similarity between human diseases and genetically modified mouse models with evaluation of the variants according to allele frequency, pathogenicity and mode of inheritance. The approach can be used through our freely available web tool, Exomiser (http://www.sanger.ac.uk/resources/databases/exomiser). By large-scale validation using 100,000 exomes containing known disease-associated mutations, we have demonstrated a substantial improvement (1.8- to 5.1-fold) over purely variant-based (frequency and pathogenicity) methods, with the correct gene recalled as the top hit in up to 67% of samples, corresponding to an area under the ROC curve of over 95%.
We conclude that incorporation of phenotype data can play a vital role in translational bioinformatics and propose that exome sequencing projects should systematically capture and utilize clinical phenotypes to take advantage of the strategy presented here.
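
The idea of combining a variant-based score with a cross-species phenotype score can be illustrated with a minimal Python sketch. This is not the PHIVE implementation: the abstract does not specify how the scores are computed or combined, so the Jaccard overlap used for phenotype similarity, the 1% frequency cutoff, the equal weighting, and all names (`Candidate`, `phenotype_score`, `variant_score`, `rank_genes`, `GENE_A` etc.) are assumptions made for illustration. The sketch also assumes that patient and mouse phenotype terms have already been mapped to a shared vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate variant left after standard filtering (hypothetical structure)."""
    gene: str
    allele_freq: float           # population allele frequency, 0..1
    pathogenicity: float         # predicted pathogenicity, 0..1
    mouse_phenotypes: frozenset  # phenotype terms of the mouse model for this gene


def phenotype_score(patient_terms, model_terms):
    """Jaccard overlap of term sets; a simple stand-in for a semantic similarity measure."""
    if not patient_terms or not model_terms:
        return 0.0
    return len(patient_terms & model_terms) / len(patient_terms | model_terms)


def variant_score(candidate, max_freq=0.01):
    """Rarer and more pathogenic variants score higher; the 1% cutoff is assumed."""
    if candidate.allele_freq > max_freq:
        return 0.0
    rarity = 1.0 - candidate.allele_freq / max_freq
    return rarity * candidate.pathogenicity


def rank_genes(candidates, patient_terms):
    """Rank genes by the equally weighted mean of both scores (assumed weighting)."""
    scored = [
        (0.5 * variant_score(c) + 0.5 * phenotype_score(patient_terms, c.mouse_phenotypes), c.gene)
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [gene for _, gene in scored]


# Toy data: GENE_A has the most damaging variant but no matching mouse phenotype.
patient = frozenset({"craniosynostosis", "syndactyly", "hearing loss"})
candidates = [
    Candidate("GENE_A", 0.0, 0.95, frozenset()),
    Candidate("GENE_B", 0.001, 0.80, frozenset({"craniosynostosis", "syndactyly"})),
    Candidate("GENE_C", 0.05, 0.90, frozenset({"hearing loss"})),
]

ranking = rank_genes(candidates, patient)
print(ranking)
```

In this toy example a purely variant-based ranking would place GENE_A first, whereas adding the phenotype score moves GENE_B, whose mouse model shares two of the three patient phenotypes, to the top.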
