Frequency Uniqueness Score: Predicting the Disease Risk of Coding Variants.
A. C. Alexander (1), B. E. Engelhardt (2, 3, 4)
1) Department of Computer Science, Duke University, Durham, NC; 2) Department of Biostatistics & Bioinformatics, Duke University, Durham, NC; 3) Department of Statistical Science, Duke University, Durham, NC; 4) Institute for Genome Science & Policy, Duke University, Durham, NC.

In both clinical and research applications there is an acute need for rapid assessment of the disease risk of non-synonymous amino acid variants from whole-genome or exome sequencing data, with each variant classified as either pathogenic or functionally neutral. For example, improved classification accuracy that leads to early identification of a Mendelian disorder can have a meaningful impact on patient prognosis, and can expedite the discovery of possible therapies by focusing research efforts on smaller sets of candidate variants. Existing computational approaches to variant classification all suffer from low overall accuracy, with recent performance comparisons showing accuracies of less than 70% for popular tools such as PolyPhen-2 and SIFT, and accuracies of approximately 80% for the best-performing tools. This poor performance limits their utility in determining the causes of disease. Here we present an approach that represents a major departure from previous methods, which have relied primarily on cross-species conservation metrics and predicted impact on protein structure. Leveraging recent large-scale population studies, including the 1000 Genomes Project and the NHLBI Exome Sequencing Project, we use three simple, human-specific classes of features: gene variation metrics, locus variational frequency, and a metric for prior gene-disease association. We combine these metrics to predict the probability that a variant is pathogenic using a random forest classifier, which allows us to model feature interactions and provides a measure of the importance of each feature in prediction. We demonstrate that our approach substantially outperforms existing state-of-the-art methods on a variety of performance measures, with overall accuracy in excess of 90% on all tested data sets, and accuracy of 98% using cross-validation on the UniProt humsavar 2011_12 data set used to validate other methods.
Unlike other approaches, our method can be applied to all coding variants, including indels and splice-site variants, across all genes, and will naturally improve over time as more comprehensive estimates of human genetic variation become available. These results open the door to automated pathogenicity risk assessment and context-dependent variant classification in the clinical setting.
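
The classification step described above (three feature classes combined by a random forest that outputs a pathogenicity probability and per-feature importances) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature names, the tree depth, the number of trees, and the toy data generator are all assumptions made for illustration, and the data are synthetic rather than drawn from 1000 Genomes, ESP, or humsavar. The forest is written with the Python standard library only, using bootstrap samples, a random feature subset at each split, and Gini impurity.

```python
import random

# Hypothetical names for the three feature classes; the abstract does not
# define the actual features, so these are placeholders.
FEATURES = ["gene_variation", "locus_frequency", "disease_association"]


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def build_tree(rows, labels, depth, rng, importance):
    """Grow one tree; every split considers 2 randomly chosen features."""
    if depth == 0 or len(set(labels)) < 2:
        return sum(labels) / len(labels)  # leaf: fraction labelled pathogenic
    best = None
    for f in rng.sample(range(len(FEATURES)), 2):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return sum(labels) / len(labels)
    score, f, t = best
    # Credit the chosen feature with the weighted impurity decrease.
    importance[f] += len(labels) * (gini(labels) - score)
    lo = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    hi = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in lo], [y for _, y in lo], depth - 1, rng, importance),
            build_tree([r for r, _ in hi], [y for _, y in hi], depth - 1, rng, importance))


def tree_predict(node, row):
    """Walk from the root to a leaf and return the leaf probability."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(rows, labels, n_trees=25, max_depth=3, seed=0):
    """Fit trees on bootstrap samples; return trees and normalized importances."""
    rng = random.Random(seed)
    importance = [0.0] * len(FEATURES)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        trees.append(build_tree([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                max_depth, rng, importance))
    total = sum(importance) or 1.0
    return trees, [v / total for v in importance]


def predict_proba(trees, row):
    """Estimated probability of pathogenicity: mean of the tree outputs."""
    return sum(tree_predict(t, row) for t in trees) / len(trees)


def make_toy_data(n=200, seed=1):
    """Synthetic data (an assumption): pathogenic variants are rare and sit in
    genes with stronger prior disease association; gene_variation is noise."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        y = rng.random() < 0.5
        gene_var = rng.random()
        freq = rng.uniform(0.0, 0.01) if y else rng.uniform(0.005, 0.5)
        assoc = rng.uniform(0.4, 1.0) if y else rng.uniform(0.0, 0.6)
        rows.append((gene_var, freq, assoc))
        labels.append(int(y))
    return rows, labels


rows, labels = make_toy_data()
trees, importances = fit_forest(rows, labels)
print(dict(zip(FEATURES, importances)))
print(predict_proba(trees, (0.5, 0.0001, 0.95)))  # rare, strongly associated
print(predict_proba(trees, (0.5, 0.3, 0.05)))     # common, weakly associated
```

Averaging the leaf probabilities across trees is one common way to turn a forest into a probability estimate, and summing impurity decreases per feature is one common importance measure; the abstract does not say which variants of either the authors used.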
