Beware of circularity: A critical assessment of the state of the art in deleteriousness prediction of missense variants. C. A. Azencott1,2,3,4, D. Grimm4,5, J. W. Smoller6,7,8, L. Duncan10,9,6, K. Borgwardt4,5 1) MINES ParisTech, PSL Research University, Centre for computational biology, 77300 Fontainebleau, France; 2) Institut Curie, 75248 Paris Cedex 05, France; 3) INSERM, U900, 75248 Paris Cedex 05, France; 4) Machine Learning and Computational Biology research group, Max Planck Institute for Intelligent Systems and Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany; 5) Zentrum fuer Bioinformatik, Eberhard Karls Universität Tübingen, 72076 Tübingen, Germany; 6) Broad Institute of MIT and Harvard, Cambridge, MA; 7) Psychiatric and Neurodevelopmental Genetics Unit, Massachusetts General Hospital, Boston, MA; 8) Harvard Medical School, Department of Psychiatry, Boston, MA; 9) Harvard Medical School, Department of Medicine, Boston, MA; 10) Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA.

   Discrimination between disease-causing missense mutations and neutral polymorphisms is a key challenge in current sequencing studies. It is therefore critical to be able to evaluate the performance of the many in silico predictors of deleteriousness fairly and without bias. However, current analyses of such tools and their combinations are liable to suffer from the effects of circularity, which occurs when predictors are evaluated on data that are not independent from those used to build them, and which may lead to overly optimistic results. Circularity can first stem from overlap between training and evaluation datasets, which may result in the well-studied phenomenon of overfitting: a tool that is too tailored to a given dataset will be more likely than others to perform well on that set, but incurs the risk of failing more heavily at classifying novel variants. Second, we find that circularity may result from an investigation bias in the way mutation databases are populated: in most cases, all the variants of the same protein are annotated with the same (neutral or pathogenic) status. Furthermore, proteins containing only deleterious SNVs comprise many more labeled variants than their counterparts containing only neutral SNVs. As a consequence of this bias, we find that simply assigning a variant the same status as that of its closest labeled variant on the genomic sequence outperforms all state-of-the-art tools. Given these barriers to valid assessment of the performance of deleteriousness prediction tools, we employ approaches that avoid circularity, and hence provide an independent evaluation of ten state-of-the-art tools and their combinations. Our detailed analysis provides scientists with critical insights to guide their choice of tool as well as the future development of new methods for deleteriousness prediction. 
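As a rough illustration of the closest-variant baseline mentioned above, consider the following sketch (positions and labels are invented for demonstration; this is not the evaluation code used in this work): each query variant is simply given the label of the nearest labeled variant on the genomic sequence.

```python
# Toy sketch of the "closest labeled variant" baseline: a query variant
# inherits the (neutral/pathogenic) status of the nearest labeled variant
# on the genomic sequence. All positions and labels below are invented.

def nearest_variant_label(query_pos, labeled_variants):
    """Return the label of the labeled variant closest to query_pos.

    labeled_variants: list of (position, label) pairs, where label is
    'pathogenic' or 'neutral'.
    """
    pos, label = min(labeled_variants, key=lambda v: abs(v[0] - query_pos))
    return label

# Toy "database" of labeled variants on one chromosome.
training = [(1_000, "pathogenic"), (1_050, "pathogenic"),
            (9_000, "neutral"), (9_120, "neutral")]

print(nearest_variant_label(1_030, training))  # -> pathogenic
print(nearest_variant_label(9_500, training))  # -> neutral
```

That such a label-free-of-features rule can outperform sophisticated predictors is a symptom of the investigation bias described above, not of genuine predictive power.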
In particular, we demonstrate that the performance of FatHMM-W relies mostly on knowledge of the labels of neighboring variants, which may hinder its ability to annotate variants in less-explored regions of the genome. We also find that PolyPhen2 performs as well as or better than all other tools at discriminating between cases and controls in a novel autism-relevant dataset. Based on our findings about the mutation databases available for training deleteriousness prediction tools, we predict that retraining the features of PolyPhen2 on the VariBench dataset will yield even better performance, and we show that this holds on the autism-relevant dataset.
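One way to obtain a circularity-avoiding evaluation (a sketch under stated assumptions, not necessarily the exact protocol used in this work) is a protein-level split: entire proteins are held out, so that no protein contributes variants to both the training and the test set.

```python
# Sketch of a protein-level train/test split: every protein's variants
# land entirely in one fold. This removes train/test variant overlap and
# mitigates the same-protein label bias discussed above. Records are toy
# (protein_id, variant_id, label) tuples, invented for illustration.

import random

def protein_level_split(variants, test_fraction=0.3, seed=0):
    """Split (protein_id, variant_id, label) records by protein."""
    proteins = sorted({p for p, _, _ in variants})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    n_test = max(1, int(len(proteins) * test_fraction))
    test_proteins = set(proteins[:n_test])
    train = [v for v in variants if v[0] not in test_proteins]
    test = [v for v in variants if v[0] in test_proteins]
    return train, test

records = [("P1", "v1", "pathogenic"), ("P1", "v2", "pathogenic"),
           ("P2", "v3", "neutral"), ("P3", "v4", "neutral")]
train, test = protein_level_split(records)
# No protein appears in both folds.
assert not {p for p, _, _ in train} & {p for p, _, _ in test}
```

Under such a split, a tool can no longer score well merely by memorizing the per-protein label pattern of the training database.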

You may contact the first author (during and after the meeting) at