Automating literature reviews: Predicting variant pathogenicity using the bibliomic index. C. A. Cassa (1), D. M. Jordan (2), S. R. Sunyaev (1). 1) Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA; 2) Graduate Program in Biophysics, Harvard University, Cambridge, MA.
Clinical geneticists and researchers rely on the medical and scientific literature to interpret potential disease variants. While the GWAS community has developed stringent assessment standards to avoid false positive associations, Mendelian disease variants are typically assessed using inconsistent platforms and validation standards. Many variants are identified in small, symptomatic populations, so their effect size may be incorrect, or they may be erroneously associated with disease due to limited validation or unmatched control populations. The result is a mixture of trusted associations with unverified, or even incorrect, ones.
The consequence is that clinical interpretation of these variants often requires manual review to ascertain effect size and clinical significance. This approach will not scale with the exponential growth of clinical sequencing programs, as we observe previously reported disease variants at substantial rates in sequenced individuals, many of which require manual review.
Based on the idea that there are valuable published disease associations, but that it is difficult to distinguish the importance of any specific citation, we attempt to use the literature - in aggregate, by gene - to predict the pathogenicity of individual variants. Using a large set of publications that describe disease associations (HGMD), we develop a novel statistic called the bibliomic index, which uses publication impact and citation frequency (Thomson Reuters) to predict variant pathogenicity. In an independent dataset restricted to known disease genes, variants with higher bibliomic index scores are more likely to be rare, pathogenic variants. The three features in our bibliomic index have very strong predictive value, achieving an AUC of 0.8584. This information complements existing computational methods, which rely on structural and evolutionary factors; when combined with PolyPhen-2, we achieve an AUC of 0.9408.
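As an illustration of how AUC figures such as those above are computed, the following minimal sketch evaluates a literature-derived score, a structure-based score, and their combination on a toy set of labeled variants. Everything in it is an assumption made for illustration: the scores and labels are invented, the names `literature_score` and `structural_score` are hypothetical, and the simple average used to combine the two scores is a stand-in, since the abstract does not state how the bibliomic index was combined with PolyPhen-2.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen pathogenic variant (label 1)
    scores higher than a randomly chosen benign variant (label 0); ties count 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pathogenic and one benign variant")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented toy data: 1 = pathogenic, 0 = benign (not from the study).
labels = [1, 1, 1, 0, 0, 0]
literature_score = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]  # hypothetical literature-derived score
structural_score = [0.6, 0.8, 0.3, 0.2, 0.7, 0.1]  # hypothetical structure-based score

# Assumed combination rule: unweighted mean of the two scores.
combined_score = [(a + b) / 2 for a, b in zip(literature_score, structural_score)]

auc_literature = auc(literature_score, labels)
auc_structural = auc(structural_score, labels)
auc_combined = auc(combined_score, labels)

print(f"literature only: {auc_literature:.4f}")
print(f"structural only: {auc_structural:.4f}")
print(f"combined:        {auc_combined:.4f}")
```

In this toy set the two scores misrank different variants, so the combined score separates the classes better than either alone. This is the sense in which two predictors built on different evidence can complement each other.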
This demonstrates that aggregate bibliomic data can substantially improve the current arsenal of in silico predictors, mitigating the challenges traditionally associated with the accession and parsing of manuscripts. These features may be used to prioritize and contextualize candidate disease variants in known disease genes.