An automated method for extracting normalized mentions of human genes and proteins in biomedical text. P.S. White1,2, K. Murphy2, R. O'Hara2, M. D'arcy2, S. Carroll2, Y. Jin2, H-R. Fang2, J. Kim2, M. Mandel3, M. Liberman3,4,5, R. McDonald5, F. Pereira5. 1) Division of Oncology, Children's Hospital of Philadelphia; 2) Department of Pediatrics, University of Pennsylvania; 3) Linguistic Data Consortium, University of Pennsylvania; 4) Department of Linguistics, University of Pennsylvania; 5) Department of Computer and Information Science, University of Pennsylvania.
We developed an automated text mining process to identify and normalize references to human genes in biomedical text. Gene named entity recognition (NER) was performed using a machine-learning algorithm that considers semantic and syntactic features in text. Identified mentions were then normalized to standard gene names using vocabulary-based and approximate string matching. The combined gene NER and normalization process achieved 95.7% precision and 85.7% recall for document-level retrieval. When applied to MEDLINE, the process identified 36,953,389 gene mentions, 17,897,933 of which normalized to 14,501 human genes. We built a web interface (FABLE) to retrieve MEDLINE articles mentioning human genes and to compile lists of keyword-defined concepts and articles. FABLE supports searches for gene names and aliases and returns MEDLINE articles in which the query genes are mentioned, regardless of which gene alias(es) were used in the article. Results can be sorted in various ways, including by query relevance and journal impact factor. For 50 randomly selected genes, FABLE's ten most relevant articles showed 93.9% accuracy when compared with PubMed. On average, FABLE identified 33% more articles than PubMed. FABLE also allows users to generate lists of MEDLINE-mentioned genes implicated in any keyword-defined concept (e.g., schizophrenia NOT bipolar). Querying FABLE with a set of keywords returns a list of genes that co-occur with the input keyword(s) in an article. Lists consist of normalized gene symbols, the number of articles in which each gene is mentioned, and the implicating articles. Evaluation of FABLE-generated gene lists indicated precision comparable to, and recall higher than, that of manually established lists. Access FABLE at http://fable.chop.edu.
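The normalization step described above (vocabulary lookup followed by approximate string matching) can be illustrated with a minimal sketch. This is not the FABLE implementation, whose matcher is not described here. It assumes a toy, hypothetical alias lexicon (`LEXICON`), a hypothetical helper `normalize_mention`, and an arbitrary similarity cutoff, and it substitutes the Python standard library's `difflib` similarity ratio for whatever approximate matcher the authors used.

```python
import difflib
import re

# Hypothetical toy lexicon (official symbol -> aliases), for illustration only.
# A real system would load a full gene nomenclature resource instead.
LEXICON = {
    "TP53": ["p53", "tumor protein p53"],
    "ERBB2": ["HER2", "HER-2/neu", "neu"],
    "MYCN": ["N-myc", "NMYC"],
}


def _key(text):
    """Reduce a name to lowercase alphanumerics so spelling variants collide."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())


# Vocabulary index: normalized alias -> official symbol.
ALIAS_TO_SYMBOL = {}
for symbol, aliases in LEXICON.items():
    for name in [symbol, *aliases]:
        ALIAS_TO_SYMBOL[_key(name)] = symbol


def normalize_mention(mention, cutoff=0.85):
    """Map a gene mention to an official symbol, or None if nothing matches.

    Step 1: exact lookup in the vocabulary index.
    Step 2: approximate matching with difflib's similarity ratio;
            the cutoff value is an assumption, not a published setting.
    """
    key = _key(mention)
    if not key:
        return None
    if key in ALIAS_TO_SYMBOL:
        return ALIAS_TO_SYMBOL[key]
    close = difflib.get_close_matches(key, list(ALIAS_TO_SYMBOL), n=1, cutoff=cutoff)
    return ALIAS_TO_SYMBOL[close[0]] if close else None


if __name__ == "__main__":
    for mention in ["HER-2", "tumour protein p53", "N-Myc", "insulin"]:
        print(mention, "->", normalize_mention(mention))
```

In this sketch, alias variants such as "HER-2" resolve through the exact vocabulary step, while a spelling variant such as "tumour protein p53" falls through to the approximate step. A mention with no sufficiently similar vocabulary entry is left unnormalized, which mirrors the abstract's observation that only a subset of identified mentions normalized to human genes.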