Application of Clinical Text Data for Phenome-Wide Association Studies (PheWASs). S. J. Hebbring, M. Rastegar-Mojarad, Z. Ye, J. Mayer, C. Jacobson, S. Lin. Marshfield Clinic Research Foundation, Marshfield, WI.

    Genome-Wide Association Studies (GWAS) have proven effective in describing the genetic complexity of common diseases. Phenome-Wide Association Studies (PheWASs) that use diagnostic codes embedded in electronic medical record (EMR) systems have likewise proven effective as an alternative and complementary approach to GWAS. The PheWAS technique can identify novel gene-disease associations and link multiple conditions to a common genetic etiology. Most PheWASs published to date have used ICD9 diagnostic codes to define cases and controls, but ICD9 codes have been shown to have limited utility: they are used primarily for billing, can have limited phenotypic granularity, and often do not allow other clinically relevant information to be used for PheWAS interpretation.
    As an alternative to ICD9 coding, a text-based phenome was defined from 1,564,831 clinical notes (423,537,905 words) from 4,204 patients, linked to Marshfield Clinic's EMR system. Clinical text data were cross-referenced with the UMLS Medical Dictionary of disease terms and drug names to enrich for 23,384 clinically relevant word strings that defined the text-based phenome. Five SNPs known to be associated with different phenotypes were genotyped in the 4,204 patients and tested for association across the text-based phenome. For all five SNPs, the expected word strings were associated with SNP genotype (p<0.02), with most at or near the top of their respective PheWAS rankings. For example, rs1061170, a SNP in CFH known to be associated with age-related macular degeneration (AMD), was strongly associated with AMD-related word strings including macular degeneration (p=1.8E-8), nonexudative (p=2.3E-7), exudative (p=1.4E-6), and visudyne, a drug commonly prescribed to treat AMD (p=3.9E-7). When results from the text-based PheWAS were compared with those from an ICD9-based PheWAS, the text-based PheWAS performed equivalently, with three of the five SNPs having stronger p-values.
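    The workflow described above (match note text against a term dictionary, then test each matched word string against genotype) can be illustrated with a minimal Python sketch. The abstract does not state the matching rules, the genetic model, or the statistical test, so everything here is an assumption made for illustration: whole-phrase, case-insensitive matching; a carrier versus non-carrier comparison; and a Pearson chi-square test on a 2x2 table. The function names (`extract_terms`, `build_phenome`, `text_phewas`) and the eight toy patients are hypothetical and are not data from the study.

```python
import math
import re

def extract_terms(note, dictionary):
    """Return dictionary terms that occur as whole words or phrases in a note."""
    text = note.lower()
    return {t for t in dictionary
            if re.search(r"\b" + re.escape(t) + r"\b", text)}

def build_phenome(notes_by_patient, dictionary):
    """Map each patient ID to the set of dictionary terms found in their notes."""
    phenome = {}
    for pid, notes in notes_by_patient.items():
        found = set()
        for note in notes:
            found |= extract_terms(note, dictionary)
        phenome[pid] = found
    return phenome

def chi2_p(a, b, c, d):
    """Pearson chi-square p-value (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # a margin is empty, so the table is uninformative
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def text_phewas(phenome, carriers):
    """Test every observed term against carrier status; return (term, p) by p."""
    terms = set().union(*phenome.values())
    results = []
    for term in terms:
        a = b = c = d = 0
        for pid, found in phenome.items():
            has_term = term in found
            is_carrier = pid in carriers
            if is_carrier and has_term:
                a += 1
            elif is_carrier:
                b += 1
            elif has_term:
                c += 1
            else:
                d += 1
        results.append((term, chi2_p(a, b, c, d)))
    return sorted(results, key=lambda r: (r[1], r[0]))

# Invented toy data for illustration only (not from the study).
dictionary = {"macular degeneration", "nonexudative", "exudative",
              "visudyne", "hypertension"}
notes_by_patient = {
    "p1": ["Nonexudative macular degeneration, both eyes."],
    "p2": ["Macular degeneration follow-up; started Visudyne."],
    "p3": ["Macular degeneration stable. Hypertension controlled."],
    "p4": ["History of macular degeneration."],
    "p5": ["Hypertension, on lisinopril."],
    "p6": ["Annual exam, no complaints."],
    "p7": ["Hypertension follow-up."],
    "p8": ["Seasonal allergies."],
}
carriers = {"p1", "p2", "p3", "p4"}  # hypothetical risk-allele carriers

phenome = build_phenome(notes_by_patient, dictionary)
ranking = text_phewas(phenome, carriers)
for term, p in ranking:
    print(f"{term}\t{p:.3g}")
```

    In this toy example the AMD term ranks first because it appears only in the hypothetical carriers. With counts this small an exact test would be more appropriate than a chi-square approximation, and a real analysis would also need to handle negation in notes ("no macular degeneration"), covariates, and multiple-testing correction across the phenome, none of which this sketch attempts.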
    In conclusion, this study demonstrates for the first time that raw text data from clinical notes in an EMR system can be used effectively to define a phenome. It also shows that clinical text data, including drug data, can be applied to a PheWAS as an alternative and complementary approach to a GWAS or ICD9-based PheWAS.