Extraction and analysis of clinical traits of multiple sclerosis using electronic medical records. M. F. Davis1, S. Sriram2,3, W. S. Bush1,4, J. C. Denny4, J. L. Haines1,2 1) Center for Human Genetics Research, Vanderbilt Univ, Nashville, TN; 2) Dept of Neurology, Vanderbilt Univ, Nashville, TN; 3) Vanderbilt Multiple Sclerosis Center, Vanderbilt Univ, Nashville, TN; 4) Dept of Biomedical Informatics, Vanderbilt Univ, Nashville, TN.

   The clinical course of multiple sclerosis (MS) is highly variable, and research data collection is costly and time-consuming. We evaluated natural language processing techniques applied to electronic medical records (EMR) to identify MS patients and key clinical traits of disease course. We used four algorithms based on ICD-9 codes, text keywords, and medications to identify individuals with MS from a de-identified research version of the EMR at Vanderbilt University. After identifying MS patients, we developed algorithms to extract detailed features capturing the clinical course of MS, including clinical subtype, presence of oligoclonal bands, year of diagnosis, year and origin of first symptom, Expanded Disability Status Scale (EDSS) scores, timed 25-foot walk scores, and MS medications. Algorithms were evaluated on a test set validated by two independent reviewers. We identified 5,789 individuals with MS. Positive predictive values for the clinical trait algorithms ranged from 87% to 99%. Recall values for clinical subtype, EDSS scores, and timed 25-foot walk scores were greater than 80%. DNA was available for 1,086 of the individuals through BioVU. These samples and 2,396 control samples were genotyped on the ImmunoChip. After extensive sample and SNP quality control, 1,031 cases, 2,226 controls, and 160,046 SNPs remained for analysis. At a nominal significance threshold of p < 0.05, 29 known MS loci were replicated in case-control analysis, further confirming the MS disease status of cases. Genome-wide analyses were conducted for each of the extracted MS features using linear regression for continuous measures, logistic regression for presence of oligoclonal bands, and Cox proportional-hazards regression for time to secondary progressive MS (SPMS). Analyses were adjusted for the first three principal components. No associations reached genome-wide significance, although multiple loci were associated in each analysis at a significance level of p < 1 x 10^-5. 
The most significant result from the time-to-SPMS analysis (127 individuals; p = 3 x 10^-7) was less than 100 kb upstream of CADM3, which encodes a brain-specific protein associated with inflammation, a hallmark feature of MS. This work demonstrates that detailed clinical information is recorded in the EMR and can be extracted with high reliability, and that these data can be used to further understanding of the genetics of MS.
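
The positive predictive value and recall reported for the trait-extraction algorithms can be illustrated with a minimal sketch. It assumes one common convention for extraction tasks, which the abstract does not specify: PPV is the fraction of extracted values that match the reviewer-validated value, and recall is the fraction of reviewer-documented values that were correctly extracted. The function name, patient IDs, and subtype labels below are hypothetical and are not taken from the study.

```python
from typing import Dict, Optional, Tuple


def ppv_and_recall(
    extracted: Dict[str, Optional[str]],
    gold: Dict[str, Optional[str]],
) -> Tuple[float, float]:
    """Compare algorithm output with a reviewer-validated test set.

    Both arguments map a patient ID to a trait value. None (or a missing
    ID in `extracted`) means no value was extracted; None in `gold` means
    the reviewers found no documented value for that patient.
    """
    # Extracted values that agree with the reviewers' value.
    correct = sum(
        1 for pid, value in extracted.items()
        if value is not None and gold.get(pid) == value
    )
    # Everything the algorithm claimed to find (denominator of PPV).
    n_extracted = sum(1 for value in extracted.values() if value is not None)
    # Everything the reviewers found documented (denominator of recall).
    n_documented = sum(1 for value in gold.values() if value is not None)

    ppv = correct / n_extracted if n_extracted else float("nan")
    recall = correct / n_documented if n_documented else float("nan")
    return ppv, recall


# Hypothetical toy data: clinical subtype for five patients.
gold_subtype = {"p1": "RRMS", "p2": "SPMS", "p3": "RRMS", "p4": None, "p5": "PPMS"}
extracted_subtype = {"p1": "RRMS", "p2": "SPMS", "p3": None, "p4": "PPMS"}

ppv, recall = ppv_and_recall(extracted_subtype, gold_subtype)
print(f"PPV = {ppv:.2f}, recall = {recall:.2f}")
```

Under this convention a wrong extracted value lowers PPV, while a value that is documented but never extracted lowers only recall, so an algorithm can score high on one measure and lower on the other.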