Predicting genome-wide DNA methylation using methylation marks, genomic position and DNA regulatory elements.
W. Zhang1, TD. Spector2, P. Deloukas3, JT. Bell2, BE. Engelhardt4,5,6
1) Department of Molecular Genetics and Microbiology, Duke University, Durham, NC; 2) Department of Twin Research and Genetic Epidemiology, King's College London, London, UK; 3) Wellcome Trust Sanger Institute, Hinxton CB10 1SA, UK; 4) Department of Biostatistics & Bioinformatics, Duke University, Durham, North Carolina, USA; 5) Department of Statistical Science, Duke University, Durham, North Carolina, USA; 6) Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina, USA.

   DNA methylation is one of the most studied epigenetic modifications of DNA and is known to play a role in cellular processes, complex traits, and diseases, including cancer. Recent assays that produce individual-specific, fine-scale DNA methylation profiles across genome-wide CpG sites have enabled epigenome-wide association studies to identify specific CpG sites associated with a phenotype. To expand these studies, computational prediction of site-specific methylation status is of great interest, but approaches to date predominantly predict methylation within a genomic locus using DNA sequence content as features, and they are often limited to specific genomic regions. Using data from the Illumina 450K methylation array for whole blood samples from 100 individuals, we identify striking correlation patterns of DNA methylation specific to CpG islands (CGIs), CGI shores, and non-CGIs. For example, we see what appears to be a circular pattern of correlation across the CGI shore and shelf regions. In contrast to single nucleotide polymorphisms (SNPs), where linkage disequilibrium induces correlation between SNPs, the correlation between neighboring CpG sites decays rapidly with genomic distance, making CpG sites less predictive of their neighboring sites, especially in regions of sparse coverage on the array. Based on these findings, we predict CpG site methylation levels with a random forest classifier whose features include neighboring CpG site methylation levels and genomic distance, as well as co-localization with coding regions, CGIs, and regulatory elements from the ENCODE project, among others. Our approach achieves 91%-94% prediction accuracy for genome-wide methylation levels at single-CpG-site precision, with higher accuracy when the genomic distance to neighboring CpG sites is restricted. The accuracy increases to 98% when prediction is restricted to CpG sites within CGIs. Our classifier outperforms state-of-the-art methylation classifiers and is interpretable in that it identifies the features that contribute to prediction accuracy. Neighboring CpG site methylation status, CpG island status, co-localized DNase I hypersensitive sites, and transcription factor binding sites, including Elf1 (ETS-related transcription factor 1), MAZ (Myc-associated zinc finger protein), Mxi1 (MAX-interacting protein 1), and Runx3 (Runt-related transcription factor 3), were found to be the most predictive features of methylation levels, suggesting an interacting role for these elements in epigenetic modification and regulation.
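
As an illustration of the neighbor-based features mentioned in the abstract, the following is a minimal sketch of how neighboring CpG methylation levels and genomic distances might be assembled into feature rows. It is not the authors' pipeline: the record layout, the function names (binarize, neighbor_features, build_rows), the 0.5 cutoff used to turn a beta value into a methylation status, and the toy data are all assumptions made for illustration. The abstract's other features (CGI status, ENCODE regulatory elements, and so on) are omitted.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record: (position in bp on a single chromosome, beta value in [0, 1]).
Site = Tuple[int, float]


def binarize(beta: float, threshold: float = 0.5) -> int:
    """Map a beta value to a 0/1 methylation status.

    The 0.5 cutoff is an assumption; the abstract does not state one.
    """
    return 1 if beta >= threshold else 0


def neighbor_features(
    sites: List[Site], index: int, max_distance: Optional[int] = None
) -> Optional[Dict[str, float]]:
    """Build features for sites[index] from its nearest upstream and
    downstream neighbors. `sites` must be sorted by position.

    Returns None if a neighbor is missing, or if either neighbor lies
    farther away than max_distance (when a limit is given).
    """
    if index <= 0 or index >= len(sites) - 1:
        return None
    pos, beta = sites[index]
    up_pos, up_beta = sites[index - 1]
    down_pos, down_beta = sites[index + 1]
    up_dist = pos - up_pos
    down_dist = down_pos - pos
    if max_distance is not None and max(up_dist, down_dist) > max_distance:
        return None
    return {
        "upstream_beta": up_beta,
        "upstream_distance": up_dist,
        "downstream_beta": down_beta,
        "downstream_distance": down_dist,
        "label": binarize(beta),
    }


def build_rows(
    sites: List[Site], max_distance: Optional[int] = None
) -> List[Dict[str, float]]:
    """Sort sites by position and collect one feature row per usable site."""
    ordered = sorted(sites)
    rows = []
    for i in range(len(ordered)):
        row = neighbor_features(ordered, i, max_distance)
        if row is not None:
            rows.append(row)
    return rows


# Toy data (invented): a low-methylation cluster and a distant high-methylation cluster.
toy_sites = [
    (1000, 0.05), (1040, 0.10), (1100, 0.08),
    (5000, 0.85), (5060, 0.90), (5100, 0.92),
]
rows = build_rows(toy_sites, max_distance=1000)
```

Rows of this shape could then be passed to a random forest implementation of the reader's choice. The max_distance argument mirrors the abstract's restriction on the genomic distance of neighboring CpG sites; the 1000 bp value used here is arbitrary.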

You may contact the first author (during and after the meeting) at