ChipEnrich: gene set enrichment testing for ChIP-seq data.
R. P. Welch (1), C. Lee (1), L. J. Scott (2), R. A. Smith (1), P. Imbriano (2), M. A. Sartor (1)
1) Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; 2) Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA.
Gene set enrichment testing identifies pre-defined sets of genes that contain more experimentally relevant genes than would be expected by chance. The methodology was originally developed for the analysis of gene expression data and has since been adapted to other types of genome-wide data. Here we investigate the application of gene set enrichment testing to ChIP-seq data, specifically to the locations of peaks called from piled-up next-generation sequencing reads. Applying gene set enrichment methods to this type of data poses several challenges. Each ChIP-seq peak must be assigned to a gene, and because no exhaustive database of gene regulatory domains exists, we must use a heuristic such as assigning each peak to the nearest gene, to the nearest transcription start site (TSS), or according to another locus definition. We define a gene locus as the region of the genome in which a peak would be assigned to a given gene. The length of a gene locus acts as a confounder: genes with longer loci are more likely to have peaks assigned to them by chance, and therefore gene sets whose genes have longer loci on average are more likely to be detected as enriched. A proper test of gene set enrichment must adjust for gene locus length, as well as for other potential confounders such as the mappability of the sequence in the locus. We developed a method called ChIP-Enrich that empirically corrects for locus length, and optionally for mappability, using a logistic regression model with a smoothing spline term for each covariate. We compare our method to two existing methods, Fisher's exact test (FET) and GREAT, on a number of experimental ChIP-seq datasets from the literature. We illustrate a number of issues in using these existing methods, and show that our method properly corrects for the bias introduced by locus length and mappability regardless of the transcription factor binding profile.
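The nearest-TSS locus definition described above can be illustrated with a minimal sketch. It assumes a single chromosome, distinct TSS coordinates, and peaks summarized by their midpoints; the function names and the input format are hypothetical and are not taken from the package.

```python
from bisect import bisect_right


def nearest_tss_loci(tss_by_gene, chrom_length):
    """Partition one chromosome into gene loci under a nearest-TSS rule.

    tss_by_gene: dict mapping gene name -> TSS coordinate (hypothetical input
    format; TSS positions are assumed distinct and inside the chromosome).
    Returns half-open intervals (start, end, gene), sorted by start.
    """
    ordered = sorted(tss_by_gene.items(), key=lambda item: item[1])
    loci = []
    start = 0
    for i, (gene, pos) in enumerate(ordered):
        if i + 1 < len(ordered):
            # Boundary is the midpoint between neighbouring TSSs; a position
            # exactly on the midpoint goes to the gene with the larger TSS.
            end = (pos + ordered[i + 1][1]) // 2
        else:
            end = chrom_length
        loci.append((start, end, gene))
        start = end
    return loci


def count_peaks_per_gene(peak_midpoints, loci):
    """Assign each peak midpoint to the locus containing it and count per gene."""
    starts = [start for start, _, _ in loci]
    chrom_end = loci[-1][1]
    counts = {gene: 0 for _, _, gene in loci}
    for position in peak_midpoints:
        if not 0 <= position < chrom_end:
            raise ValueError(f"peak at {position} lies outside the chromosome")
        # Last locus whose start is <= position.
        index = bisect_right(starts, position) - 1
        counts[loci[index][2]] += 1
    return counts


# Toy example with made-up coordinates.
tss = {"geneA": 1000, "geneB": 3000, "geneC": 9000}
loci = nearest_tss_loci(tss, chrom_length=20000)
locus_lengths = {gene: end - start for start, end, gene in loci}
peak_counts = count_peaks_per_gene(
    [500, 1999, 2000, 5000, 7000, 15000, 19999], loci
)
print(locus_lengths)
print(peak_counts)
```

In this toy layout the locus of geneC is seven times as long as that of geneA (14,000 bp versus 2,000 bp), so peaks scattered uniformly along the chromosome would be assigned to geneC about seven times as often. This is the locus-length confounding that an enrichment test needs to adjust for.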
We also confirm that ChIP-Enrich correctly identifies the known biology of each transcription factor and, in some cases, is able to do so where other methods cannot. ChIP-Enrich will be available as a Bioconductor R package that provides the user with: 1) the ability to test their data using Fisher's exact test, ChIP-Enrich, or the binomial test used by GREAT; 2) 15 different annotation databases containing over 20,000 gene sets; 3) multiple methods of assigning peaks to genes; and 4) visualizations of various aspects of the ChIP-seq data profiles.
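For readers less familiar with the two comparison tests, the following sketch shows their general textbook form, under stated assumptions; it is not the package's implementation, and the function names are hypothetical. The Fisher's exact test here is the one-sided gene-level version, which treats every gene as equally likely to carry a peak and so ignores locus length. The binomial test is written in the GREAT style, assuming each peak falls in the gene set's loci with probability equal to the fraction of the genome those loci cover.

```python
from math import comb


def fisher_exact_greater(set_genes_with_peak, set_size, genes_with_peak, n_genes):
    """One-sided Fisher's exact test on gene counts.

    Returns P(X >= set_genes_with_peak), where X is hypergeometric: the number
    of gene-set members among `genes_with_peak` genes drawn from `n_genes`.
    """
    upper = min(set_size, genes_with_peak)
    tail = sum(
        comb(set_size, x) * comb(n_genes - set_size, genes_with_peak - x)
        for x in range(set_genes_with_peak, upper + 1)
    )
    return tail / comb(n_genes, genes_with_peak)


def binomial_greater(peaks_in_set_loci, total_peaks, genome_fraction):
    """One-sided binomial test on peak counts.

    Returns P(X >= peaks_in_set_loci), where X ~ Binomial(total_peaks,
    genome_fraction) and genome_fraction is the share of the genome covered
    by the loci of the genes in the set.
    """
    p = genome_fraction
    return sum(
        comb(total_peaks, x) * p**x * (1 - p) ** (total_peaks - x)
        for x in range(peaks_in_set_loci, total_peaks + 1)
    )


# Toy numbers, chosen only for illustration.
# 10 genes, 4 in the set, 5 genes with a peak, all 4 set genes among them.
p_fisher = fisher_exact_greater(4, 4, 5, 10)
# 10 peaks, the set's loci cover half the genome, 8 peaks fall in them.
p_binomial = binomial_greater(8, 10, 0.5)
print(p_fisher, p_binomial)
```

Neither sketch includes the spline-based adjustment for locus length or mappability that the logistic regression model provides.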