A hierarchical multiscale model to infer transcription factor occupancy from chromatin accessibility data. A. Raj1, H. Shim1, Y. Gilad1, M. Stephens1,2, J. Pritchard1,3 1) Department of Human Genetics, University of Chicago, Chicago, IL; 2) Department of Statistics, University of Chicago, Chicago, IL; 3) Howard Hughes Medical Institute, University of Chicago, Chicago, IL.
Understanding global gene regulation depends critically on accurate annotation of the regulatory elements that are functional in a given cell type. CENTIPEDE, a powerful probabilistic framework for identifying transcription factor binding sites from tissue-specific DNase I cleavage patterns and genomic sequence content, leverages the hypersensitivity of DNA-bound sites and the information in the DNase I footprint characteristic of each DNA-binding protein to accurately infer functional factor binding sites. However, this framework assumes that, conditional on binding, the DNase I hypersensitivity signals at any two genomic locations around the binding motif are independent, an assumption that is biologically unrealistic and is not supported by DNase I data at motifs with high ChIP-seq read depth for several different factors. In this work, we adapt a Bayesian multiscale modeling framework for Poisson processes to better model the underlying spatially structured DNase I cleavage pattern induced by the binding of a particular transcription factor. In comparison to results from CENTIPEDE, the factor-specific footprint inferred using this hierarchical model tends to be smoother, and the confidence of factor binding at putative binding motifs shows improved correlation with the occupancy of that factor as quantified by its ChIP-seq signal. Furthermore, we demonstrate improved area under the receiver operating characteristic (ROC) curve for this model across several transcription factors, using the ChIP-seq peaks identified for those factors by MACS as the reference. Finally, we show that a straightforward extension of these models to genomic locations containing motifs with low position-weight matrix scores identifies several high-confidence binding sites, improving the precision-recall characteristics of the learning algorithm.
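The abstract does not spell out the multiscale construction, so the following is only a minimal sketch of the generic idea behind multiscale models for Poisson counts, not the authors' implementation. It relies on a standard fact: if the per-position counts are independent Poisson variables, then their total is Poisson, and at each scale the left-half count given its parent count is binomial. A count vector can therefore be re-expressed as a total plus a tree of (parent, left-child) splits, and spatial structure can be introduced by placing priors on the split proportions. The function names `multiscale_decompose` and `multiscale_reconstruct` and the toy counts are hypothetical, chosen for illustration.

```python
def multiscale_decompose(counts):
    """Re-express a count vector as its total plus, for each scale,
    a list of (parent_count, left_child_count) pairs.

    Assumption for this sketch: len(counts) is a power of two.
    """
    n = len(counts)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    levels = []
    current = list(counts)
    while len(current) > 1:
        # Each parent is the sum of two adjacent children.
        parents = [current[i] + current[i + 1] for i in range(0, len(current), 2)]
        lefts = current[0::2]
        levels.append(list(zip(parents, lefts)))
        current = parents
    levels.reverse()  # coarsest scale first
    return current[0], levels


def multiscale_reconstruct(total, levels):
    """Invert multiscale_decompose: rebuild the per-position counts."""
    current = [total]
    for level in levels:
        finer = []
        for parent_value, (_, left) in zip(current, level):
            finer.extend([left, parent_value - left])
        current = finer
    return current


# Toy DNase I cleavage counts at four positions (made-up numbers).
toy_counts = [1, 2, 3, 4]
total, levels = multiscale_decompose(toy_counts)
rebuilt = multiscale_reconstruct(total, levels)
```

In a model of this kind, each (parent, left-child) pair would be given a binomial likelihood with its own split proportion, and a prior that shrinks those proportions toward one half at fine scales would yield a smoother inferred footprint. Whether the model described here uses that particular prior is not stated in the abstract.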
You may contact the first author (during and after the meeting) at