Mining genomic feature sets and identifying significant biological relationships with BedTools2.
A. Quinlan, N. Kindlon. Center for Public Health Genomics, University of Virginia, Charlottesville, VA.
Modern DNA sequencing technologies are enabling unprecedented explorations of the spectrum of functional elements in diverse cell types. The fundamental result of such large projects is a complex, multi-dimensional collection of signals, such as ChIP-seq peaks, DNA methylation sites, and RNA-seq measurements, that are scattered throughout the genomes of hundreds of different cell types. While these datasets are crucial to gaining insight into genome regulation in the context of human disease, even basic analyses pose substantial computational and statistical challenges. The datasets are large, complex, and employ myriad file formats. Moreover, revealing new biological relationships such as co-associated regulatory elements depends upon choosing a relevant statistical metric. The inherent analytical complexity, computational burden, and debate about the choice of appropriate statistics motivated us to develop a standardized analysis toolkit for the genomics community.
Building upon our widely used bedtools genomic analysis software, we have developed bedtools2, a scalable toolkit for mining genomic feature sets and identifying significant biological relationships among them. We have completely re-engineered the core algorithms in bedtools2 to scale to analyses involving hundreds of datasets described in any common genomics file format (e.g., BAM, BED, VCF).
In the context of biological discovery, the most exciting new functionality in bedtools2 is a comprehensive set of statistical measures for revealing associations between sets of genomic features (e.g., do these transcription factors co-associate more than expected by chance?). Here we present the new statistical tests in bedtools2, compare our tests to existing approaches such as the ENCODE Genome Structure Correction (GSC) metric, and provide needed insights into which metrics are most appropriate to common biological questions. We demonstrate typical misuses of these metrics and illustrate how our tests and associated visualization tools can reveal new biological insights. Given the speed and analytical flexibility of bedtools2, we anticipate that our new toolkit will be an invaluable resource for geneticists studying the impact of genetic variation and regulatory elements on human disease phenotypes.
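To make the question "do these features co-associate more than expected by chance?" concrete, the following is a minimal Python sketch of one possible association measure: the Jaccard overlap of two interval sets, paired with a naive shuffle-based empirical p-value. It is an illustration only and not the bedtools2 implementation. All function names, the single-chromosome simplification, and the uniform-placement null model are assumptions made for the example.

```python
import random


def merge(intervals):
    """Merge overlapping or bookended half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_bp(intervals):
    """Total base pairs covered by a merged interval list."""
    return sum(end - start for start, end in intervals)


def intersect_bp(a, b):
    """Base pairs shared by two merged, sorted interval lists."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def jaccard(a, b):
    """Jaccard statistic: shared bp divided by bp in the union."""
    a, b = merge(a), merge(b)
    shared = intersect_bp(a, b)
    union = total_bp(a) + total_bp(b) - shared
    return shared / union if union else 0.0


def shuffle(intervals, chrom_len, rng):
    """Assumed null model: place each interval uniformly at random,
    preserving its length, on one chromosome of length chrom_len."""
    out = []
    for start, end in intervals:
        length = end - start
        new_start = rng.randrange(0, chrom_len - length + 1)
        out.append((new_start, new_start + length))
    return out


def empirical_p(a, b, chrom_len, n_perm=1000, seed=0):
    """Observed Jaccard and the fraction of shuffles that match or exceed it."""
    rng = random.Random(seed)
    observed = jaccard(a, b)
    hits = sum(
        jaccard(shuffle(a, chrom_len, rng), b) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical peak sets on a 10 kb toy chromosome.
tf_a = [(100, 200), (300, 400)]
tf_b = [(150, 250), (350, 450)]
observed, p_value = empirical_p(tf_a, tf_b, chrom_len=10_000)
```

Uniform placement ignores genome structure such as gaps, mappability, and the clustering of real features, so a p-value from this null can overstate significance. That sensitivity to the null model is one reason the choice of metric matters for questions of this kind.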
You may contact the first author (during and after the meeting) at