Using Coalescent-Based Modeling for Large-Scale Fine Mapping of Complex Trait Loci using Sequencing Data in Large-Scale Case-Control Studies.
Z. Geng1, P. Scheet3, S. Zöllner1,2
1) Dept Biostatistics, Univ Michigan, Ann Arbor, MI; 2) Dept Psychiatry, Univ Michigan, Ann Arbor, MI; 3) Dept Epidemiology, Univ Texas MD Anderson Cancer Center, Houston, TX.
Association mapping based on linkage disequilibrium (LD) is widely used to identify genomic regions containing disease variants. However, because the genetic dependence structure among sites is complicated, identifying the underlying risk variants for complex diseases is challenging. By modeling the evolutionary process that produces the sequencing data, coalescent-based approaches may extract more information and improve such mapping. These methods provide the genealogy at every site in the sequenced region. We can therefore model the probability of carrying risk variants at all loci jointly and obtain Bayesian credible intervals (CIs) where true risk variants are most likely to occur. Additionally, the genealogy at each position provides more information about the shared ancestry of neighboring sites. Such careful modeling of the shared ancestry of sequences may also benefit haplotyping and variant calling in regions of interest (ROIs) where traditional hidden Markov approaches struggle. However, existing coalescent-based methods typically suffer from a major challenge: computational intensity. Here, we propose a novel approach that overcomes this difficulty so that coalescent-based mapping can be applied to large-scale studies. First, we infer a set of clusters from the sampled haplotypes so that the haplotypes within each cluster are inherited from a common ancestor. Then, we apply coalescent-based approaches to approximate the genealogy of these ancestral haplotypes at different positions across the ROI. In doing so, the number of external nodes in the coalescent model is reduced from the total sample size to the number of clusters. Finally, we evaluate the position-specific cluster genealogies together with the phenotype distribution of each cluster's descendants, integrate over all positions, and establish CIs where risk variants are most likely to occur.
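The two ends of this pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: the greedy Hamming-distance clustering is a simple stand-in for inferring clusters that share a common ancestor, the coalescent step is omitted entirely, and the per-position posterior is assumed to be given. The names `greedy_cluster`, `credible_interval`, `max_dist`, and `level` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Number of sites at which two equal-length haplotype strings differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(haplotypes: Sequence[str], max_dist: int) -> List[List[int]]:
    """Stand-in clustering (assumption, not the abstract's method):
    put each haplotype in the first cluster whose representative
    is within max_dist mismatches, else start a new cluster."""
    reps: List[int] = []            # index of each cluster's representative
    clusters: List[List[int]] = []  # member indices per cluster
    for i, hap in enumerate(haplotypes):
        for c, r in enumerate(reps):
            if hamming(hap, haplotypes[r]) <= max_dist:
                clusters[c].append(i)
                break
        else:
            reps.append(i)
            clusters.append([i])
    return clusters


def credible_interval(
    positions: Sequence[float],
    posterior: Sequence[float],
    level: float = 0.95,
) -> Tuple[float, float]:
    """Shortest contiguous window of sorted positions whose normalized
    posterior mass is at least `level` (assumes the posterior is given)."""
    total = sum(posterior)
    probs = [p / total for p in posterior]
    best: Optional[Tuple[float, float, float]] = None
    left, mass = 0, 0.0
    for right in range(len(probs)):
        mass += probs[right]
        # Shrink from the left while the window still holds enough mass.
        while left < right and mass - probs[left] >= level:
            mass -= probs[left]
            left += 1
        if mass >= level:
            width = positions[right] - positions[left]
            if best is None or width < best[0]:
                best = (width, positions[left], positions[right])
    if best is None:  # rounding kept the total just under `level`
        return positions[0], positions[-1]
    return best[1], best[2]


if __name__ == "__main__":
    haps = ["0000", "0001", "1111", "1110", "0000"]
    print(greedy_cluster(haps, max_dist=1))
    print(credible_interval([100, 200, 300, 400, 500],
                            [0.01, 0.04, 0.6, 0.3, 0.05], level=0.85))
```

With five haplotypes reduced to two clusters, any downstream genealogy would have two external nodes instead of five, which is the dimension reduction described above; the second function shows one way a posterior over positions could be turned into an interval.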
In simulation studies, our method correctly localizes short segments around true risk positions for both rare (1%) and common (5%) risk variants in datasets with thousands of individuals, whereas traditional coalescent-based approaches typically restrict the sample size to a few hundred. In summary, we have developed a novel approach to estimate the genealogy throughout sequenced regions. For fine mapping of complex trait loci, our method is applicable to large-scale case-control studies using sequencing data.