Inferring ancient demography using whole-genome sequences from multiple individuals.
M. Steinruecken (1), J. Kamm (2), Y. Song (1,2)
1) EECS, University of California, Berkeley, CA; 2) Statistics, University of California, Berkeley, CA.
Uncovering the demographic history of present-day populations, especially of humans, has received considerable attention, since knowledge of this history is necessary to correctly interpret the results of association studies or the action of the evolutionary forces that shaped the genetic variation observed today. Quantities of interest include the divergence times of species or subpopulations, the intensity and duration of subsequent gene flow, and the sizes of ancestral populations. Despite significant progress in coalescent theory in recent decades, full-likelihood inference from a sample of DNA sequences under a suitable structured coalescent model remains infeasible. Some existing methods incorporate linkage information but are limited to small sample sizes, which prevents accurate inference in the recent or very distant past. Other studies have used large sample sizes to enable inference about the more recent past, but at the cost of ignoring linkage.
To benefit from both linkage information and larger sample sizes, we developed a method based on the conditional sampling distribution (CSD). The CSD describes the distribution of an additionally sampled haplotype conditional on a set of sequences that has already been observed. Combining ideas from the structured coalescent and the sequentially Markov coalescent, we devised a hidden Markov model (HMM) that efficiently and accurately approximates the true CSD. This approximate CSD can be used in suitable composite likelihood frameworks to approximate the probability of observing a given set of sequences under a given demographic scenario. Because our model can be cast as an HMM, demographic parameters can be inferred efficiently using an Expectation-Maximization approach.
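To make the idea of an HMM-based CSD inside a composite likelihood concrete, the following is a minimal sketch and not the model described in this abstract. It assumes a much simpler haplotype-copying HMM in which the hidden state is only the index of the panel haplotype being copied, with no demography, population structure, or coalescence times. The function names `csd_log_prob` and `pac_log_likelihood` and the parameters `switch_prob` and `mut_prob` are hypothetical and chosen for illustration only. The composite likelihood shown is a product of approximate conditionals, one of several possible frameworks.

```python
import numpy as np


def csd_log_prob(new_hap, panel, switch_prob, mut_prob):
    """Log-probability of `new_hap` given `panel` under a toy copying HMM.

    Hidden state: index of the panel haplotype currently being copied.
    Assumes biallelic sites coded as 0/1 (an illustrative simplification).
    """
    panel = np.asarray(panel)
    new_hap = np.asarray(new_hap)
    n, num_sites = panel.shape

    def emit(site):
        # Copy the panel allele faithfully with prob. 1 - mut_prob.
        match = panel[:, site] == new_hap[site]
        return np.where(match, 1.0 - mut_prob, mut_prob)

    # Uniform initial distribution over panel haplotypes.
    forward = emit(0) / n
    log_prob = 0.0
    for site in range(1, num_sites):
        # Rescale to avoid underflow and accumulate the log of the scale.
        scale = forward.sum()
        log_prob += np.log(scale)
        forward = forward / scale
        # Stay on the same haplotype, or switch to a uniformly chosen one.
        forward = ((1.0 - switch_prob) * forward + switch_prob / n) * emit(site)
    return log_prob + np.log(forward.sum())


def pac_log_likelihood(haps, switch_prob, mut_prob):
    """Composite log-likelihood as a product of approximate conditionals."""
    haps = np.asarray(haps)
    # Toy assumption: the first haplotype has each allele with prob. 1/2.
    total = haps.shape[1] * np.log(0.5)
    for k in range(1, len(haps)):
        total += csd_log_prob(haps[k], haps[:k], switch_prob, mut_prob)
    return total
```

In a sketch like this, parameters would be estimated by maximizing `pac_log_likelihood` over `switch_prob` and `mut_prob`, for example by a grid search or by Expectation-Maximization on the HMM. Note that the value depends on the order in which haplotypes are added, which is a known property of this kind of composite likelihood.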
We demonstrate the performance of our inference procedure through extensive simulations. We show that our method can accurately recover biologically relevant demographic parameters, such as population divergence times, migration rates, and ancestral population sizes, from simulated datasets. We also apply our method to human genomic sequence data to demonstrate its utility in learning about human demographic history. Applying our CSD in frameworks for phasing genotypes or imputing missing sequence data would make it possible to account for substructure in the underlying population, thus potentially increasing accuracy.