A scalable pipeline for local ancestry inference using thousands of reference individuals.
C. B. Do, E. Durand, J. M. Macpherson, B. Naughton, J. L. Mountain
23andMe, Inc., Mountain View, CA.
Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, is straightforward when the ancestral populations considered are sufficiently distinct. To date, however, no approaches have been shown to be effective at distinguishing between closely related populations (e.g., within Europe). Moreover, due to their computational complexity, most existing methods for ancestry deconvolution are unsuitable for application in large-scale settings, where the reference panels used contain thousands of individuals.
We describe Ancestry Painting 2.0, a modular three-stage pipeline for efficiently and accurately identifying the ancestral origin of chromosomal segments in admixed individuals. In the first stage, an out-of-sample extension of the BEAGLE phasing algorithm is used to generate a preliminary phasing for an unphased, genotyped individual. In the second stage, a support vector machine (SVM) using a specialized string kernel assigns tentative ancestry labels to short local phased genomic regions. In the third stage, an autoregressive pair hidden Markov model simultaneously corrects phasing errors and produces reconciled local ancestry estimates and confidence scores based on the SVM labels.
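To make the role of the third stage concrete, the following is a minimal sketch of how noisy per-window ancestry labels can be reconciled with a hidden Markov model. It is a simplified stand-in, not the autoregressive pair HMM described above: it decodes a single haplotype with plain Viterbi and does not correct phasing errors. The function name `smooth_labels` and the parameters `p_correct` (chance that a window label is right) and `p_switch` (chance that ancestry changes between adjacent windows) are assumptions introduced for illustration only.

```python
import math


def smooth_labels(window_labels, populations, p_correct=0.9, p_switch=0.01):
    """Viterbi decoding of the most likely ancestry path for one haplotype.

    Illustrative assumptions: at least two populations, at least one window,
    label errors spread uniformly over the wrong populations, and a constant
    switch probability between adjacent windows.
    """
    k = len(populations)
    log_stay = math.log(1.0 - p_switch)
    log_move = math.log(p_switch / (k - 1))
    log_hit = math.log(p_correct)
    log_miss = math.log((1.0 - p_correct) / (k - 1))

    def emit(state, label):
        # Log-probability of observing `label` when the true ancestry is `state`.
        return log_hit if state == label else log_miss

    def step(prev, cur):
        # Log-probability of moving from ancestry `prev` to ancestry `cur`.
        return log_stay if prev == cur else log_move

    # Uniform prior over the ancestry of the first window.
    score = {s: -math.log(k) + emit(s, window_labels[0]) for s in populations}
    backpointers = []
    for label in window_labels[1:]:
        new_score, pointer = {}, {}
        for s in populations:
            best_prev = max(populations, key=lambda r: score[r] + step(r, s))
            new_score[s] = score[best_prev] + step(best_prev, s) + emit(s, label)
            pointer[s] = best_prev
        backpointers.append(pointer)
        score = new_score

    # Trace back from the best final state to recover the full path.
    state = max(populations, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path
```

With these illustrative defaults, an isolated window whose label disagrees with its neighbors is overruled, while a sustained run of a different label is kept as a genuine ancestry switch. A model closer to the one described would track both haplotypes jointly so that phase switches can be detected as well.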
We compiled a reference panel of over 7,500 individuals of homogeneous ancestry, drawn from several publicly available datasets and from over 5,000 customers of the personal genetics company 23andMe, Inc. who reported four grandparents with the same country of origin; outliers identified through principal components analysis (PCA) were excluded. In cross-validation experiments, Ancestry Painting 2.0 achieves high sensitivity and specificity (in most cases 90%) for labeling chromosomal segments across more than 20 populations worldwide. We also demonstrate the robustness of the algorithm through simulations of individuals with known local admixture, and we compare Ancestry Painting 2.0 with existing state-of-the-art tools for multi-population local and global ancestry inference, including LAMP, ALLOY, PCA-ADMIX, and ADMIXTURE.
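As a reminder of how per-population sensitivity and specificity are defined in a multi-class setting such as this one, the sketch below computes both in a one-versus-rest fashion from true and predicted segment labels. The function name `per_population_metrics` and the toy labels in the usage note are assumptions for illustration; they do not reproduce the evaluation protocol or the data used in this work.

```python
def per_population_metrics(true_labels, predicted_labels):
    """Return {population: (sensitivity, specificity)}, one-versus-rest.

    Sensitivity = TP / (TP + FN): share of segments truly from the
    population that were labeled as such.
    Specificity = TN / (TN + FP): share of segments from other
    populations that were not labeled as this population.
    """
    populations = sorted(set(true_labels) | set(predicted_labels))
    metrics = {}
    for pop in populations:
        tp = fp = tn = fn = 0
        for truth, guess in zip(true_labels, predicted_labels):
            if truth == pop and guess == pop:
                tp += 1
            elif truth == pop:
                fn += 1
            elif guess == pop:
                fp += 1
            else:
                tn += 1
        # Undefined ratios (no positives or no negatives) are reported as NaN.
        sensitivity = tp / (tp + fn) if tp + fn else float("nan")
        specificity = tn / (tn + fp) if tn + fp else float("nan")
        metrics[pop] = (sensitivity, specificity)
    return metrics
```

For example, with true labels `["A"] * 4 + ["B"] * 6` and predictions `["A", "A", "A", "B", "B", "B", "B", "B", "B", "A"]`, population A has sensitivity 3/4 and specificity 5/6, and population B has sensitivity 5/6 and specificity 3/4.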