Computationally-efficient long-range phasing with very large datasets. M. J. Barber1, R. E. Curtis2, K. Noto1, Y. Wang1, J. M. Granka1, N. M. Myres2, J. K. Byrnes1, C. A. Ball1, K. G. Chahine2 1) Ancestry.com, San Francisco, CA; 2) Ancestry.com, Provo, UT.

   Although phasing a large set of genotypes (i.e., > 100,000 samples) into probable haplotypes is computationally intensive, it presents the opportunity to leverage sample size to increase phasing accuracy for every sample. Phasing switch error rates can be greatly reduced when genotypes with a known parent-offspring relationship are available, but such datasets are not universally available. Recent methodological advances have used the large number of identity-by-descent segments (IBD-SEGs) shared between a sample and the rest of the samples in the dataset to improve phasing accuracy. These long-range phasing approaches use an IBD-SEG to help phase the matched IBD region. At AncestryDNA, we are applying the principle of long-range phasing to very large (and growing) datasets of genotyping data. Our approach assembles high-confidence IBD-SEGs to form an explicit surrogate parent for each sample. The assembly of an explicit surrogate parent is computationally efficient: the only requirement is that each IBD-SEG be assessed for quality separately, rather than in a joint analysis. Given an explicit surrogate parent, both the phase estimates and the IBD-SEGs can then be updated. Our approach has the added advantage of enabling updates to the phase estimates of full or partial genotypes when new high-quality IBD-SEGs are identified, as is common in constantly growing databases such as AncestryDNA's. We test the accuracy of our approach using simulated genotyping datasets and thousands of confirmed parent-offspring relationships from the AncestryDNA database. Our novel approach aims to phase large numbers of samples efficiently and accurately, in a way that could be relevant and practical for a variety of applications, including the genome-wide association study datasets being generated by the genetics community.
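
The idea of assembling a surrogate parent from separately assessed IBD-SEGs can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify data structures or quality measures, so the `IbdSegment` class, the `quality` score, the `min_quality` threshold, and both function names are hypothetical. The sketch also makes a strong simplifying assumption, namely that every retained segment is shared on the same parental haplotype of the sample. Genotypes are coded as alternate-allele counts (0, 1, 2) at biallelic sites.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class IbdSegment:
    """Hypothetical record for one IBD segment between a sample and a match."""
    start: int                  # index of the first site covered (inclusive)
    quality: float              # assumed per-segment confidence score in [0, 1]
    match_genotypes: List[int]  # matched sample's alt-allele counts from `start` onward


def build_surrogate_parent(
    n_sites: int, segments: List[IbdSegment], min_quality: float = 0.9
) -> List[Optional[int]]:
    """Assemble one surrogate parental haplotype (0/1 per site, None if unknown).

    Each segment is assessed on its own score; no joint analysis is performed.
    Assumption: all retained segments lie on the same parental haplotype.
    """
    surrogate: List[Optional[int]] = [None] * n_sites
    best_quality = [0.0] * n_sites
    for seg in segments:
        if seg.quality < min_quality:
            continue  # discard low-confidence segments independently
        for offset, genotype in enumerate(seg.match_genotypes):
            site = seg.start + offset
            if genotype == 1:
                continue  # a heterozygous match does not reveal the shared allele
            if seg.quality > best_quality[site]:
                surrogate[site] = genotype // 2  # homozygous 0 -> allele 0, 2 -> allele 1
                best_quality[site] = seg.quality
    return surrogate


def phase_with_surrogate(
    genotypes: List[int], surrogate: List[Optional[int]]
) -> List[Optional[Tuple[int, int]]]:
    """Return (shared haplotype allele, other haplotype allele) per site.

    Heterozygous sites without surrogate information stay unphased (None).
    """
    phased: List[Optional[Tuple[int, int]]] = []
    for genotype, allele in zip(genotypes, surrogate):
        if genotype == 0:
            phased.append((0, 0))
        elif genotype == 2:
            phased.append((1, 1))
        elif allele is None:
            phased.append(None)
        else:
            phased.append((allele, 1 - allele))
    return phased
```

Because each segment is filtered on its own score in this sketch, a newly identified segment can be appended to the list and the two functions rerun without revisiting how the earlier segments were assessed, which mirrors the incremental updates described above.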