Underdog: A Fully-Supervised Phasing Algorithm that Learns from Hundreds of Thousands of Samples and Phases in Minutes.
K. Noto, Y. Wang, M. Barber, J. Granka, J. Byrnes, R. Curtis, N. Myres, C. Ball, K. Chahine. AncestryDNA, San Francisco, CA.
Algorithms that phase, i.e., that separate diploid genotypes into a pair of haplotype chromosomes, traditionally do so by phasing many samples together, comparing the genotypes and potential haplotypes to others in the input, and iteratively improving the phase. The larger the input set, the more accurate the phase. However, when the input contains hundreds of thousands of samples, these algorithms become intractable, forcing users to discard potentially useful data. Furthermore, the entire process must be repeated to phase new samples. We suggest that if a training set is large enough, it can be used to build haplotype models that can phase new samples quickly and accurately without requiring that the new samples be used to determine the models. We present a new approach called Underdog, which learns haplotype models from hundreds of thousands of haplotype samples and saves those models for later reuse, enabling the user to rapidly phase new samples. Our results on two experimental data sets show that Underdog phases new samples with 20%-60% fewer errors than current state-of-the-art approaches, and because Underdog takes advantage of parallelization, it can do so in minutes instead of hours (a 100-fold reduction in running time is typical).
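To make the "learn once, phase later" idea concrete, here is a minimal toy sketch in Python. It is not the Underdog algorithm, whose model structure is not described in this abstract. It assumes a deliberately simple model: a table of haplotype frequencies over a short window of biallelic sites, built once from already-phased training haplotypes. A new genotype (alternate-allele counts 0, 1, or 2 per site) is then phased by choosing the consistent haplotype pair with the highest product of frequencies. The names `train_model`, `phase`, and the `floor` parameter are hypothetical and used only for illustration.

```python
from collections import Counter
from itertools import product


def train_model(haplotypes):
    """Build a toy haplotype model: the relative frequency of each
    haplotype (a tuple of 0/1 alleles) in the phased training set."""
    counts = Counter(tuple(h) for h in haplotypes)
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.items()}


def phase(genotype, model, floor=1e-9):
    """Phase one genotype against a pre-built model.

    genotype: alternate-allele count per site (0, 1, or 2).
    Returns the pair of haplotypes, consistent with the genotype, whose
    frequencies have the largest product. Haplotypes never seen in
    training get the small probability `floor` (an assumed value).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed: 0 -> allele 0, 2 -> allele 1.
    base = [g // 2 for g in genotype]
    best_pair, best_score = None, -1.0
    # Fix the first heterozygous site to allele 0 on haplotype 1 so that
    # each unordered pair is scored only once.
    for rest in product((0, 1), repeat=max(len(het_sites) - 1, 0)):
        alleles = (0,) + rest if het_sites else ()
        h1, h2 = list(base), list(base)
        for site, allele in zip(het_sites, alleles):
            h1[site] = allele
            h2[site] = 1 - allele
        score = model.get(tuple(h1), floor) * model.get(tuple(h2), floor)
        if score > best_score:
            best_pair, best_score = (tuple(h1), tuple(h2)), score
    return best_pair


# Toy training set: two common haplotypes and one rare one.
training = [(0, 0, 1, 1)] * 3 + [(1, 1, 0, 0)] * 3 + [(0, 1, 0, 1)]
model = train_model(training)  # built once, reusable for every new sample

# A new sample that is heterozygous at all four sites.
print(phase((1, 1, 1, 1), model))
```

The model is built a single time from the training haplotypes, and each new sample is phased against it independently of every other new sample, which is what makes the phasing step easy to parallelize. Enumerating all haplotype pairs is only feasible for very short windows; a practical phaser needs a far more compact model.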