Haplotype phasing across the full spectrum of relatedness.

J. O'Connell1,2, O. Delaneau2, N. Pirastu3, S. Ulivi4, M. Cocca5, M. Traglia5, J. Huang6, J. E. Huffman7, I. Rudan8, R. McQuillan8, R. M. Fraser8, H. Campbell8, O. Polasek9, C. Hayward7, A. F. Wright7, V. Vitart7, P. Navarro7, J. F. Zagury10, J. F. Wilson8, D. Toniolo5, P. Gasparani3, N. Soranzo6, J. Marchini1,2

1) The Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom; 2) Department of Statistics, University of Oxford, Oxford, United Kingdom; 3) Institute for Maternal and Child Health - IRCCS Burlo Garofolo, University of Trieste, Trieste, Italy; 4) Institute for Maternal and Child Health - IRCCS Burlo Garofolo, Trieste, Italy; 5) Division of Genetics and Cell Biology, San Raffaele Scientific Institute, Milano, Italy; 6) Wellcome Trust Sanger Institute, Hinxton, United Kingdom; 7) MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, Scotland; 8) Centre for Population Health Sciences, University of Edinburgh, Edinburgh, Scotland; 9) Faculty of Medicine, University of Split, Split, Croatia; 10) Laboratoire Génomique, Bioinformatique, et Applications (EA4627), Conservatoire National des Arts et Métiers, Paris, France.

Many existing cohorts contain a range of relatedness between genotyped individuals, either by design or by chance. Haplotype estimation (phasing) in such cohorts is a central step for many downstream analyses. Cohorts sampled from population isolates offer the opportunity for long range phasing, which leverages recent shared ancestry between individuals for extremely accurate haplotype inference. This idea was first popularised in a well-known paper by Kong et al. (2008), but tractable software for this approach is not available. Extended pedigrees may also be present amongst a wider cohort of unrelated individuals. A Lander-Green algorithm (e.g. Merlin) is the traditional method of choice for pedigree phasing, but this approach has several limitations. Using genotypes from six cohorts from isolated populations and one cohort from a non-isolated population, we have investigated the performance of different phasing methods designed for unrelated individuals. We find that SHAPEIT2 produces much lower switch error rates than other methods in all cohorts, especially in regions of identity-by-descent (IBD) sharing, where its error rate is below 0.1%. This occurs because the SHAPEIT2 algorithm implicitly looks for stretches of shared haplotypes between individuals, and can be thought of as a generalization of the long range phasing approach. We show that SHAPEIT2's performance in IBD regions also translates to very accurate phasing for pedigrees. We introduce a novel HMM that can further improve accuracy by integrating family information with the SHAPEIT2 haplotypes, giving us an effective method for dealing with extended pedigrees. The model allows us to accurately detect recombination events in a manner that is robust to genotyping error. We show that our method detects numbers of recombination events that align very well with expectations based on genetic maps, whereas Merlin produces inflated recombination rates due to its sensitivity to genotyping error. Our technique even has some ability to detect recombination events in parent-child duos that are not part of a wider pedigree, something that is impossible with a pedigree-only approach. In summary, this work demonstrates methodology for haplotype inference in cohorts with any degree of relatedness that produces haplotypes with unparalleled accuracy.
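
The abstract reports switch error rates without defining them. As a rough illustration only, the sketch below uses one common definition: among an individual's heterozygous sites, count how often the inferred phase flips relative to the true phase between consecutive sites, and divide by the number of consecutive pairs. The function name `switch_error_rate`, the 0/1 allele coding, and the toy haplotypes are assumptions made for this example; they are not taken from SHAPEIT2 or from the authors' evaluation pipeline.

```python
from typing import Sequence, Tuple


def switch_error_rate(
    true_h1: Sequence[int],
    true_h2: Sequence[int],
    inferred_h1: Sequence[int],
    inferred_h2: Sequence[int],
) -> Tuple[int, int, float]:
    """Return (switches, opportunities, rate) for one individual.

    Hypothetical helper: alleles are coded 0/1 and all four haplotypes
    cover the same sites in the same order.
    """
    n = len(true_h1)
    if not (len(true_h2) == len(inferred_h1) == len(inferred_h2) == n):
        raise ValueError("all haplotypes must have the same length")

    # For each heterozygous site, record which true haplotype the first
    # inferred haplotype agrees with (0 -> true_h1, 1 -> true_h2).
    phases = []
    for t1, t2, i1, i2 in zip(true_h1, true_h2, inferred_h1, inferred_h2):
        if t1 == t2:
            continue  # homozygous sites carry no phase information
        if {i1, i2} != {t1, t2}:
            raise ValueError("inferred genotype differs from true genotype")
        phases.append(0 if i1 == t1 else 1)

    # A switch error is a change of orientation between consecutive
    # heterozygous sites.
    switches = sum(1 for a, b in zip(phases, phases[1:]) if a != b)
    opportunities = max(len(phases) - 1, 0)
    rate = switches / opportunities if opportunities else 0.0
    return switches, opportunities, rate


# Toy data: five heterozygous sites, with one phase flip after the second.
truth_a = [0, 1, 0, 1, 1, 0]
truth_b = [1, 0, 0, 0, 0, 1]
est_a = [0, 1, 0, 0, 0, 1]
est_b = [1, 0, 0, 1, 1, 0]
print(switch_error_rate(truth_a, truth_b, est_a, est_b))  # (1, 4, 0.25)
```

Under this definition the result does not depend on which inferred haplotype is listed first, because only changes in orientation are counted. In practice the "true" haplotypes would have to come from an independent source such as family data.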

You may contact the first author (during and after the meeting) at