Fast and accurate pedigree-based imputation from sequenced data in a founder population.
O. E. Livne1, L. Han1, G. Alkorta-Aranburu1, W. Wentworth-Sheilds1, L. L. Pesce2,3, C. Ober1, M. Abney1, D. Nicolae1,4,5
1) Human Genetics, The University of Chicago, Chicago, IL; 2) Department of Pediatrics, The University of Chicago, Chicago, IL; 3) Computation Institute, The University of Chicago and Argonne National Laboratory, Chicago and Argonne, IL; 4) Department of Medicine, Section of Genetic Medicine, The University of Chicago, Chicago, IL; 5) Department of Statistics, The University of Chicago, Chicago, IL.
Despite decreasing DNA sequencing costs, the effects of rare genetic variants on disease risk remain hard to evaluate because the required sample sizes are very large and often prohibitively expensive or impractical to obtain. Founder populations have therefore attracted attention: many variants that are rare in the general population rise to higher frequencies through drift following the founding bottleneck, providing more power for association studies. Algorithms for phasing and imputation in related individuals exist, yet they often fall short of maintaining high accuracy for rare variants. We present a new fast and accurate imputation algorithm that uses genome-wide SNP genotypes for 1414 members of the South Dakota Hutterite population, whole-genome sequencing (WGS) data from Complete Genomics, Inc. for 98 of those individuals, and pedigree data connecting each of them to all others and to 64 founders. First, phased haplotypes are constructed based on nuclear families and on a hidden Markov model of identity-by-descent (IBD) among the samples. The phased haplotypes are then used to build a complete IBD segment dictionary, indexed by a novel network-based method that allows fast lookup and ensures the consistency of the global IBD structure. We phased >99% of the SNP genotypes and imputed ~11.6 million bi-allelic variants (SNPs, insertions, deletions) discovered in the WGS data in, on average, ~77% of the chromosomes of the 1414 individuals. Once IBD segments were indexed, the imputation required <0.1 second per variant and a total of 6 node hours on the University of Chicago Beagle supercomputer. Median concordance between imputed and directly genotyped data was >0.995 and was independent of minor allele frequency. We also determined high-confidence IBD-2 segments (regions in which a pair of individuals shares both haplotypes IBD) and used them to perform a generalized Mendelian error check that assesses the quality of the WGS data. In those regions, variant calling error rates were lowest for SNPs (0.3%), intermediate for deletions (1.5%), and highest for insertions (52%). Pedigree imputation has other advantages over LD-based imputation, such as inference of the parental origin of haplotypes and the ability to impute ancestors with no available DNA.
This work was supported by NIH grants HL085197 and HL21244, and in part by NIH through resources provided by the Computation Institute and the Biological Sciences Division of the University of Chicago and Argonne National Laboratory, under grant S10 RR029030-01.
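To make the imputation step concrete, the following is a minimal sketch of how alleles might be copied along IBD segments once the segment dictionary exists. It is not the authors' implementation: the `IBDSegment` class, the `impute_variant` function, and the (sample, parental strand) haplotype keys are all hypothetical names introduced for illustration, and a plain linear scan over segments stands in for the network-based index described in the abstract.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple

# Hypothetical haplotype key: (sample id, 0 = paternal strand, 1 = maternal strand).
Haplotype = Tuple[str, int]


@dataclass(frozen=True)
class IBDSegment:
    """A genomic interval [start, stop) shared IBD by a set of haplotypes."""
    start: int
    stop: int
    haps: FrozenSet[Haplotype]


def impute_variant(
    position: int,
    sequenced_alleles: Dict[Haplotype, int],
    segments: List[IBDSegment],
) -> Dict[Haplotype, Optional[int]]:
    """Copy the allele at `position` from sequenced haplotypes to every
    non-sequenced haplotype that shares a covering IBD segment with them.

    Returns a dict of imputed alleles; a value of None marks a haplotype
    that received conflicting alleles from different segments.
    """
    imputed: Dict[Haplotype, Optional[int]] = {}
    for seg in segments:  # linear scan; an assumption, not the paper's index
        if not (seg.start <= position < seg.stop):
            continue
        # Alleles observed on the sequenced haplotypes in this segment.
        known = {sequenced_alleles[h] for h in seg.haps if h in sequenced_alleles}
        if len(known) != 1:
            continue  # no sequenced carrier, or sequenced carriers disagree
        allele = next(iter(known))
        for h in seg.haps:
            if h in sequenced_alleles:
                continue  # already observed directly
            if h in imputed and imputed[h] != allele:
                imputed[h] = None  # conflicting evidence: leave uncalled
            elif h not in imputed:
                imputed[h] = allele
    return imputed


# Toy data: two IBD segments and two sequenced haplotypes.
segments = [
    IBDSegment(100, 500, frozenset({("A", 0), ("B", 1), ("C", 0)})),
    IBDSegment(300, 900, frozenset({("D", 1), ("E", 0)})),
]
sequenced = {("A", 0): 1, ("D", 1): 0}
result = impute_variant(350, sequenced, segments)
```

In this toy example, position 350 lies in both segments, so haplotypes `("B", 1)` and `("C", 0)` receive allele 1 from `("A", 0)`, and `("E", 0)` receives allele 0 from `("D", 1)`. A haplotype that appears in no covering segment with a sequenced carrier is simply left unimputed, which is one way to read the abstract's statement that not every chromosome could be imputed.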