A genotype likelihood based phasing and imputation method for massive sample sizes of low-coverage sequencing data.

W. Kretzschmar (1), J. Marchini (1,2), The Haplotype Reference Consortium

1) Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom; 2) Department of Statistics, University of Oxford, Oxford, United Kingdom.

Calling genotypes from low-coverage sequencing data is a computationally challenging task. Applying existing methods to cohorts of a few thousand samples typically takes many weeks on large-scale compute clusters. Such methods will not scale to calling genotypes for the first release of the Haplotype Reference Consortium (HRC) (http://www.haplotype-reference-consortium.org/), which will consist of ~31,500 samples. We have developed methods that substantially cut the running time of calling genotypes in the HRC. Our method is an adaptive MCMC algorithm for genotype calling and haplotype imputation, derived from SNPTools (Wang et al. Genome Res 2013). It learns local haplotype clustering as the MCMC chain progresses and uses that clustering to guide the proposal distribution for haplotype sharing between samples. This adaptive scheme is very flexible and can naturally accommodate any existing haplotype estimates. Our method also supports phasing from reference panels and the inclusion of knowledge about family structure. In addition, we have implemented this approach on a GPU, resulting in a dramatic increase in speed. To illustrate the improvements in speed, we applied several methods to a single chunk of 1,024 sites from the HRC pilot project, consisting of genotype likelihoods on 12,753 samples. Beagle (v3.3.2) took 1,016 min (averaged over 12 regions). Our new method took 65 min, and the GPU implementation of our method took 3 min, at comparable accuracy; these times are roughly 16-fold and 340-fold faster than Beagle, respectively. These new methods provide a computational solution for calling genotypes in next-generation sequencing studies of tens to hundreds of thousands of samples. We plan to provide a public software implementation of our GPU method that can be run via cloud computing.
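
The idea of a proposal distribution that adapts as the chain runs can be illustrated with a toy example. The sketch below is not the method described in this abstract and is not taken from SNPTools: it is a minimal adaptive independence sampler over a handful of candidate "donor" haplotypes, written under several assumptions. The function name `adaptive_donor_sampler`, the `weights` argument (a stand-in for how well each donor explains a sample's genotype likelihoods), and the tally-based adaptation rule are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def adaptive_donor_sampler(weights, n_iter=5000, pseudocount=1.0, seed=0):
    """Toy adaptive independence sampler over donor indices (illustrative only).

    weights: strictly positive, unnormalised target weights; a hypothetical
             stand-in for how well each candidate donor haplotype explains
             one sample's genotype likelihoods.
    Returns (visit counts of the chain, final proposal distribution).
    """
    rng = random.Random(seed)
    k = len(weights)
    # Tallies behind the proposal; the pseudocount keeps every donor reachable.
    accepted = [pseudocount] * k
    current = rng.randrange(k)
    visits = Counter()

    for _ in range(n_iter):
        total = sum(accepted)
        proposal = [a / total for a in accepted]

        # Draw a candidate donor from the current (adapted) proposal.
        candidate = rng.choices(range(k), weights=proposal)[0]

        # Metropolis-Hastings ratio for an independence proposal.
        ratio = (weights[candidate] * proposal[current]) / (
            weights[current] * proposal[candidate]
        )
        if rng.random() < min(1.0, ratio):
            current = candidate
            # Adaptation step (assumed rule): favour donors that get accepted.
            accepted[current] += 1.0

        visits[current] += 1

    total = sum(accepted)
    return visits, [a / total for a in accepted]


if __name__ == "__main__":
    # One donor (index 4) explains the data far better than the others.
    visits, proposal = adaptive_donor_sampler([1, 1, 1, 1, 50, 1, 1, 1])
    print("most visited donor:", visits.most_common(1)[0][0])
    print("final proposal:", [round(p, 3) for p in proposal])
```

In this toy setting the proposal concentrates on the donor that is accepted most often, so fewer iterations are spent proposing poor matches. Updating the proposal from the chain's own history means the sampler is no longer a plain Markov chain, so a real implementation would need to handle that with care.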

You may contact the first author (during and after the meeting) at