A new method for genotype calling and phasing for the 1000 Genomes Project leads to improved downstream imputation accuracy.
O. Delaneau (1), A. Menelaou (2), J. Marchini (1), The 1000 Genomes Project Consortium.
1) Department of Statistics, University of Oxford, Oxford, Oxfordshire, United Kingdom; 2) University Medical Centre Utrecht, Utrecht, Netherlands.
The 1000 Genomes Project has pioneered the use of low-coverage sequencing, followed by LD-based genotype refinement, for the construction of comprehensive haplotype reference panels. This approach has become popular in many other studies of population and disease cohorts. We have improved on this strategy by developing a new two-step LD-based genotype refinement approach. In the first step, we take advantage of the genome-wide SNP chip genotypes available on the project samples and phase them using SHAPEIT2 to create a dense haplotype scaffold across the genome. Since a large proportion of the samples are part of trios and duos, the haplotype scaffold is very accurate. In the second step, we use the low-coverage sequencing data to phase each novel variant site onto the haplotype scaffold. To do this, we have extended the SHAPEIT2 approach to work with low-coverage sequencing data and to accommodate a phased haplotype scaffold. We have applied this approach to the Phase 1 dataset, as well as to new, larger sets of project samples. On the Phase 1 dataset we produce genotype call sets whose error rates are at least 25% lower than those of other methods. More importantly, our new haplotype reference panel leads to improved downstream imputation accuracy in GWAS samples. For example, for SNPs with a MAF of 1% we observe an increase of 0.1 on the R² scale when we compare imputed genotypes to validation genotypes obtained from high-coverage Complete Genomics sequencing data. A key advantage of this scaffold-based approach is that other variant types, such as indels, deletions, STRs and multi-allelic variants, can also be phased onto the haplotype scaffold in a highly parallel scheme. For example, we have also developed new methods that can phase deletions, which have variable ploidy, and multi-allelic variants, such as small regions containing SNPs and indels, onto the scaffold. Overall, this strategy is being adopted to process the final 1000 Genomes Project release.
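To make the imputation accuracy measure concrete, the following is a minimal sketch of how an R² value at a single variant might be computed. It assumes that R² means the squared Pearson correlation between imputed allele dosages (values between 0 and 2) and validation genotypes (0, 1 or 2); the abstract does not state the exact definition used. The function name `imputation_r2` and the toy values are hypothetical and serve only as illustration.

```python
import math


def imputation_r2(imputed_dosages, validation_genotypes):
    """Squared Pearson correlation between imputed dosages and
    validation genotypes at one variant (assumed definition of R2).

    Returns NaN when either vector has no variation, because the
    correlation is undefined in that case.
    """
    if len(imputed_dosages) != len(validation_genotypes):
        raise ValueError("inputs must have the same number of samples")
    n = len(imputed_dosages)
    if n < 2:
        return math.nan

    mean_x = sum(imputed_dosages) / n
    mean_y = sum(validation_genotypes) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in imputed_dosages)
    syy = sum((y - mean_y) ** 2 for y in validation_genotypes)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(imputed_dosages, validation_genotypes)
    )

    if sxx == 0 or syy == 0:
        return math.nan
    return (sxy * sxy) / (sxx * syy)


# Toy example (made-up values, not project data): six samples at one site.
dosages = [0.05, 0.10, 0.90, 0.00, 1.80, 0.20]
truth = [0, 0, 1, 0, 2, 0]
print(imputation_r2(dosages, truth))
```

In practice such a value would be computed per variant and then averaged within minor allele frequency bins, so that accuracy at rare variants (for example, MAF near 1%) can be compared between reference panels.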