A Novel Statistical Framework for Using Out of Study Control Groups in Association Studies with Next Generation Data. A. Derkach1, T. Chiang2, L. Addis3, S. Dobbins4, I. Tomlinson5, R. Houlston4, D. K. Pal3, J. Gong2, L. J. Strug2,6 1) Statistics, University of Toronto, Toronto, ON, Canada; 2) Program in Child Health Evaluative Sciences, the Hospital for Sick Children, Toronto, ON, Canada; 3) Department of Clinical Neuroscience, Institute of Psychiatry, King's College London, UK; 4) Institute of Cancer Research, London, UK; 5) Wellcome Trust Centre for Human Genetics, Oxford, UK; 6) Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada.
Genome-wide next generation sequence (NGS) data, such as those from the 1000 Genomes Project, are publicly available. However, these data have generally been used as a blunt comparative tool to identify novel or rare variants, and have not been properly exploited as controls for association studies. One explanation for this underutilization may be the existence of several potential biases or confounding factors, such as differences in sequencing platforms; alignment, SNP, and variant-calling algorithms; read depth; and selection thresholds. Here we focus on the effect of read depth and the bioinformatic aspects of variant calling when comparing allele frequencies between cases and controls that were resequenced as part of different experiments. We assume that other potential confounding factors are reasonably well matched. We illustrate analytically, and by simulation, how differences in read depth and variant screening parameters affect Type 1 error. We propose a novel likelihood-based method that re-purposes and extends an approach by Skotte et al. (2012). We suggest substituting genotype calls with their expected values given the observed sequence data to eliminate read depth bias from the estimation of minor allele frequency (MAF). We then incorporate read depth differences into the variance estimation to control for between-study variation in read depth. We conduct a comprehensive simulation study to show that our method controls Type 1 error when cases and controls are resequenced at different read depths, and show that this applies to association studies using single or multiple variants. We applied this method to NGS data from a 600 kb linkage region for an epilepsy endophenotype present in ~2% of the population. We used long-range PCR and NGS on 27 epilepsy cases, with an average read depth of 197x.
Using the BAM files from the 174 low read depth (LRD) 1000 Genomes controls (release 21/05/11) and the 27 high read depth (HRD) epilepsy cases, we calculated the expected genotypes given the observed sequence reads to compare the two groups. We compared our findings to an analysis using variant calls from the 27 HRD epilepsy cases and 200 HRD controls sequenced by Complete Genomics (~35x). We show that the proposed method removes bias and identifies the same associated variants as the analysis with the HRD control group. In conclusion, out-of-study control groups can be used in association studies as a way to prioritize variants for follow-up in more focussed studies.
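To make the idea of replacing a hard genotype call with its expected value concrete, the following is a minimal sketch, not the authors' implementation. It assumes a biallelic site, a single symmetric per-read error rate (`err`), a Hardy-Weinberg prior on genotypes, and a MAF that is supplied rather than estimated; the function names are hypothetical and chosen only for illustration.

```python
def genotype_likelihoods(n_ref, n_alt, err=0.01):
    """P(reads | genotype g) for g = 0, 1, 2 copies of the minor allele.

    Assumed error model: each read independently shows the minor allele
    with probability (g/2)*(1 - err) + (1 - g/2)*err.
    """
    liks = []
    for g in (0, 1, 2):
        p_alt = (g / 2) * (1 - err) + (1 - g / 2) * err
        liks.append(p_alt ** n_alt * (1 - p_alt) ** n_ref)
    return liks


def expected_genotype(n_ref, n_alt, maf, err=0.01):
    """E[G | reads]: posterior mean genotype under a Hardy-Weinberg prior.

    Assumes 0 < maf < 1 so that the posterior is well defined.
    """
    priors = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    liks = genotype_likelihoods(n_ref, n_alt, err)
    weights = [lik * prior for lik, prior in zip(liks, priors)]
    total = sum(weights)
    return sum(g * w for g, w in enumerate(weights)) / total


def maf_from_expected_genotypes(dosages):
    """MAF estimate: mean expected genotype divided by two."""
    return sum(dosages) / (2 * len(dosages))
```

Under these assumptions, a sample with no reads at a site contributes the prior mean `2 * maf` rather than a forced call, while a sample with many balanced reads contributes a value very close to 1. This is the sense in which low and high read depth samples can be placed on a common scale before allele frequencies are compared; the variance adjustment for read depth described in the abstract is not shown here.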