Efficient Bayesian mixed model analysis increases association power in large cohorts. P. Loh1,2, G. Tucker3, B. Bulik-Sullivan2,4, B. J. Vilhjalmsson1,2, H. K. Finucane3, K. Galinsky5, D. I. Chasman6, B. M. Neale2,4, B. Berger3, N. Patterson2, A. L. Price1,2,5 1) Department of Epidemiology, Harvard School of Public Health, Boston, MA; 2) Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA; 3) Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA; 4) Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA; 5) Department of Biostatistics, Harvard School of Public Health, Boston, MA; 6) Division of Preventive Medicine, Brigham and Women's Hospital, Boston, MA.
Linear mixed models (LMMs) are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, mixed model analysis is computationally demanding and is becoming infeasible as study sizes approach 100,000 samples. All existing methods rely on spectral analysis of a genetic relationship matrix (GRM) at time cost O(MN^2), where N is the number of samples and M is the number of SNPs. In addition, these methods implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power (Yang et al. 2014 Nat Genet). Here, we present a far more efficient mixed model association method, BOLT-LMM, which requires only a small number of O(MN) iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. In the special case of the infinitesimal model, BOLT-LMM achieves results equivalent to existing methods at dramatically reduced time and memory cost. Algorithmically, BOLT-LMM performs O(MN)-time conjugate gradient and variational iterations that operate directly on raw genotypes stored compactly in memory, computing a retrospective score statistic that is robust to confounding while circumventing the GRM entirely. For a simulated data set of 100,000 samples typed at 300,000 SNPs, BOLT-LMM required <1 day and <8 GB RAM, vs. >1 month and >150 GB RAM required by existing mixed model methods; the fold-reduction in time and memory cost increases with sample size. In BOLT-LMM analysis of lipid traits in 23,294 samples from the Women's Genome Health Study (WGHS), the Bayesian non-infinitesimal model achieved up to a 7% (s.e. 1%) increase in chi-squared test statistics across known associated loci compared to standard mixed model analysis and an 8% increase compared to standard marginal analysis, consistent with simulations.
In larger cohorts, theory and simulations show that the boost in chi-squared statistics, which is equivalent to a commensurate increase in effective sample size, increases with cohort size toward an asymptote of 1/(1 - h_g^2), where h_g^2 is the heritability explained by genotyped SNPs, leading to even larger increases in power. BOLT-LMM software is available at http://www.hsph.harvard.edu/alkes-price/software/.
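The idea of circumventing the GRM can be illustrated with a minimal sketch. This is not the BOLT-LMM implementation. It assumes the infinitesimal model only, a dense floating-point genotype matrix X (N samples by M SNPs, already normalized), and known variance components. The function names, parameter names, and the simulated data are all hypothetical. The sketch solves (sigma2_g XX^T/M + sigma2_e I) x = y by conjugate gradient, touching X only through O(MN) matrix-vector products, and also evaluates the asymptote 1/(1 - h_g^2) quoted above. The variational iterations, the retrospective score statistic, and the compact genotype storage are omitted.

```python
import numpy as np

def grm_free_matvec(X, v, sigma2_g, sigma2_e):
    """Return (sigma2_g * X X^T / M + sigma2_e * I) v without forming X X^T.

    Two matrix-vector products with X cost O(MN); the N x N GRM is never built.
    """
    M = X.shape[1]
    return sigma2_g * (X @ (X.T @ v)) / M + sigma2_e * v

def conjugate_gradient(matvec, b, tol=1e-8, max_iter=200):
    """Solve A x = b for symmetric positive definite A given only v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()              # residual b - A x, with x = 0 initially
    p = r.copy()              # current search direction
    rs_old = r @ r
    b_norm = np.sqrt(b @ b)
    for _ in range(max_iter):
        if np.sqrt(rs_old) <= tol * b_norm:
            break             # relative residual is small enough
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def asymptotic_boost(h2g):
    """Large-cohort limit of the chi-squared boost, 1 / (1 - h2g)."""
    return 1.0 / (1.0 - h2g)

# Hypothetical toy data: random values stand in for normalized genotypes.
rng = np.random.default_rng(0)
N, M = 200, 500
X = rng.standard_normal((N, M))
y = rng.standard_normal(N)

sol = conjugate_gradient(lambda v: grm_free_matvec(X, v, 0.5, 0.5), y)
```

At this toy size the result can be checked against a dense solve. At the cohort sizes discussed above, the dense N x N matrix is exactly what such an approach avoids storing. As a worked value, an assumed h_g^2 of 0.5 gives an asymptotic boost of 2.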