Fast PCA of very large samples in linear time. K. J. Galinsky1, P. Loh2,3, G. Bhatia2, S. Georgiev4, S. Mukherjee5, N. J. Patterson3, A. L. Price1,2,3 1) Department of Biostatistics, Harvard School of Public Health, Boston, MA; 2) Department of Epidemiology, Harvard School of Public Health, Boston, MA; 3) Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA; 4) Stanford University School of Medicine, Palo Alto, CA; 5) Department of Statistical Science, Duke Institute for Genome Sciences & Policy, Durham, NC.

   Principal components analysis (PCA) is an effective tool for inferring population structure and correcting for population stratification in genetic data. Traditionally, PCA runs in O(MN2+N3 ) time, where M is the number of variants and N is the number of samples. Here, we describe a new algorithm, fastpca, for approximating the top K PCs that runs in time O(MNK), making use of recent advances in random low-rank matrix approximation algorithms (Rokhlin et al. 2009). fastpca avoids computing the GRM and associated computational and memory storage costs, enabling PCA of very large datasets on standard hardware. We estimated the top 10 PCs of the WTCCC dataset (16k samples, 101k variants) in roughly 7 minutes while consuming 1GB of RAM, compared to 1 hour and 2.5GB for PLINK2. The fastpca approximation was extremely accurate (r2>99% between all fastpca and PLINK2 PCs). The improvement in running time becomes even larger at larger samples sizes; for example, fastpca estimated the top 10 PCs of a simulated data set with 100k samples and 300k variants in 135 minutes 8.5GB of RAM, vs. an estimated 350 hours and 85GB of RAM using PLINK2. A recently published O(MN2) time method, flashpca, did not complete on this data set due to exceeding 40GB memory requirement. All of these analyses were based on LD-pruning SNPs with r2>0.2, which leads to much more accurate PCs in simulations as compared to retaining all SNPs; more complex LD-adjustment strategies provide only a small further improvement.

You may contact the first author (during and after the meeting) at