Second-generation PLINK: rising to the challenge of larger and richer datasets.

C. C. Chang (1,2), C. C. Chow (3), L. C. A. M. Tellier (2,4), S. Vattikuti (3), S. M. Purcell (5,6,7,8), J. J. Lee (3,9)

1) BGI Hong Kong, 16 Dai Fu Street, Tai Po Industrial Estate, Tai Po, N.T., Hong Kong
2) BGI Cognitive Genomics Lab, Building No. 11, Bei Shan Industrial Zone, Yantian District, Shenzhen, China 518083
3) Mathematical Biology Section, NIDDK/LBM, National Institutes of Health, Bethesda, MD 20892
4) Bioinformatics Centre, University of Copenhagen, 2200 Copenhagen, Denmark
5) Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142
6) Division of Psychiatric Genomics, Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, NY 10029
7) Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029
8) Analytic and Translational Genetics Unit, Psychiatric and Neurodevelopmental Genetics Unit, Massachusetts General Hospital, Boston, MA 02114
9) Department of Psychology, University of Minnesota Twin Cities, Minneapolis, MN 55455

PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format.
To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information.
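
As a rough illustration of what bit-level parallelism means here, the sketch below packs 32 two-bit genotype codes into one 64-bit word and tallies all four codes with a handful of bitwise operations and population counts, rather than looping over samples. It is not PLINK's implementation: the function names `pack_genotypes` and `count_codes` are hypothetical, the codes 0 to 3 are treated as opaque labels, and Python is used only for readability (a production tool would operate on machine words in C/C++).

```python
# Illustrative sketch only; not taken from the PLINK codebase.
# Assumption: each sample's genotype is stored as a 2-bit code (0..3),
# with 32 codes packed into one 64-bit word.

MASK_LOW = 0x5555555555555555  # selects the low bit of every 2-bit field


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pack_genotypes(codes):
    """Pack up to 32 two-bit codes into a single 64-bit word."""
    if len(codes) > 32:
        raise ValueError("at most 32 codes fit in one 64-bit word")
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 3) << (2 * i)
    return word


def count_codes(word, n_samples):
    """Return [n0, n1, n2, n3], the number of fields holding each code.

    All fields are examined at once: the low and high bits of every field
    are separated with one mask each, then combined and popcounted.
    """
    low = word & MASK_LOW          # low bit of each field
    high = (word >> 1) & MASK_LOW  # high bit of each field, shifted down
    n3 = popcount(low & high)      # both bits set
    n1 = popcount(low & ~high)     # only the low bit set
    n2 = popcount(high & ~low)     # only the high bit set
    n0 = n_samples - n1 - n2 - n3  # remaining real samples (padding is zero)
    return [n0, n1, n2, n3]


if __name__ == "__main__":
    codes = [0, 1, 2, 3, 3, 2, 2, 0, 1, 3]
    print(count_codes(pack_genotypes(codes), len(codes)))
```

The same idea extends to arrays of words: the per-word work is constant regardless of how many of the 32 fields are in use, which is where the speedup over a per-sample loop comes from.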
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
You may contact the first author (during and after the meeting) at