A practical guide to study design, sample size requirement and statistical analyses methods for rare variant disease association studies. S. M. Leal, G. T. Wang, D. Zhang, Z. He, H. Dai, B. Li. Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030.

   We evaluated rare variant association (RVA) study designs and methods using real-world data from the NHLBI Exome Sequencing Project (ESP) as well as exome sequence data that were simulated using state-of-the-art demographic models, with purifying selection modeled after the empirical distribution of functional variants in ESP. Our simulated data are highly consistent with the real-world data in the distribution of singleton, doubleton and tripleton variants, as well as in the cumulative minor allele frequency (MAF) of variants, for both Europeans and Africans. Using resampled genotypes from ESP European American sequences, we evaluated the relative power of 10 RVA methods by analyzing 16,568 genes across the genome. We demonstrated that the method that is most powerful for one gene is not necessarily the most powerful for another, simply because of differences in genomic sequence context, i.e., the gene-specific MAF spectrum and distribution of functional variants, rather than phenotypic model assumptions. Using simulated data for European samples, we evaluated the impact of the phenotypic model, missing data, non-causal variants and the choice of empirical MAF cutoff in RVA analysis. We found that a variable effects model greatly favors variable threshold tests (e.g., VT), but the power gain of weighted burden tests (e.g., WSS) is marginal compared with the constant effect model. In the presence of strongly protective variants, SKAT-O and SKAT are consistently the most powerful tests, but they perform poorly when the protective effect is mild compared with the detrimental effect. The impact of non-causal variants and missing data is greater than that of the choice of RVA method, and enrichment of functional variants is the factor most crucial to the success of most RVA methods.
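The role of the empirical MAF cutoff in a burden-style analysis can be illustrated with a minimal sketch. This is not one of the 10 RVA methods evaluated above, only a simple collapsing test with a permutation p-value. The function names, the toy genotypes and the 1% cutoff are all assumptions made for illustration.

```python
import random


def carrier_status(genotypes, mafs, maf_cutoff=0.01):
    """Collapse a gene: 1 if an individual carries a minor allele at any
    site whose MAF is below the cutoff, otherwise 0."""
    rare_sites = [j for j, maf in enumerate(mafs) if maf < maf_cutoff]
    return [int(any(row[j] > 0 for j in rare_sites)) for row in genotypes]


def burden_permutation_pvalue(genotypes, phenotypes, mafs,
                              maf_cutoff=0.01, n_perm=2000, seed=1):
    """Two-sided permutation p-value for the difference in rare-variant
    carrier frequency between cases (1) and controls (0)."""
    carriers = carrier_status(genotypes, mafs, maf_cutoff)
    n_cases = sum(phenotypes)
    n_controls = len(phenotypes) - n_cases
    total_carriers = sum(carriers)

    def statistic(labels):
        case_carriers = sum(c for c, y in zip(carriers, labels) if y == 1)
        control_carriers = total_carriers - case_carriers
        return abs(case_carriers / n_cases - control_carriers / n_controls)

    observed = statistic(phenotypes)
    rng = random.Random(seed)
    labels = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any genotype-phenotype link
        if statistic(labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy gene with three sites; the third site is common and is
# therefore ignored by the 1% cutoff.
toy_mafs = [0.002, 0.004, 0.2]
toy_genotypes = ([[1, 0, 1]] * 10 + [[0, 0, 1]] * 10    # 20 cases
                 + [[0, 1, 0]] * 1 + [[0, 0, 1]] * 19)  # 20 controls
toy_phenotypes = [1] * 20 + [0] * 20
toy_carriers = carrier_status(toy_genotypes, toy_mafs)
toy_pvalue = burden_permutation_pvalue(toy_genotypes, toy_phenotypes, toy_mafs)
```

Raising `maf_cutoff` above 0.2 in this toy example would make nearly every individual a carrier and wash out the signal, which is one simple way to see why non-causal variants can matter more than the choice of test.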
The number of samples that need to be studied is highly dependent on gene size and the number of variant sites. For example, under the assumption of a moderate effect size for causal variants, i.e., an odds ratio of 2.0, genes with short coding regions (~400 bp) require >90,000 samples to achieve 80% power to detect an association at an exome-wide significance level of α = 2.5 × 10⁻⁶, while for average-sized genes (~1,400 bp) the required sample size is >50,000. We also showed that, for the exome chip design, the impact of excluding singletons and doubletons from the observed samples is minimal compared with the impact of the large proportion of variants that were excluded from the exome chip design because of inevitable sampling bias.
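The exome-wide significance level of 2.5 × 10⁻⁶ corresponds to a Bonferroni correction of 0.05 for roughly 20,000 genes. The sketch below shows that arithmetic together with a textbook normal-approximation sample size for comparing carrier frequencies between cases and controls. It is not the simulation-based procedure used in this study and is not expected to reproduce the sample sizes reported above. The function names and the cumulative carrier frequencies are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def exome_wide_alpha(n_genes=20_000, family_alpha=0.05):
    """Bonferroni-corrected per-gene significance level."""
    return family_alpha / n_genes


def case_carrier_freq(control_freq, odds_ratio):
    """Carrier frequency in cases implied by the control frequency and
    the odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def samples_per_group(control_freq, odds_ratio, alpha, power=0.80):
    """Approximate cases (= controls) needed for a two-sided
    two-proportion z-test on collapsed carrier status."""
    p0 = control_freq
    p1 = case_carrier_freq(p0, odds_ratio)
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return ceil(numerator / (p1 - p0) ** 2)


alpha = exome_wide_alpha()
# Hypothetical cumulative carrier frequencies: a shorter gene has fewer
# variant sites and therefore a lower cumulative frequency.
n_short_gene = samples_per_group(control_freq=0.002, odds_ratio=2.0, alpha=alpha)
n_average_gene = samples_per_group(control_freq=0.01, odds_ratio=2.0, alpha=alpha)
```

Under these assumptions the required sample size grows as the cumulative carrier frequency falls, which matches the direction of the gene-size dependence described above.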
