Insights into the genetic architecture of African genomes: the African Genome Variation Project.
I. Tachmazidou, for the AGVP investigators. The Wellcome Trust Sanger Institute, Cambridge, United Kingdom.

   Genome-wide association studies in populations from sub-Saharan Africa (SSA) are eagerly anticipated, but there is a paucity of genetic data to inform powerful study design. Pronounced genetic diversity across ethnic groups within SSA, in conjunction with low levels of linkage disequilibrium (LD) and differences in haplotype structure, gives rise to statistical genetics challenges when designing and conducting genomic epidemiology studies. The African Genome Variation Project is a collaboration across the African Partnership for Chronic Disease Research, the Centre for Research on Genomics and Global Health and the Malaria Genomic Epidemiology Network. Our aim is to facilitate genome-wide association studies in diseases of relevance to African populations by providing first insights into the genetic variation landscape of different ethnic groups. To achieve this, we have genotyped 100 unrelated individuals from each of 18 ethnolinguistic groups from 7 SSA countries (Kenya, Nigeria, Uganda, Ethiopia, Ghana, The Gambia, South Africa) on the 2.5 million SNP Illumina platform. We are examining: 1) the allele frequency spectrum of variants on the chip; 2) patterns of LD; 3) the proportion of common variation captured by the array; 4) imputation-based approaches aiming to increase genetic association study power; and 5) analytical challenges and the need for new statistical genetics methods to address them. We find that, depending on the population examined, between 1.10 and 1.36 million SNPs have a minor allele frequency (MAF) > 5% and between 240,000 and 490,000 SNPs are monomorphic. We also find high levels of redundancy on the chip, calculated from the pairwise correlation between variants in each ethnolinguistic group; for example, 40-57% of common variants have at least one other variant on the chip with r² over 0.8, and 16-35% of common variants have a perfect proxy on the chip. Based on whole genome sequence data, we find that the array captures at most 70% of variants with MAF > 5% (50% for MAF > 1%) at an r² threshold of 0.8. To explore the utility of SSA groups to serve as imputation reference panels for other SSA populations, we imputed Baganda, Ethiopian and Zulu samples against the 1000 Genomes low coverage sequence data. The correlation between the input genotypes and the expected genotypes ranges from 60% to 75% for MAF < 5% and from 70% to 88% for MAF > 5%, depending on the population examined.
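
As an illustration only, the per-population chip summaries described above (number of common SNPs, number of monomorphic SNPs, and the fraction of common SNPs with a proxy on the chip) could be computed along the following lines. This is a minimal sketch, not the AGVP pipeline: it assumes genotypes coded as 0/1/2 alternate-allele counts with no missing data, uses the squared Pearson correlation of genotype counts as r², and the SNP names, toy genotypes and function names are hypothetical.

```python
from math import isclose


def maf(genotypes):
    """Minor allele frequency from 0/1/2 alternate-allele counts."""
    p = sum(genotypes) / (2 * len(genotypes))
    return min(p, 1 - p)


def r2(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a monomorphic SNP
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def chip_summary(snps, maf_min=0.05, r2_min=0.8):
    """Summarise one population's genotypes (dict: SNP name -> vector)."""
    mafs = {name: maf(g) for name, g in snps.items()}
    common = [name for name, f in mafs.items() if f > maf_min]
    monomorphic = [name for name, f in mafs.items() if f == 0]

    with_proxy = 0    # common SNPs with another SNP at r2 > r2_min
    with_perfect = 0  # common SNPs with another SNP at r2 == 1
    for a in common:
        best = max(
            (r2(snps[a], snps[b]) for b in snps if b != a),
            default=0.0,
        )
        if best > r2_min:
            with_proxy += 1
        if isclose(best, 1.0):
            with_perfect += 1

    n = len(common)
    return {
        "n_common": n,
        "n_monomorphic": len(monomorphic),
        "frac_with_proxy": with_proxy / n if n else 0.0,
        "frac_with_perfect_proxy": with_perfect / n if n else 0.0,
    }


# Hypothetical toy data: four SNPs typed in six individuals.
toy_snps = {
    "snp_a": [0, 1, 2, 1, 0, 2],
    "snp_b": [0, 1, 2, 1, 0, 2],  # identical to snp_a, so a perfect proxy
    "snp_c": [2, 1, 0, 2, 1, 0],
    "snp_d": [0, 0, 0, 0, 0, 0],  # monomorphic
}
toy_summary = chip_summary(toy_snps)
```

The all-pairs comparison is quadratic in the number of SNPs, so on a real array of about 2.5 million SNPs the search for proxies would in practice be restricted to nearby variants and run with dedicated genetics software.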

You may contact the first author (during and after the meeting) at