A statistical framework to leverage broad metabolite data in elucidating the associations between genetics and disease. C. Churchhouse, Slim Initiative in Genomic Medicine for the Americas (SIGMA) Type 2 Diabetes Consortium. Broad Institute of Harvard and MIT, Cambridge, MA.
Genome-wide association studies (GWAS) have been applied to a broad spectrum of complex diseases, and large consortium efforts have increased the power to find true associations. Despite their success in reproducibly identifying risk variants, GWAS have, in general, fallen short of elucidating pathophysiology. By combining genomic data with traits that may be biomarkers of risk or intermediates on the path to a disease end point, we may shed more light on the underlying etiology. The advent of systematic metabolite profiling has enabled the quantification of thousands of metabolites in vivo, making variation within the human metabolome accessible to analytical approaches, much like variation in the genome. Large studies have been published in which broad panels of metabolites were analyzed as traits in traditional GWAS or examined for links to disease risk, but many challenges remain in the use of metabolite data, some of which are analogous to those met in the field of quantitative genetics. This abstract describes progress on statistical methods developed to address these challenges. One is identifying and accounting for technical confounding, such as batch effects, which can introduce cryptic structure into the metabolite data. We have found, for example, that patterns of missing values relate to the order in which samples were profiled, so statistical quality control (QC) methods are required to avoid bias and false positives in downstream analyses. A further consideration is the highly correlated structure of metabolites, which results from the underlying molecular pathways through which they are related. Our approach leverages existing knowledge of these metabolic networks both to inform QC techniques and to reduce the dimensionality of the data, and thus the penalty incurred for multiple hypothesis testing.
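As an illustration of the kind of QC check implied by the observation that missing values track profiling order, the following is a minimal sketch and not the method used in this work. It assumes one metabolite at a time and a list of missing-value flags already sorted by run order; the function name `missingness_order_pvalue` and the choice of a permutation test on mean run position are hypothetical choices made for this example.

```python
import random
from statistics import mean


def missingness_order_pvalue(is_missing, n_perm=2000, seed=0):
    """Permutation test of whether missingness depends on run order.

    is_missing: list of bools for one metabolite, indexed by the order
    in which samples were profiled (an assumed input layout).
    Returns (observed_statistic, p_value).
    """
    positions = range(len(is_missing))
    n_missing = sum(is_missing)
    if n_missing == 0 or n_missing == len(is_missing):
        raise ValueError("need both missing and observed values")

    def stat(flags):
        # Absolute difference in mean run position, missing vs. observed.
        miss = [p for p, f in zip(positions, flags) if f]
        obs = [p for p, f in zip(positions, flags) if not f]
        return abs(mean(miss) - mean(obs))

    observed = stat(is_missing)
    rng = random.Random(seed)
    shuffled = list(is_missing)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between flag and position
        if stat(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy input: the last 10 of 50 samples in the run are missing.
    late_missing = [False] * 40 + [True] * 10
    print(missingness_order_pvalue(late_missing))
```

A small p-value for a metabolite would flag its missingness as related to run order, suggesting that it be handled by batch-aware QC and not treated as missing at random.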
Another challenge we address is potential confounding due to population structure and admixture, which will become more problematic as metabolomics is applied to larger cohorts and a wider range of ethnicities. We will illustrate these methods on an empirical data set that includes ~12,000 metabolites measured in 865 serum samples collected at baseline in the longitudinal Mexico City Diabetes Study. Additionally, we have OMNI array, exome chip, and exome sequence genotypes with which to investigate the application of these methods to understanding the associations among the triad of genetic variation, metabolism, and type 2 diabetes risk.
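To make the population-structure concern concrete, the sketch below simulates a single ancestry axis that shifts both allele frequency and metabolite level, with no true genetic effect, and compares a naive regression with one adjusted for that axis. It is a toy example under stated assumptions, not the approach used in this work: the simulated data, the effect sizes, and the function names `genotype_effect` and `simulate_confounding` are all hypothetical, and in practice the ancestry covariate would be estimated (for example, from genotype principal components) and not known directly.

```python
import numpy as np


def genotype_effect(metabolite, genotype, covariates=None):
    """Least-squares slope of metabolite on genotype.

    covariates: optional (n_samples, k) array of adjustment variables,
    e.g. ancestry components (assumed to be available).
    """
    n = len(metabolite)
    columns = [np.ones(n), np.asarray(genotype, dtype=float)]
    if covariates is not None:
        columns.append(np.asarray(covariates, dtype=float))
    design = np.column_stack(columns)
    coef = np.linalg.lstsq(design, metabolite, rcond=None)[0]
    return coef[1]  # coefficient on genotype


def simulate_confounding(n=2000, seed=0):
    """Simulate ancestry-driven confounding with zero true genetic effect."""
    rng = np.random.default_rng(seed)
    ancestry = rng.normal(size=n)
    allele_freq = 1.0 / (1.0 + np.exp(-ancestry))  # frequency varies with ancestry
    genotype = rng.binomial(2, allele_freq)        # allele counts 0, 1, or 2
    metabolite = 1.0 * ancestry + rng.normal(size=n)  # depends on ancestry only
    naive = genotype_effect(metabolite, genotype)
    adjusted = genotype_effect(metabolite, genotype, ancestry[:, None])
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = simulate_confounding()
    print(f"naive slope: {naive:.3f}, ancestry-adjusted slope: {adjusted:.3f}")
```

In this simulation the naive slope is inflated purely by shared ancestry, while the adjusted slope should fall close to zero, which is the behavior an adjustment for structure is meant to achieve.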
You may contact the first author (during and after the meeting) at