Precise identification of copy number variants in whole-genome data using Median Coverage Profiles. G. Glusman1, T. Farrah1, D. E. Mauldin1, A. B. Stittrich1, S. Ament1, L. Rowen1, J. C. Roach1, M. Brunkow1, M. Robinson1, A. F. A. Smit1, R. Hubley1, D. Bodian2, J. Vockley2, I. Shmulevich1, J. Niederhuber2, L. Hood1 1) Institute for Systems Biology, Seattle, WA; 2) Inova Translational Medicine Institute, Inova Health System, Falls Church, VA.

   The identification of DNA copy numbers from short-read sequencing data remains a challenge. Depth of sequencing coverage has strong sequence-specific fluctuations only partially explained by global parameters like %GC. Current analysis methods frequently misidentify structural variants, particularly hemizygous deletions in the 1-100 kb range. We developed a method that enables precise identification of copy number variants (CNVs) and rare deletions in single individual genomes, based on comparison to joint profiles derived from a large cohort of genomes. The Family Genomics group at the Institute for Systems Biology and the Inova Translational Medicine Institute are undertaking multiple collaborative projects related to understanding the genetic basis of disease. We have produced high-quality (>40x) whole-genome sequence (WGS) data from over 6000 individuals, including family trios and larger pedigrees. Our collective WGS dataset serves as a superb resource for modeling systematic failures and biases in sequencing technology, deriving population statistics, and developing and testing genome analysis software. We analyzed coverage in thousands of genomes sequenced using diverse technologies and processed using many versions of analysis pipelines. We scaled each genome to its total autosomal coverage, stratified by %GC. We then constructed joint profiles characterized by the median scaled value at each position along the genome. These Median Coverage Profiles (MCPs) take into account the diverse technologies and pipeline versions. MCPs can also help identify and correct batch effects. Normalization to the MCP followed by hidden Markov model (HMM) segmentation enables very efficient and precise detection of CNVs and large deletions in individual genomes. Use of multi-genome models improves our ability to analyze each individual genome, leading to fewer false-positive and false-negative findings. 
Several of the rare deletions we identified are prime disease-causing candidates in a variety of studies. We make available MCPs, HMM parameters, population frequencies for all CNVs, and tools for improving the quality of personal genome analyses, individually and in the context of family pedigrees. The increased sensitivity and specificity for individual genome analysis are crucial for achieving clinical-grade genome interpretation.
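
The workflow described above (scaling by %GC stratum, taking the per-position median across the cohort, normalizing one genome to that profile, and segmenting with an HMM) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of the per-stratum median as the scaling factor, the three copy states with Gaussian emissions, and the values of `sd` and `stay` are all assumptions chosen for illustration; coverage is assumed to be pre-binned into equal-width windows, each labeled with a %GC stratum.

```python
import math
from statistics import median


def scale_by_gc(coverage, gc_stratum):
    """Divide each bin's coverage by the median coverage of its %GC stratum.

    Assumption: the per-stratum median stands in for the genome's
    stratified autosomal coverage.
    """
    by_stratum = {}
    for cov, gc in zip(coverage, gc_stratum):
        by_stratum.setdefault(gc, []).append(cov)
    centre = {gc: median(vals) for gc, vals in by_stratum.items()}
    return [cov / centre[gc] if centre[gc] > 0 else 0.0
            for cov, gc in zip(coverage, gc_stratum)]


def build_mcp(scaled_genomes):
    """Median scaled coverage at each position across the cohort."""
    return [median(column) for column in zip(*scaled_genomes)]


def normalize_to_mcp(scaled, mcp):
    """Ratio of one genome to the profile; None where the profile is zero."""
    return [s / m if m > 0 else None for s, m in zip(scaled, mcp)]


def viterbi_segment(ratios, state_means=(0.5, 1.0, 1.5), sd=0.15, stay=0.99):
    """Most likely copy-state index per bin (0 = loss, 1 = normal, 2 = gain).

    Hypothetical parameters: Gaussian emissions around each state mean and
    a single self-transition probability shared by all states.
    """
    if not ratios:
        return []
    n = len(state_means)
    switch = (1.0 - stay) / (n - 1)
    log_trans = [[math.log(stay if i == j else switch) for j in range(n)]
                 for i in range(n)]

    def log_emit(x, mu):
        if x is None:          # unassessable bin: no evidence either way
            return 0.0
        return -0.5 * ((x - mu) / sd) ** 2

    score = [math.log(1.0 / n) + log_emit(ratios[0], mu) for mu in state_means]
    back = []
    for x in ratios[1:]:
        prev, score, pointers = score, [], []
        for j, mu in enumerate(state_means):
            best = max(range(n), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][j] + log_emit(x, mu))
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

In this sketch, a position with systematically low or high coverage across the cohort has a correspondingly low or high MCP value, so the ratio for a copy-neutral genome stays near 1.0 at that position. Only departures specific to the individual genome, such as a ratio near 0.5 across consecutive bins for a hemizygous deletion, push the HMM out of the normal state.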

You may contact the first author (during and after the meeting) at