Platinum Genomes: A systematic assessment of variant accuracy using a large family pedigree.
M. A. Eberle (1), M. Kallberg (2), H.-Y. Chuang (2), P. Tedder (1), S. Humphray (1), D. Bentley (1), E. H. Margulies (1)
1) Scientific Research, Illumina Cambridge, Ltd, Saffron Walden, Essex, United Kingdom; 2) Dept Bioinformatics, Illumina, Inc, San Diego, CA, USA.
As next-generation sequencing technologies become widely adopted for clinical applications, it is essential that we can systematically assess the accuracy of variant calls generated from these data. At present, however, no truth dataset of variant calls exists for a diploid or cancer genome. Instead, we have relied on measures of concordance with calls from alternative technologies (such as Sanger sequencing or microarrays), or on tests for inheritance errors in parent-parent-child trios, as proxies for sensitivity and accuracy. Both approaches have limitations that preclude us from measuring the true accuracy of sequencing technologies and variant calling algorithms. We have initiated a project to systematically identify all variation in a large three-generation family (the CEPH/Utah pedigree 1463). Both the raw sequence data and the variant calls are being made publicly available. Our initial approach to generating an extremely high-confidence set of variant calls has several key features. First, all sequence data have been generated with the latest PCR-free techniques and sequenced to a higher-than-usual depth (~50x) to maximize sensitivity in low-coverage regions. Second, we have determined the haplotype inheritance structure of the pedigree and used this information to increase our sensitivity for detecting errors. Third, several variant calling algorithms have been used to leverage joint calling approaches and maximize the detection of a broad set of SNVs and indels.
To illustrate the increased sensitivity of error and accuracy calculations when multiple siblings are analyzed in parallel, we have analyzed the single-nucleotide polymorphisms (SNPs) and indels within the parents and eleven offspring of this family. From this analysis, we have identified over 4.7M SNPs and 640K indels that we predict are correctly genotyped across the parents and eleven offspring, corresponding to an additional ~360K SNPs and ~95K indels per sample compared with a normal quality-filtered call set. For the variants that show Mendelian conflicts in the pedigree, we have found that the majority are attributable to cell line mutations, including ~2000 cell line de novo SNPs per sample and large cell line deletions. We will present this method and assess the role that de novo cell line mutations and alignment errors play in deviations from Mendelian inheritance.
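As a rough illustration of why analyzing several siblings against a known haplotype inheritance structure can expose errors that a per-trio test misses, the following is a minimal Python sketch. It is not the method used in this work: the function names, the representation of genotypes as allele pairs, and the `inheritance` vector (which parental haplotype, 0 or 1, each child received at a site) are all hypothetical and chosen only for the example.

```python
def trio_consistent(father, mother, child):
    """Classic trio test: the child carries one allele from each parent."""
    return any(sorted((a, b)) == sorted(child) for a in father for b in mother)


def pedigree_consistent(father, mother, children, inheritance):
    """Test all children jointly against an assumed inheritance vector.

    father, mother: unphased genotypes as allele pairs, e.g. (0, 1).
    children: list of unphased child genotypes.
    inheritance: list of (p, m) pairs; p and m say which paternal and
        maternal haplotype (0 or 1) each child received (hypothetical input).
    Returns True if some phasing of the parents explains every child.
    """
    for f_phase in (father, father[::-1]):      # try both paternal phasings
        for m_phase in (mother, mother[::-1]):  # try both maternal phasings
            if all(
                sorted((f_phase[p], m_phase[m])) == sorted(child)
                for child, (p, m) in zip(children, inheritance)
            ):
                return True
    return False


# Toy site: father heterozygous, mother homozygous reference.
father, mother = (0, 1), (0, 0)
children = [(0, 1), (0, 0)]

# Each trio on its own looks fine.
trio_results = [trio_consistent(father, mother, c) for c in children]

# If both children inherited the SAME paternal haplotype, they cannot
# differ in the paternal allele, so the joint test flags a conflict.
same_haplotype = pedigree_consistent(father, mother, children, [(0, 0), (0, 0)])

# If they inherited DIFFERENT paternal haplotypes, the site is consistent.
diff_haplotype = pedigree_consistent(father, mother, children, [(0, 0), (1, 0)])
```

In this toy case both trio tests pass, yet the joint test rejects the site when the two children are assumed to share a paternal haplotype. With eleven offspring the number of such joint constraints grows, which is the intuition behind the increased sensitivity described above.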