Detecting novel sequence insertions in 3000 individuals from short read sequencing data.
B. Kehr1, P. Melsted1,2, A. Jónasdóttir1, A. Jónasdóttir1, A. Sigurðsson1, A. Gylfason1, D. Guðbjartsson1,3, B. V. Halldórsson1,4, K. Stefánsson1,5
1) deCODE genetics/Amgen Inc., Reykjavík, Iceland; 2) Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland; 3) School of Engineering and Natural Sciences, University of Iceland, Reykjavík, Iceland; 4) Institute of Biomedical and Neural Engineering, Reykjavík University, Reykjavík, Iceland; 5) Faculty of Medicine, University of Iceland, Reykjavík, Iceland.
The detection of genomic structural variation (SV) has advanced tremendously in recent years due to progress in high-throughput sequencing technologies. Novel sequence insertions, sequences without similarity to a human reference genome, have received less attention than other types of SVs because they are computationally challenging to detect from short read sequencing data. Their detection inherently involves de novo assembly, which is not only computationally challenging but also requires high-quality data. While the reads from a single individual may not always meet this requirement, using reads from multiple individuals can increase the power to detect novel insertions. We have developed a method to accurately characterize non-reference insertions of 100 base pairs or longer on a population scale. Our input is a mapping of all read sequences, and we use a standard assembly tool to generate contigs from the unmapped reads. Instead of directly anchoring these contigs into the reference genome, we merge the contigs of different individuals into high-confidence sequences, improving quality and reliability. Subsequently, we anchor the merged sequences into the reference genome using read-pair information and LD mapping, and identify insertion positions at base-pair resolution using split reads. Finally, we genotype these variants in 3000 sequenced individuals and use pedigree information to impute them into 104,000 microarray-genotyped individuals, with the goal of associating the presence of an insertion with a disease phenotype. By considering the sequence reads of multiple individuals simultaneously, we determine both the sequences of the insertions and their locations more accurately. We identify 20% more insertions when considering multiple individuals simultaneously than when considering each individual separately. We find a large number of novel insertions, varying in frequency from 0.017% to 100%.
Insertions of higher frequency are commonly closely homologous to a sequence present in other primate genomes, suggesting that the inserted sequence is ancestral to humans. Insertions with no homology to primate sequence are skewed towards lower frequencies. Our experimental validation confirms that predicted insertions have a high probability of being true insertions.
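
The reported frequency range can be checked with simple arithmetic. The abstract does not state how frequencies were computed, so the sketch below assumes they are allele frequencies among the 3000 sequenced individuals, with diploid genotypes coded as 0, 1, or 2 copies of the insertion. Under that assumption, an insertion seen as a single copy has a frequency of 1/6000, which rounds to the lower bound of 0.017%, and an insertion carried on every chromosome has a frequency of 100%. The function name and the genotype coding are hypothetical and chosen only for illustration.

```python
def insertion_allele_frequency(genotypes):
    """Return the fraction of chromosomes that carry the insertion.

    Assumption (not stated in the abstract): each genotype is the number
    of insertion copies in one diploid individual, coded as 0, 1, or 2.
    """
    genotypes = list(genotypes)
    if not genotypes:
        raise ValueError("at least one genotype is required")
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("genotypes must be coded as 0, 1, or 2")
    # Each diploid individual contributes two chromosomes.
    return sum(genotypes) / (2 * len(genotypes))


# One heterozygous carrier among 3000 individuals (a singleton).
singleton = [1] + [0] * 2999
# Every individual homozygous for the insertion.
fixed = [2] * 3000

print(f"{insertion_allele_frequency(singleton):.3%}")  # 0.017%
print(f"{insertion_allele_frequency(fixed):.3%}")      # 100.000%
```

If frequencies were instead computed over the imputed set of 104,000 individuals, the smallest representable frequency would be lower, so the match with 1/6000 is only suggestive of which cohort the range refers to.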