Accurate read mapping using a graph-based human pan-genome. W. Lee1,2, E. Garrison2, D. Kural1,2, G. Marth2,3 1) Seven Bridges Genomics Inc., Cambridge, MA; 2) Department of Biology, Boston College, Chestnut Hill, MA; 3) Department of Human Genetics, the University of Utah, Salt Lake City, UT.
Current short-read mapping algorithms align reads from a newly sequenced individual to a species-specific reference genome sequence. Many reads fail to map or are mapped incorrectly because each new genome typically contains many genetic variants not captured by the reference sequence. As a result, while SNPs and short INDELs can be detected from such mappings, longer structural variant alleles and more complex variants are often missed. Furthermore, undetected structural variants in a newly sequenced genome often cause mismappings that lead to false positive variant predictions.
As high-profile projects discover large numbers of novel variants, accounting for those variants when aligning newly sequenced reads becomes imperative and vastly improves sensitivity. This is because most variants found in a single individual are shared with other individuals of the same species. We have therefore developed a novel whole-genome read mapper that takes known variants into account, in addition to the reference genome, to map reads more accurately. Our approach is to construct a directed acyclic graph (DAG) representing the reference sequence and the alternate alleles. Our mapper works in two phases. In the first phase, read localization, we identify regions of the DAG where a read is likely to map. In the second phase, local alignment, we align the read against the DAG using a graph-aware extension of the Smith-Waterman optimal alignment algorithm.
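To make the second phase concrete, the following is a minimal sketch of Smith-Waterman local alignment extended to a DAG. It is not the authors' implementation: the function name `align_to_dag`, the node/predecessor layout, the linear gap penalty, and the score values are all assumptions chosen for illustration, and the read localization phase is not shown. The only change from the linear algorithm is that the first base of a node takes its incoming DP column from the best of its predecessor nodes' final columns.

```python
from typing import List, Tuple

# Illustrative scores (assumptions, not taken from the abstract); linear gap penalty.
MATCH, MISMATCH, GAP = 2, -2, -3


def align_to_dag(read: str, nodes: List[str], preds: List[List[int]]) -> Tuple[int, int, int]:
    """Local alignment of `read` against a sequence DAG.

    nodes: node sequences listed in topological order.
    preds: preds[i] holds the indices of the predecessors of node i (all < i).
    Returns (best_score, node_index, offset_in_node) for the end of the best
    local alignment, or (0, -1, -1) if nothing scores above zero.
    """
    n = len(read)
    last_cols: List[List[int]] = []  # final DP column of every processed node
    best = (0, -1, -1)
    for i, seq in enumerate(nodes):
        if preds[i]:
            # Graph-aware step: element-wise maximum over predecessor columns.
            prev = [max(last_cols[p][j] for p in preds[i]) for j in range(n + 1)]
        else:
            prev = [0] * (n + 1)
        for k, base in enumerate(seq):
            cur = [0] * (n + 1)
            for j in range(1, n + 1):
                sub = MATCH if read[j - 1] == base else MISMATCH
                cur[j] = max(
                    0,                  # start a new local alignment
                    prev[j - 1] + sub,  # align read base j with this graph base
                    prev[j] + GAP,      # graph base aligned to a gap in the read
                    cur[j - 1] + GAP,   # read base aligned to a gap in the graph
                )
                if cur[j] > best[0]:
                    best = (cur[j], i, k)
            prev = cur
        last_cols.append(prev)
    return best


# Toy graph: left flank -> (optional insertion) -> right flank.
nodes = ["ACGTACGA", "TTTT", "GGCCAAGG"]
preds = [[], [0], [0, 1]]
read = "ACGATTTTGGCC"  # spans the insertion
print(align_to_dag(read, nodes, preds))  # (24, 2, 3)
```

In this toy example the read carries the inserted sequence, so it aligns end to end along the path through the insertion node. Against the reference path alone (nodes 0 and 2 only), no full-length alignment exists and the best local score is lower, which mirrors the kind of mismapping described above.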
We demonstrate the power of this new read mapper by detecting mobile element insertions (MEIs) in a human sample. Using a DAG constructed from known MEI sites in the YRI population of the 1000 Genomes Project, we are able to detect 95% of such sites present in a simulated genome. Similar results are achieved when detecting MEIs in NA12878. Moreover, by using our mappings, which take known MEIs into account, we are able to eliminate 95% of the SNPs and INDELs falsely called at or near MEI sites in traditionally mapped sequence alignments. These false positives are almost always caused by mismapped reads that contain MEI sequence present in the sample but absent from the reference genome. Our mapper, which accounts for these insertions within the DAG, is able to align such reads correctly. These initial results indicate that read mapping that accounts for known variants can substantially improve read placement and support large improvements in variant calling accuracy.