A generalized human reference as a graph of genomic variation.
E. Garrison (1), D. Kural (1,2), A. Ward (1), W. P. Lee (1), G. Marth (1)
(1) Biology, Boston College, Chestnut Hill, MA; (2) Seven Bridges Genomics, Cambridge, MA.
The linear reference genome provides a straightforward basis for analysis, but this convenience also limits the ability of researchers to understand complex forms of genomic variation. The short reads used in resequencing studies must be mapped to the single haplotype of the reference genome, which biases ascertainment towards small variants that are unlikely to disrupt read placement, such as SNPs and short indels. Consequently, the detection of more complex divergences from the reference genome, such as longer indels, structural variants, and clustered variants, requires large expenditures on sequencing and analysis.
Much of the genomic variation in humans is shared, and thus the haplotypes detected in many individuals can be pooled into a combined reference containing the vast majority of variation likely to be found in a newly sequenced sample. To generate this combined representation of genomic variation, we propose a graph genome reference (GGR). Nodes in the GGR are sequences observed in the population, and edges represent possible linkages between adjacent sequences. The haplotypes of individuals in a population are thus a subset of the possible paths through this graph, a property that allows the graph to be used as a tool to reduce bias when detecting known sequence variants via resequencing.
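The node, edge, and path structure described above can be illustrated with a minimal sketch. The node identifiers, sequences, and helper names (`walk_paths`, `spell`, `is_path`) are hypothetical and chosen only for illustration; this is not the authors' data model or implementation. The toy graph encodes one SNP (A/G) and one optional insertion (GGA), and each haplotype corresponds to one path from the first node to the last.

```python
from typing import Dict, Iterator, List

# Hypothetical toy graph: node id -> sequence observed in the population.
nodes: Dict[str, str] = {
    "n1": "ACGT",
    "n2": "A",    # reference allele of a SNP
    "n3": "G",    # alternate allele of the same SNP
    "n4": "TTC",
    "n5": "GGA",  # insertion carried by some haplotypes only
    "n6": "CAT",
}

# Edges record which sequences may directly follow one another.
edges: Dict[str, List[str]] = {
    "n1": ["n2", "n3"],
    "n2": ["n4"],
    "n3": ["n4"],
    "n4": ["n5", "n6"],
    "n5": ["n6"],
    "n6": [],
}


def walk_paths(start: str, end: str) -> Iterator[List[str]]:
    """Yield every path from start to end (the graph is assumed acyclic)."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            yield path
            continue
        for nxt in edges[path[-1]]:
            stack.append(path + [nxt])


def spell(path: List[str]) -> str:
    """Concatenate node sequences along a path to obtain a haplotype."""
    return "".join(nodes[n] for n in path)


def is_path(path: List[str]) -> bool:
    """Check that every consecutive pair of nodes is joined by an edge."""
    return all(b in edges[a] for a, b in zip(path, path[1:]))


if __name__ == "__main__":
    for p in walk_paths("n1", "n6"):
        print(p, spell(p))
```

In this sketch the linear reference is just one of the paths (for example `n1, n2, n4, n6`), and an individual's haplotype is valid in the graph exactly when `is_path` holds for its node sequence.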
Here, we describe the application of this structure to variant detection. We have developed a method to align short reads to sequence graphs, which we use to collect evidence for putative or known variants in a variety of analytical contexts. By realigning short sequence reads to a GGR of putative variants, we produce an ensemble variant detector capable of coherently integrating signals from a wide array of resequencing and assembly-based variant detection approaches. Our method enables the accurate characterization of contexts in which variants overlap or are embedded in other variants. The application of haplotype-based variant detection to reads aligned to a GGR allows the determination of physical linkage of variants using primary sequencing observations. Using a GGR of known, high-confidence variants as a basis for mapping provides the benefits of multi-sample variant detection without requiring the centralized analysis of raw sequencing data.
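To make the idea of aligning a read to a sequence graph concrete, the following brute-force sketch scores a read against every path of a small toy graph and reports the best one. It is an assumption-laden illustration, not the alignment method described in this abstract: it enumerates all paths (feasible only for tiny acyclic graphs), uses a plain unit-cost edit distance in which the read must align in full but may start and end anywhere in the path sequence, and all names (`NODES`, `EDGES`, `paths`, `fitting_distance`, `align_to_graph`) are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy graph with one SNP (n2/n3) and one optional insertion (n5).
NODES: Dict[str, str] = {
    "n1": "ACGT", "n2": "A", "n3": "G",
    "n4": "TTC", "n5": "GGA", "n6": "CAT",
}
EDGES: Dict[str, List[str]] = {
    "n1": ["n2", "n3"], "n2": ["n4"], "n3": ["n4"],
    "n4": ["n5", "n6"], "n5": ["n6"], "n6": [],
}


def paths(node: str, end: str) -> Iterator[List[str]]:
    """Yield every path from node to end (the graph is assumed acyclic)."""
    if node == end:
        yield [node]
        return
    for nxt in EDGES[node]:
        for rest in paths(nxt, end):
            yield [node] + rest


def fitting_distance(read: str, ref: str) -> int:
    """Minimum edits to align the whole read to any substring of ref."""
    # Row 0 is all zeros: the alignment may start anywhere in ref for free.
    prev = [0] * (len(ref) + 1)
    for i in range(1, len(read) + 1):
        cur = [i] + [0] * len(ref)
        for j in range(1, len(ref) + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or mismatch
                         prev[j] + 1,         # extra base in the read
                         cur[j - 1] + 1)      # base skipped in ref
        prev = cur
    # The alignment may also end anywhere in ref for free.
    return min(prev)


def align_to_graph(read: str, start: str = "n1",
                   end: str = "n6") -> Tuple[int, List[str]]:
    """Return (edit distance, path) for the best-scoring path."""
    return min(
        (fitting_distance(read, "".join(NODES[n] for n in p)), p)
        for p in paths(start, end)
    )


if __name__ == "__main__":
    read = "GTGTTCGGAC"  # carries both the G allele and the insertion
    linear_ref = "".join(NODES[n] for n in ["n1", "n2", "n4", "n6"])
    print("graph :", align_to_graph(read))
    print("linear:", fitting_distance(read, linear_ref))
```

The read in the usage example matches one graph path exactly but incurs edits against the linear reference path, which is the kind of reference bias the graph is meant to reduce. A practical aligner would avoid path enumeration, for instance by running the dynamic program directly over the graph.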