Using compressed data structures to capture variation in thousands of human genomes.
S. A. McCarthy (1), Z. Lui (1), J. T. Simpson (2), Z. Iqbal (3), T. M. Keane (1), R. Durbin (1)
1) Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, United Kingdom; 2) Ontario Institute for Cancer Research, Toronto, Ontario, Canada; 3) Wellcome Trust Centre for Human Genetics, Oxford, United Kingdom.
Currently, the most widely used approach to cataloguing variation amongst a set of samples is to align the sequencing reads to a single linear reference genome. This principle has been at the core of the 1000 Genomes data processing pipeline since the pilot phase of the project. However, there is now increased awareness of the limitations of this approach, such as alignment artefacts, reference bias and unobserved variation on non-reference haplotypes. The Burrows-Wheeler transform (BWT) and FM-index are compact data structures that have been successfully used in sequence alignment and assembly. A key feature of these structures is that they provide a searchable, reference-free representation of the raw sequencing reads. Our project aims to build a web server based on BWT data structures containing all the reads from many thousands of samples, so that matching reads and information about their samples and populations can be retrieved efficiently. Enticingly, we expect the data storage required by this system to plateau as more data are collected, since most new sequencing reads will already have been observed. We expect this to enable powerful new ways to query variation data from thousands of individuals. For the first phase of this project, we include all 87 Tbp of the low-coverage and exome data from the 2,535 samples in 1000 Genomes Phase 3. We envisage that this will give researchers a means to easily check the prevalence of any human sequence in a control set of thousands of putatively healthy samples. We present our approaches and initial benchmarks of variant sensitivity and specificity against truth datasets, and explore several applications of these structures, such as validation of short insertion/deletion and structural variant calls and rapid searching for traces of viral DNA.
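
To illustrate what a searchable, reference-free representation of reads means in practice, the following is a minimal sketch of FM-index backward search over a toy read set. It is not the implementation behind the project described here: the function names `build_fm_index` and `count_occurrences` are hypothetical, the suffix array is built by naive sorting (impractical at the scale of terabases), the occurrence table is stored uncompressed, and reverse-complement matches are ignored.

```python
from collections import Counter


def build_fm_index(reads):
    """Build a toy FM-index over reads joined with '$' terminators.

    Assumption: reads contain only A, C, G, T, so '$' sorts before every base.
    """
    text = "".join(read + "$" for read in reads)
    n = len(text)
    # Naive suffix array: sort suffix start positions by the suffix itself.
    suffix_array = sorted(range(n), key=lambda i: text[i:])
    # BWT character = character preceding each sorted suffix (wraps at 0).
    bwt = "".join(text[i - 1] for i in suffix_array)

    # first_row[c] = number of characters in the text strictly smaller than c.
    counts = Counter(text)
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    # occ[c][i] = number of occurrences of c in bwt[:i] (uncompressed).
    occ = {c: [0] * (n + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first_row, occ, n


def count_occurrences(index, pattern):
    """Count occurrences of pattern (no '$', non-empty) across all reads."""
    first_row, occ, n = index
    lo, hi = 0, n  # half-open interval of suffixes matching the current suffix of pattern
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + occ[c][lo]
        hi = first_row[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGA", "TTACGT"]
    index = build_fm_index(toy_reads)
    for query in ["ACG", "CGT", "GGG"]:
        print(query, count_occurrences(index, query))
```

The query cost depends on the length of the pattern rather than on the number of reads stored, which is the property that makes this family of structures attractive for checking the prevalence of a sequence across many samples. A production system would additionally need compressed occurrence tables and a mapping from matches back to samples, neither of which is shown here.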