Exploring Genetic Variation and Genotypes Among Millions of Genomes. R. M. Layer, A. R. Quinlann University of Virginia, Charlottesville, VA.

   Rare, and thus largely unknown, variants are a major reason that, typically, less than 10% of the heritability of complex diseases currently can be explained by known genetic variation. While increasing the number of sequenced genomes may improve our ability to reveal this hidden heritability, the scale of the resulting dataset poses substantial storage and computational demands. Current efforts to sequence 100,000 genomes, and combined efforts that are likely to surpass 1 million genomes will identify hundreds of millions to billions of polymorphic loci. The minimum storage requirement for directly representing the variability found by these projects (1 bit per individual per variant, ignoring the necessary metadata) will range from terabytes to petabytes. Like most big-data problems, a balance must be found between optimizing storage and computational efficiency. For example, while compression can minimize storage by reducing file size, it can also cause inefficient computation since data must be decompressed before it can be analyzed. Conversely, highly structured data can reduce analysis times but typically require extra metadata that increase file size. Current variation storage schemes were not designed to quickly analyze massive datasets and fail to balance these competing goals. We present GENOTQ, an open source API and toolkit that reduces file size and data access time through use of a succinct data structure, a class of data structures that compress data such that operations can be performed without requiring the full decompression. Word aligned hybrid (WAH) bitmap compression is one such data structure that was developed to improve query times for relational databases. Binary values are encoded such that logical operations (AND, OR, NOT) can be performed on the compressed data. This encoding results in file sizes that are 20X smaller than uncompressed versions, and only 50% larger than the compressed version. Queries, such as finding shared variants among a subpopulation, are also 21X faster. Furthermore, representing the genotypes in this manner makes our method well suited to both distributed architectures like BigQuery and parallel processors like GPUs. We stress that this method is only part of a larger solution that would incorporate genomic annotations, medical histories, and pedigrees. Incorporating fast genotype queries with this web of metadata will provide a rich information source to both clinicians and researchers.

You may contact the first author (during and after the meeting) at