SpeedSeq: A 24-hour alignment, variant calling, and genome interpretation pipeline.
C. Chiang1, R. M. Layer2, G. G. Faust1, M. R. Lindberg1, A. R. Quinlan1,3,4, I. M. Hall1,3,4
1) Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908; 2) Computer Science, University of Virginia, Charlottesville, VA 22903; 3) Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22904; 4) Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22908.
Bioinformatic turnaround time is currently a major obstacle to clinical adoption of genome sequencing technologies. For many whole-genome sequencing applications, such as cancer genotyping or newborn diagnosis, the clinically actionable timeframe is days or weeks. While the time required to generate whole-genome sequencing reads has been reduced from ~2 weeks to ~3 days, bioinformatic analysis remains a major challenge, typically requiring weeks or months and extensive hands-on involvement to go from raw DNA sequence data to causal variants. We set out to systematically reduce bioinformatic turnaround time and simplify variant interpretation without sacrificing accuracy. To this end, we present SpeedSeq, a rapid and comprehensive pipeline for characterizing and prioritizing genetic variation in human genomes. We show that our ultra-fast pipeline produces high-quality SNV and indel calls with specificity and sensitivity on par with current standards. In a paired tumor/normal analysis, SpeedSeq achieves high recall and precision even for subclonal variants, and near-perfect recall of orthogonally validated mutations in tumors from The Cancer Genome Atlas (TCGA). We further show that our structural variation detection approach (LUMPY) significantly outperforms other available tools, and we have validated variant detection power against available gold standards from the 1000 Genomes Project and the Genome in a Bottle Consortium. Additionally, updates to the GEMINI framework make it possible to identify actionable mutations in a clinically relevant timeframe with minimal human involvement. In under 24 hours, our approach moves raw sequence data to fully processed variant calls with genetic implications. Our pipeline is composed entirely of free, open-source software tools, including BWA, Samblaster, Sambamba, BEDTools, Freebayes, LUMPY, SnpEff, GEMINI, and DGIdb, as well as several new data processing tools designed to greatly increase speed.
SpeedSeq is available on GitHub at https://github.com/cc2qe/speedseq.
You may contact the first author (during and after the meeting) at