Implementing a High Performance, Reusable Consensus Calling Pipeline for Next Generation Sequencing using Globus Genomics. R. K. Madduri (1), A. Rodriguez (1), V. Trubetskoy (2), L. K. Davis (2), P. J. Dave (1), N. J. Cox (2), I. T. Foster (1). 1) Computation Institute, University of Chicago, Chicago, IL; 2) Section of Genetic Medicine, University of Chicago, Chicago, IL.
We developed Globus Genomics (http://globus.org/genomics/), an end-to-end hosted service designed to analyze large quantities of Next Generation Sequencing (NGS) data efficiently and easily using state-of-the-art algorithms, efficient data management tools, a graphical web-based workflow environment, and on-demand computing infrastructure. Globus Genomics leverages a collection of existing cloud-based services; users can nonetheless build new analysis workflows from scratch. Users can analyze large amounts of data using computationally efficient analytical pipelines and cutting-edge tools that leverage the power and flexibility of on-demand cloud computing resources without being exposed to the complexities of managing large-scale infrastructure; deploying and configuring analysis tools; transferring data between sequencers, analysis nodes, and storage systems; or managing their own users and groups. To this end, we use elastic computational infrastructure provided by Amazon Web Services and the Condor scheduler to manage a dynamically assembled pool of hosts. We outsource high-performance data transfer and user, group, and credential management to Globus Online, a platform-as-a-service (PaaS) provider also developed and operated by our team. Finally, we host a Galaxy workflow system to enable easy-to-use graphical workflow orchestration. We created computational profiles for multiple variant calling and genotyping algorithms available for academic use (i.e., the GATK2.0, Atlas2.0, and FreeBayes toolkits). These profiles enable high-performance, scalable execution of algorithms on hundreds of raw data sets. We built reusable, robust pipelines using the computational modalities best suited to the underlying analysis. The resulting variant calls from each pipeline can then be fed to a consensus-calling algorithm (Consensus Genotyper for Exome Sequencing, CGES; see Trubetskoy et al., ASHG 2013), resulting in high-quality variant and genotype calls.
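To illustrate the idea of combining calls from several pipelines, the following is a minimal sketch of agreement-based voting across callers. It is not the CGES algorithm, whose details are not given here; the function names, the `min_agree` threshold, the site keys, and the example genotypes are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def consensus_genotype(votes, min_agree=2):
    """Return the genotype reported by at least `min_agree` callers, else None.

    `votes` maps a caller name to a genotype string (e.g. "0/1"), or to None
    when that caller made no call at the site. Illustrative rule only; this
    is an assumption, not the CGES method.
    """
    observed = [g for g in votes.values() if g is not None]
    if not observed:
        return None
    genotype, count = Counter(observed).most_common(1)[0]
    return genotype if count >= min_agree else None


def consensus_calls(per_caller, min_agree=2):
    """Combine per-caller call sets into one consensus call set.

    `per_caller` maps a caller name to a dict of site -> genotype, where a
    site is a hypothetical (chrom, pos, ref, alt) tuple.
    """
    # Consider every site reported by at least one caller.
    sites = set()
    for calls in per_caller.values():
        sites.update(calls)

    result = {}
    for site in sorted(sites):
        votes = {caller: calls.get(site) for caller, calls in per_caller.items()}
        genotype = consensus_genotype(votes, min_agree)
        if genotype is not None:
            result[site] = genotype
    return result


if __name__ == "__main__":
    # Made-up example calls, not real results.
    example = {
        "gatk": {("chr1", 100, "A", "G"): "0/1", ("chr1", 200, "C", "T"): "1/1"},
        "atlas2": {("chr1", 100, "A", "G"): "0/1", ("chr1", 200, "C", "T"): "0/1"},
        "freebayes": {("chr1", 100, "A", "G"): "0/1", ("chr1", 300, "G", "A"): "0/1"},
    }
    print(consensus_calls(example))
```

In this sketch a site is kept only when at least two of the three callers report the same genotype; sites with discordant or single-caller genotypes are dropped. A production consensus caller would also need to weigh call quality and missingness, which this example ignores.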
We have run these three pipelines in parallel, calling variants on over a hundred raw BAM files over the course of three days. The Atlas2.0 and GATK pipelines took a little over two days to finish execution, while the FreeBayes pipeline took a little over three days. In conclusion, we present the workings of Globus Genomics, a robust, powerful, and user-friendly suite of tools for NGS analysis, empowering geneticists and enabling translational discovery relevant to human disease.