DeepVariant-on-Spark

DeepVariant-on-Spark is a germline short variant calling pipeline that runs Google DeepVariant on Apache Spark at scale.

Why DeepVariant-on-Spark

DeepVariant is highly accurate. In 2016, DeepVariant won the PrecisionFDA Truth Challenge in the best SNP performance category.
Apache Spark is a lightning-fast unified analytics engine for large-scale data processing. Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
DeepVariant (v0.7) does not support multiple GPUs. With DeepVariant-on-Spark, all GPU resources can be fully utilized across multiple nodes. For example, a single NVIDIA DGX-1 has 8 Tesla V100 GPUs.
DeepVariant-on-Spark leverages Atgenomix SeqPiper, a wrapper technology that uses Spark PipeRDD, to parallelize the DeepVariant pipeline on Spark, and it uses YARN to optimize resource allocation in a multi-node environment.
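
The pipe pattern behind a PipeRDD can be illustrated without Spark: each partition of records is written to the standard input of an external process, and the lines that process prints become the output records. The sketch below is a standard-library Python approximation of that idea only. It is not SeqPiper or Spark code, the function names (`partition`, `pipe_partition`, `pipe_all`) are hypothetical, and the stand-in command simply upper-cases its input where a real pipeline would run a DeepVariant stage.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a per-partition tool: upper-cases each stdin line.
# This is an assumption for illustration; it is not a DeepVariant command.
STAND_IN_COMMAND = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())",
]


def partition(records, num_partitions):
    """Split records into at most num_partitions contiguous chunks."""
    if not records:
        return []
    size = -(-len(records) // num_partitions)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def pipe_partition(records, command=STAND_IN_COMMAND):
    """Write one partition to the command's stdin and return its stdout lines."""
    result = subprocess.run(
        command,
        input="".join(r + "\n" for r in records),
        capture_output=True,
        text=True,
        check=True,  # raise if the external command exits with an error
    )
    return result.stdout.splitlines()


def pipe_all(records, num_partitions=4, command=STAND_IN_COMMAND):
    """Run one external process per partition in parallel, keeping partition order."""
    parts = partition(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        outputs = list(pool.map(lambda p: pipe_partition(p, command), parts))
    return [line for out in outputs for line in out]


if __name__ == "__main__":
    regions = ["chr1:1-1000", "chr1:1001-2000", "chr2:1-1000"]
    print(pipe_all(regions, num_partitions=2))
```

On a cluster, Spark and YARN take over the roles played here by the thread pool and the local process launcher, scheduling each partition's external process on a node with free resources.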

Interested in contributing? See CONTRIBUTING.

DeepVariant-on-Spark is licensed under the terms of the Apache 2.0 License.

DeepVariant-on-Spark happily makes use of many open source packages. We'd like to specifically call out a few key ones:

We thank all of the developers and contributors to these packages for their work.

This is not an official Atgenomix product.
For the official product and the full experience, please contact Atgenomix (info@atgenomix.com).