DeepVariant-on-Spark

DeepVariant-on-Spark is a germline short variant calling pipeline that runs Google DeepVariant on Apache Spark at scale.

Why DeepVariant-on-Spark

  • DeepVariant is highly accurate. In 2016, DeepVariant won the PrecisionFDA Truth Challenge in the best SNP Performance category.
  • Apache Spark is a lightning-fast unified analytics engine for large-scale data processing. Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
  • DeepVariant (v0.7) does not support multiple GPUs. With DeepVariant-on-Spark, all GPU resources can be fully utilized across multiple nodes, for example the 8 Tesla V100 GPUs in an NVIDIA DGX-1.
  • DeepVariant-on-Spark leverages Atgenomix SeqPiper, a wrapper technology built on Spark PipeRDD, to parallelize the DeepVariant pipeline on Spark and to use YARN to optimize resource allocation in a multi-node environment.
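
The pipe-per-partition idea behind a Spark PipeRDD can be illustrated without a cluster. The sketch below is not SeqPiper code and does not use Spark: it is a plain-Python stand-in in which records are split into partitions, each partition is streamed line by line to an external command's stdin, and the command's stdout lines become the output records. The function names (`split_into_partitions`, `pipe_partition`, `pipe_all`), the genomic region strings, and the upper-casing command are all hypothetical, chosen only for illustration. YARN scheduling and GPU assignment are not modeled.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists (hypothetical helper)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def pipe_partition(command, partition):
    """Send one partition to an external command, one record per stdin line,
    and return the command's stdout lines as the output records."""
    stdin_text = "".join(record + "\n" for record in partition)
    result = subprocess.run(
        command, input=stdin_text, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def pipe_all(command, records, num_partitions=4):
    """Run one external process per partition concurrently and concatenate
    the outputs in partition order."""
    partitions = split_into_partitions(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        outputs = pool.map(lambda part: pipe_partition(command, part), partitions)
        return [line for lines in outputs for line in lines]


if __name__ == "__main__":
    # Stand-in for a real per-region tool: upper-case every input line.
    upper = [
        sys.executable,
        "-c",
        "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
    ]
    regions = ["chr1:1-1000", "chr1:1001-2000", "chr2:1-1000", "chr2:1001-2000"]
    print(pipe_all(upper, regions, num_partitions=2))
```

In an actual Spark job the same pattern would be expressed through the RDD `pipe` operation, with Spark deciding where each partition's process runs. The sketch only shows why an unmodified command-line tool can be parallelized this way: each process sees an ordinary stdin stream and knows nothing about the other partitions.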

Documentation

Dependencies

Quick start and Case studies

Contributing

Interested in contributing? See CONTRIBUTING.

License

DeepVariant-on-Spark is licensed under the terms of the Apache 2.0 License.

Acknowledgements

DeepVariant-on-Spark happily makes use of many open source packages. We'd like to specifically call out a few key ones:

We thank all of the developers and contributors to these packages for their work.

Disclaimer

  • This is not an official Atgenomix product.
  • For the full experience of the official product, please contact Atgenomix (info@atgenomix.com).