GPU infrastructure and automation tools
The DeepOps project facilitates deployment of GPU servers and multi-node GPU clusters for Deep Learning and HPC environments, in an on-prem, optionally air-gapped datacenter or in the cloud.
Use the provided Ansible playbooks and scripts to deploy Kubernetes, Slurm, or a hybrid of both. This repository encapsulates best practices to make your life easier, but it can also be adapted or used in a modular fashion to suit your specific cluster needs. For example: if your organization already has Kubernetes deployed to a cluster, you can still use the optional services and scripts provided to install Kubeflow, enable authentication, or connect NFS storage.
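Deployments are driven by an Ansible inventory that maps your hosts into roles. The sketch below shows what a minimal inventory for a hybrid cluster might look like; the hostnames and IP addresses are placeholders, and the exact group names may differ between DeepOps releases, so treat this as illustrative rather than copy-paste ready.

```ini
# Hypothetical inventory sketch -- hostnames/IPs are placeholders
[all]
mgmt01  ansible_host=10.0.0.10
gpu01   ansible_host=10.0.0.20
gpu02   ansible_host=10.0.0.21

# Kubernetes control plane and workers
[kube-master]
mgmt01

[kube-node]
gpu01
gpu02

# Slurm controller and compute nodes
[slurm-master]
mgmt01

[slurm-node]
gpu01
gpu02
```

With an inventory in place, individual playbooks can be run against only the groups they target, which is what makes the modular, "adopt only what you need" usage possible.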
NOTE: we recommend using the most recent release branch for stable code. The `master` branch is used for development and as such may be unstable or even broken at any point in time.
Pick one of the deployment options below if you know what kind of cluster you want. If you feel lost, read through our Getting Started Guide.
Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications.
Consult our Kubernetes Guide to build a GPU-enabled Kubernetes cluster.
For more information on Kubernetes in general, refer to the official Kubernetes docs.
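Once a GPU-enabled Kubernetes cluster is up, a common smoke test is to schedule a pod that requests a GPU through the `nvidia.com/gpu` resource (exposed by the NVIDIA device plugin) and runs `nvidia-smi`. The manifest below is a minimal sketch; the container image tag is an example and the pod name is arbitrary.

```yaml
# Minimal GPU smoke-test pod (illustrative; image tag is an example)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

If scheduling succeeds and the pod logs show the GPU listed by `nvidia-smi`, GPU support is working end to end.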
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
Consult our Slurm Guide to build a GPU-enabled Slurm cluster.
For more information on Slurm in general, refer to the official Slurm docs.
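On a working GPU-enabled Slurm cluster, GPUs are typically requested through the generic resource (GRES) mechanism. The batch script below is a minimal sketch of such a request; it assumes the cluster's Slurm configuration defines a `gpu` GRES type, and the job name and time limit are arbitrary examples.

```bash
#!/bin/bash
# Hypothetical smoke-test job: request one GPU and print its status.
# Assumes the cluster defines a "gpu" GRES type in its Slurm config.
#SBATCH --job-name=gpu-test
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00

srun nvidia-smi
```

Submitting this with `sbatch` and checking the job's output file is a quick way to confirm that GPU scheduling works before running real workloads.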
A hybrid cluster with both Kubernetes and Slurm can also be deployed. This is recommended for DGX Pod deployments and other setups that want to make maximal use of the cluster.
Consult our DGX Pod Guide for step-by-step instructions on setting up a hybrid cluster.
For more information on deploying DGX in the datacenter, consult the DGX Data Center Reference Design Whitepaper.
We often don't have a full cluster at our disposal, or we want to try DeepOps before deploying it on an actual cluster. For this purpose, a virtualized version of DeepOps may be deployed on a single node. This is useful for testing, adding new features, or configuring DeepOps to meet your specific needs.
Consult our Virtual Guide to deploy a virtual cluster with DeepOps.
To update your cluster from a previous version of DeepOps to a newer release, please consult the Update Guide.
This project is released under the BSD 3-clause license.
A signed copy of the Contributor License Agreement needs to be provided to deepops@nvidia.com before any change can be accepted.
- Found a problem? Please let us know by filing a new issue.
- Want to contribute? You can do so by opening a pull request.