Tools to deploy GPU clusters in the Cloud

Primary language: HCL. License: Apache License 2.0 (Apache-2.0).

NEPHELE

Prerequisites

Install enroot

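# Refresh the package index and detect the Debian architecture (e.g. amd64)
sudo apt update -y
arch=$(dpkg --print-architecture)
echo "$arch"

# Download the enroot packages for that architecture, install them, then clean up
curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.4.1/enroot_3.4.1-1_${arch}.deb
curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.4.1/enroot+caps_3.4.1-1_${arch}.deb
sudo apt install -y ./enroot*.deb
rm -f ./enroot*.deb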

Install terraform and ansible

curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt update
sudo apt install -y build-essential terraform ansible

Check installation

terraform --version
ansible --version
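
The version checks above cover terraform and ansible only. As a minimal sketch, the following reports any of the three prerequisites that is missing from PATH; `missing_tools` is a hypothetical helper name, not part of nephele.

```shell
# Hypothetical helper: print the name of every given command not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools enroot terraform ansible)
if [ -n "$missing" ]; then
    # Word splitting is intentional here so the names print on one line.
    echo "Missing prerequisites:" $missing >&2
else
    echo "All prerequisites found."
fi
```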

If using Azure, also install the Azure CLI (see https://docs.microsoft.com/en-us/cli/azure/install-azure-cli):

curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az login
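
After logging in, it can help to confirm which subscription is active before any resources are created. A minimal sketch, assuming the standard `az account show` command; `azure_ready` is a hypothetical helper name, not part of nephele.

```shell
# Hypothetical helper: succeed only if the Azure CLI is installed and logged in.
azure_ready() {
    if ! command -v az >/dev/null 2>&1; then
        echo "Azure CLI (az) not found on PATH." >&2
        return 1
    fi
    # Prints the active subscription name; fails if there is no login session.
    az account show --query name -o tsv
}
```

One possible use is `azure_ready || az login`, which only prompts for a login when no session is active.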

Setup

Cloning the repo

Remember to include the recursive flag for submodules.

git clone --recursive https://github.com/NVIDIA/nephele.git
cd nephele
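
If the repository was cloned without the recursive flag, the submodules can still be fetched afterwards. A minimal sketch using standard git commands; `init_submodules` is a hypothetical helper name, and its argument is whatever path the clone lives at.

```shell
# Hypothetical helper: fetch submodules that a non-recursive clone left uninitialized.
# Takes the path to the checkout; defaults to the current directory.
init_submodules() {
    repo_dir=${1:-.}
    git -C "$repo_dir" submodule update --init --recursive
}
```

Pass the path to the checkout, or run it with no argument from inside the repository. `git submodule status` marks any submodule that is still uninitialized with a leading `-`.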

Edit cluster configuration

vi nephele.conf

One-time setup

export CONTAINERIZED_BUILD=1
./nephele init
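
Because the nephele commands are run from the checkout and read its configuration file, a quick pre-flight check can catch the case of running from the wrong directory. A minimal sketch; `preflight` is a hypothetical helper, and it assumes, as the steps in this README suggest, that the `nephele` script and `nephele.conf` both sit in the repository root.

```shell
# Hypothetical helper: return 0 when the directory looks like a usable checkout.
# Takes the directory to check; defaults to the current directory.
preflight() {
    dir=${1:-.}
    [ -x "$dir/nephele" ] || {
        echo "nephele script not found or not executable in $dir" >&2
        return 1
    }
    [ -f "$dir/nephele.conf" ] || {
        echo "nephele.conf not found in $dir" >&2
        return 1
    }
}
```

It can be chained in front of a command, for example `preflight && ./nephele create`.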

Create the cluster

./nephele create

Connect to the cluster

Headnode:

./nephele connect

Specific compute node, e.g. x8a100-0000:

./nephele connect x8a100-0000

Destroy the cluster

./nephele destroy