kubectl-nv

The kubectl-nv plugin is a tool for managing NVIDIA objects on a Kubernetes cluster. It is built on the kubectl plugin mechanism (https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/): once the kubectl-nv executable is installed in a directory on the user’s $PATH, it becomes available from the kubectl command line as “kubectl nv”.
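The discovery rule can be illustrated with a minimal sketch: kubectl treats any executable on $PATH whose file name starts with kubectl- as a plugin. The stub script and temporary directory below are hypothetical stand-ins for the real kubectl-nv binary and its install location, used only to show how the name is resolved.

```shell
#!/bin/sh
# Hypothetical stand-in for the real kubectl-nv binary, for illustration only.
stub_dir="$(mktemp -d)"
cat > "$stub_dir/kubectl-nv" <<'EOF'
#!/bin/sh
echo "kubectl-nv stub invoked with: $*"
EOF
chmod +x "$stub_dir/kubectl-nv"

# kubectl discovers plugins by scanning $PATH for executables named kubectl-*.
PATH="$stub_dir:$PATH"
export PATH

# The plugin can always be called directly by its file name.
plugin_output="$(kubectl-nv adm must-gather --help)"
echo "$plugin_output"

# If kubectl is installed, the same executable is reachable as "kubectl nv".
if command -v kubectl >/dev/null 2>&1; then
    kubectl plugin list || true
    kubectl nv adm must-gather --help
fi
```

With the real binary in place of the stub, the same two checks (a direct call and `kubectl plugin list`) are a quick way to confirm that the installation directory is on $PATH.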

Currently, when users hit a malfunction on their Kubernetes cluster, they are directed to the documentation, which asks them to run several commands. If those commands fail to expose the root cause, users are asked to run a must-gather bash script and send the resulting files to support for analysis. With the kubectl nv plugin, users can troubleshoot GPU nodes and clusters from a single command-line tool with customizable options. This makes local troubleshooting easier and, when needed, produces the same files that the must-gather script used to generate.

The kubectl nv plugin will provide the following troubleshooting commands:

adm

must-gather

kubectl nv adm must-gather --help
NAME:
   kubectl-nv adm must-gather - collects the information from your cluster that is most likely needed for debugging issues

USAGE:
   kubectl-nv adm must-gather [command options] [arguments...]

OPTIONS:
   --kubeconfig value, -k value  path to kubeconfig file (default: "-") [$KUBECONFIG]
   --artifacts-dir value         path to the directory where the artifacts will be stored. Defaults to /tmp/nvidia-gpu-operator_<timestamp> [$ARTIFACT_DIR]
   --help, -h                    show help
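
Based on the options listed above, an invocation might look like the following sketch. The kubeconfig path and the artifacts directory are assumed example values, not defaults of the tool, and the script runs the plugin only if both kubectl and kubectl-nv are found on $PATH.

```shell
#!/bin/sh
# Assumed example values; adjust them for your environment.
kubeconfig="${KUBECONFIG:-$HOME/.kube/config}"
artifacts_dir="${ARTIFACT_DIR:-/tmp/nvidia-gpu-operator_debug}"

# Show the command that will be run, using only the flags from the help output.
cmd="kubectl nv adm must-gather --kubeconfig $kubeconfig --artifacts-dir $artifacts_dir"
echo "$cmd"

# Run it only when both kubectl and the plugin are installed.
if command -v kubectl >/dev/null 2>&1 && command -v kubectl-nv >/dev/null 2>&1; then
    kubectl nv adm must-gather --kubeconfig "$kubeconfig" --artifacts-dir "$artifacts_dir"
fi
```

Because the help output lists $KUBECONFIG and $ARTIFACT_DIR as environment variables for the two flags, exporting those variables should work as an alternative to passing the flags explicitly.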
