kubectl-nv is a kubectl plugin for managing NVIDIA objects on a Kubernetes cluster. It follows the standard kubectl plugin mechanism (https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/): once the kubectl-nv binary is installed in a directory on the user's $PATH, it becomes part of the kubectl command line and is invoked as "kubectl nv".
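The discovery mechanism can be illustrated with a short Python sketch. kubectl treats an executable on $PATH whose name starts with kubectl- as a plugin, so "kubectl nv" resolves to a binary named kubectl-nv. The helper name find_kubectl_plugin is hypothetical and only approximates that lookup; it is not part of kubectl or of the plugin.

```python
import shutil
from typing import Optional


def find_kubectl_plugin(name: str, path: Optional[str] = None) -> Optional[str]:
    """Return the executable kubectl would run for `kubectl <name>`, or None.

    Hypothetical helper: it approximates kubectl's plugin lookup by searching
    for an executable called `kubectl-<name>` in the given PATH string
    (or in the process's PATH when `path` is None).
    """
    return shutil.which(f"kubectl-{name}", path=path)


if __name__ == "__main__":
    location = find_kubectl_plugin("nv")
    if location is None:
        print("kubectl-nv not found on PATH; `kubectl nv` will not be available")
    else:
        print(f"`kubectl nv` resolves to {location}")
```

On a real installation, running "kubectl plugin list" performs the same check and reports every plugin kubectl can see.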
Currently, when users encounter a malfunction on their Kubernetes cluster, they are directed to the documentation, which asks them to run several commands by hand. If those commands fail to expose the root cause, they are asked to run a must-gather bash script and send the resulting files to support for analysis. With the kubectl nv plugin, users will be able to troubleshoot GPU nodes and clusters from a single command-line tool with customizable options. This makes local troubleshooting easier and, when needed, lets them produce the same files previously generated by the must-gather script.
The kubectl plugin will provide the following troubleshooting commands:
kubectl nv adm must-gather --help
NAME:
   kubectl-nv adm must-gather - collects the information from your cluster that is most likely needed for debugging issues
USAGE:
   kubectl-nv adm must-gather [command options] [arguments...]
OPTIONS:
   --kubeconfig value, -k value  path to kubeconfig file (default: "-") [$KUBECONFIG]
   --artifacts-dir value         path to the directory where the artifacts will be stored. Defaults to /tmp/nvidia-gpu-operator_<timestamp> [$ARTIFACT_DIR]
   --help, -h                    show help
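As a usage illustration, the following Python sketch assembles a must-gather invocation from the options listed in the help text above. Only the --kubeconfig and --artifacts-dir flags come from that help output; the function name build_must_gather_cmd and the example path are hypothetical.

```python
from typing import List, Optional


def build_must_gather_cmd(kubeconfig: Optional[str] = None,
                          artifacts_dir: Optional[str] = None) -> List[str]:
    """Build the argument list for `kubectl nv adm must-gather`.

    Options left as None are omitted, so the plugin falls back to its own
    defaults (or to $KUBECONFIG / $ARTIFACT_DIR, per the help text).
    """
    cmd = ["kubectl", "nv", "adm", "must-gather"]
    if kubeconfig is not None:
        cmd += ["--kubeconfig", kubeconfig]
    if artifacts_dir is not None:
        cmd += ["--artifacts-dir", artifacts_dir]
    return cmd


if __name__ == "__main__":
    # The directory below is a placeholder, not a default of the plugin.
    cmd = build_must_gather_cmd(artifacts_dir="/tmp/gpu-debug")
    print(" ".join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where kubectl and the plugin are installed. Building the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces.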