Help needed configuring oci nvidia hook
pozsa opened this issue · 3 comments
Hello Team,
Could you help me out with the following error? What am I doing wrong? Thank you.
Error
$ sarus run nvidia/cuda:10.0-base nvidia-smi
ERRO[0000] container_linux.go:349: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH"
container_linux.go:349: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH"
Hook configuration
$ cat /opt/sarus/1.3.0-Release/etc/hooks.d/oci-nvidia-hook.json
{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/bin/nvidia-container-toolkit",
    "args": ["nvidia-container-toolkit", "prestart"],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ]
  },
  "when": {
    "always": true,
    "commands": [".*"]
  },
  "stages": ["prestart"]
}
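As a quick sanity check (my own suggestion, not a Sarus requirement), the hook file can be run through Python's json.tool to rule out JSON syntax errors before Sarus parses it. The inline copy below stands in for the real file path (e.g. /opt/sarus/1.3.0-Release/etc/hooks.d/oci-nvidia-hook.json):

```shell
# Validate the hook config as JSON; prints "hook JSON OK" on success.
# An inline here-doc copy is used here so the check is self-contained;
# in practice, redirect the real file into json.tool instead.
cat <<'EOF' | python3 -m json.tool >/dev/null && echo "hook JSON OK"
{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/bin/nvidia-container-toolkit",
    "args": ["nvidia-container-toolkit", "prestart"],
    "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"]
  },
  "when": { "always": true, "commands": [".*"] },
  "stages": ["prestart"]
}
EOF
```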
$ which nvidia-smi
/usr/bin/nvidia-smi
$ which nvidia-container-toolkit
/usr/bin/nvidia-container-toolkit
$ sudo docker run --rm -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:10.0-base nvidia-smi
Thu Oct 22 16:11:40 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44 Driver Version: 440.44 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K2000 Off | 00000000:07:00.0 Off | N/A |
| 30% 38C P0 N/A / N/A | 0MiB / 1999MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Hello @pozsa,
your oci-nvidia-hook.json seems fine, so I would first check whether Sarus is scheduling the NVIDIA hook to be executed by the OCI runtime.
In the Sarus debug output (sarus --debug run [...]) you should find two groups of log entries, the first one related to the acquisition of the hook JSON file, e.g.:
[382123.468065307] [hostname-123456] [runtime] [INFO] Creating OCI hook object from "/opt/sarus/1.3.0-Release/etc/hooks.d/oci-nvidia-hook.json"
[382123.469203591] [hostname-123456] [runtime] [DEBUG] Created OCI Hook's "always" condition (true)
[382123.469215194] [hostname-123456] [runtime] [INFO] Successfully created OCI hook object
and the second one related to the evaluation of the when conditions, which determine whether the hook will be included in the OCI bundle's config.json, e.g.:
[382123.478463933] [hostname-123456] [runtime] [INFO] Evaluating "when" conditions of OCI Hook "/opt/sarus/1.3.0-Release/etc/hooks.d/oci-nvidia-hook.json"
[382123.478471396] [hostname-123456] [runtime] [DEBUG] OCI Hook's "always" condition evaluates "true"
[382123.478477613] [hostname-123456] [runtime] [INFO] OCI Hook is active
In your case, you should also find entries related to the evaluation of the commands condition.
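To avoid scrolling through the full debug output, one option is to filter for the hook-related entries (the grep pattern is my own convenience, not a Sarus convention):

```shell
# Run the container with debug logging and keep only the lines about
# OCI hooks; adjust the pattern or hook filename as needed.
sarus --debug run nvidia/cuda:10.0-base nvidia-smi 2>&1 \
  | grep -E 'OCI [Hh]ook|oci-nvidia-hook\.json'
```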
If this all checks out, then I would proceed to verify why the hook is not being effective.
As mentioned in the documentation for the support of the NVIDIA Container Toolkit at runtime, Sarus relies on the value of CUDA_VISIBLE_DEVICES from the host to set the NVIDIA_VISIBLE_DEVICES environment variable, which in turn controls the actions of the NVIDIA hook.
This is done to work seamlessly with workload managers like Slurm, which set CUDA_VISIBLE_DEVICES but have no notion of NVIDIA_VISIBLE_DEVICES.
If CUDA_VISIBLE_DEVICES is not set on the host, Sarus will unset NVIDIA_VISIBLE_DEVICES, making the NVIDIA hook exit without carrying out any operation.
If CUDA_VISIBLE_DEVICES is set but the problem persists, then we'll need to dig deeper to understand what's happening.
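The host-to-container variable translation described above can be sketched in shell; this is an illustration of the documented behavior, not Sarus source code:

```shell
# If CUDA_VISIBLE_DEVICES is set on the host (even to an empty string),
# propagate its value to NVIDIA_VISIBLE_DEVICES; otherwise unset
# NVIDIA_VISIBLE_DEVICES so the NVIDIA hook exits without doing anything.
if [ -n "${CUDA_VISIBLE_DEVICES+set}" ]; then
    export NVIDIA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES"
else
    unset NVIDIA_VISIBLE_DEVICES
fi
echo "NVIDIA_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES-(unset)}"
```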
Hi @Madeeks ,
Thank you!
$ export CUDA_VISIBLE_DEVICES=all
solved the issue.
If CUDA_VISIBLE_DEVICES is not set in the host, Sarus will unset NVIDIA_VISIBLE_DEVICES, making the NVIDIA hook exit without carrying out any operation.

Wouldn't it make more sense to simply not modify NVIDIA_VISIBLE_DEVICES in this case, instead of unsetting it? Or maybe just set it to none?
Wouldn't it make more sense to simply not modify NVIDIA_VISIBLE_DEVICES in this case, instead of unsetting it? Or maybe just set it to none?
As explained in the official documentation for NVIDIA_VISIBLE_DEVICES, setting it to none would make no GPU accessible in the container, yet the NVIDIA driver libraries/binaries would still be mounted and enabled, according to the requested driver capabilities.
At the moment, we don't see much use in having the driver available without access to any GPU device, and we prefer to modify the container filesystem only when necessary, so Sarus unsets NVIDIA_VISIBLE_DEVICES to obtain a no-op from the NVIDIA Container Toolkit.
Closing, as the original issue has been solved.