eth-cscs/sarus

Help needed configuring OCI NVIDIA hook

pozsa opened this issue · 3 comments

pozsa commented

Hello Team,

Could you help me out with the following error? What am I doing wrong? Thank you.

Error

$ sarus run nvidia/cuda:10.0-base nvidia-smi
ERRO[0000] container_linux.go:349: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH"
container_linux.go:349: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH"

Hook configuration

$ cat /opt/sarus/1.3.0-Release/etc/hooks.d/oci-nvidia-hook.json
{
    "version": "1.0.0",
    "hook": {
        "path": "/usr/bin/nvidia-container-toolkit",
        "args": ["nvidia-container-toolkit", "prestart"],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
        ]
    },
    "when": {
        "always": true,
        "commands": [".*"]
    },
    "stages": ["prestart"]
}
$ which nvidia-smi
/usr/bin/nvidia-smi
$ which nvidia-container-toolkit
/usr/bin/nvidia-container-toolkit
$ sudo docker run --rm -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:10.0-base nvidia-smi
Thu Oct 22 16:11:40 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K2000        Off  | 00000000:07:00.0 Off |                  N/A |
| 30%   38C    P0    N/A /  N/A |      0MiB /  1999MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Madeeks commented

Hello @pozsa,
your oci-nvidia-hook.json seems fine, so I would first check whether Sarus is scheduling the NVIDIA hook to be executed by the OCI runtime.
In Sarus debug output (sarus --debug run [...]) you should find two groups of log entries. The first one relates to the acquisition of the hook JSON file, e.g.:

[382123.468065307] [hostname-123456] [runtime] [INFO] Creating OCI hook object from "/opt/sarus/1.3.0-Release/etc/hooks.d/oci-nvidia-hook.json"
[382123.469203591] [hostname-123456] [runtime] [DEBUG] Created OCI Hook's "always" condition (true)
[382123.469215194] [hostname-123456] [runtime] [INFO] Successfully created OCI hook object

and the second one relates to the evaluation of the "when" conditions, which determine whether the hook will be included in the OCI bundle's config.json, e.g.:

[382123.478463933] [hostname-123456] [runtime] [INFO] Evaluating "when" conditions of OCI Hook "/opt/sarus/1.3.0-Release/etc/hooks.d/oci-nvidia-hook.json"
[382123.478471396] [hostname-123456] [runtime] [DEBUG] OCI Hook's "always" condition evaluates "true"
[382123.478477613] [hostname-123456] [runtime] [INFO] OCI Hook is active

In your case, you should also find entries related to the evaluation of the "commands" condition.
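For reference, one quick way to isolate just the hook-related entries from the rather verbose debug output is to filter it with grep; the image and command below are simply taken from your example, any container command would do:

$ sarus --debug run nvidia/cuda:10.0-base nvidia-smi 2>&1 | grep -i hook

If none of the entries above show up at all, that would suggest Sarus is not picking up the JSON file from the hooks.d directory in the first place.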

If this all checks out, then I would proceed to verify why the hook is not taking effect.
As mentioned in the documentation on support for the NVIDIA Container Toolkit at runtime, Sarus relies on the value of CUDA_VISIBLE_DEVICES from the host to set the NVIDIA_VISIBLE_DEVICES environment variable, which in turn controls the actions of the NVIDIA hook.
This is done to work seamlessly with workload managers like Slurm, which set CUDA_VISIBLE_DEVICES but have no notion of NVIDIA_VISIBLE_DEVICES.
If CUDA_VISIBLE_DEVICES is not set in the host, Sarus will unset NVIDIA_VISIBLE_DEVICES, making the NVIDIA hook exit without carrying out any operation.
If CUDA_VISIBLE_DEVICES is set but the problem persists, then we'll need to dig deeper to understand what's happening.
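As a concrete sketch of the manual workaround for an interactive session (assuming a single-GPU node, so GPU index 0; in a Slurm job the variable is set for you when GPUs are allocated):

$ export CUDA_VISIBLE_DEVICES=0   # index 0 assumed for a single-GPU node
$ sarus run nvidia/cuda:10.0-base nvidia-smi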

pozsa commented

Hi @Madeeks,
Thank you!
Running export CUDA_VISIBLE_DEVICES=all solved the issue.

If CUDA_VISIBLE_DEVICES is not set in the host, Sarus will unset NVIDIA_VISIBLE_DEVICES, making the NVIDIA hook exit without carrying out any operation.

Wouldn't it make more sense to simply not modify NVIDIA_VISIBLE_DEVICES in this case, instead of unsetting it? Or maybe just set it to none?

Madeeks commented

Wouldn't it make more sense to simply not modify NVIDIA_VISIBLE_DEVICES in this case, instead of unsetting it? Or maybe just set it to none?

As explained in the official documentation for NVIDIA_VISIBLE_DEVICES, setting a value of none would make no GPU accessible in the container, yet the NVIDIA driver libraries/binaries would still be mounted and enabled, according to the requested driver capabilities.
At the moment we see little use in having the driver available without being able to access any GPU device, and we prefer to modify the container filesystem only when necessary; therefore Sarus unsets NVIDIA_VISIBLE_DEVICES so that the NVIDIA Container Toolkit performs a no-op.
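For comparison, under plain Docker (same setup as your test above) the none value would behave roughly as sketched below; the exact output is my assumption based on the NVIDIA Container Toolkit documentation, since the driver utilities are mounted but no devices are exposed:

$ sudo docker run --rm -e NVIDIA_VISIBLE_DEVICES=none nvidia/cuda:10.0-base nvidia-smi
No devices were found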

Closing, as the original issue has been solved.