buildkite-plugins/docker-buildkite-plugin

GPU parameter not passed to docker run (--gpus 1)

scapoor opened this issue · 3 comments

Pipeline steps file as configured on Buildkite:

steps:
  - label: "Testing Mnist on Docker from Containerized Agent"
    agents:
    - "gpu=true"
    command:
      - python mnist.py
    plugins:
      - docker#v3.3.0:
          image: "nvcr.io/nvidia/pytorch:18.05-py3"
          gpus: "1"

This pipeline does not add --gpus 1 to the docker run command. The pipeline succeeds, but the workload runs on the CPU rather than the GPU.

The command in the logs is as follows:

docker run -it --rm --init \
  --volume /var/lib/buildkite/builds/Obliex-1/ekkam/test-pipeline:/workdir \
  --workdir /workdir \
  --env BUILDKITE_JOB_ID --env BUILDKITE_BUILD_ID --env BUILDKITE_AGENT_ACCESS_TOKEN \
  --volume /usr/local/bin/buildkite-agent:/usr/bin/buildkite-agent \
  --label com.buildkite.job-id=2867eef0-13c0-4f29-ab43-a18cf31ca991 \
  nvcr.io/nvidia/pytorch:18.05-py3 \
  /bin/sh -e -c python\ mnist.py

Initially I thought there might be some issue with the agent container, but that was not the case. I ran the same docker run command manually with --gpus 1 added, and a new container was spawned with the GPU runtime and processed on the GPU. The command below, when run directly on the agent container, works as intended:

docker run -it --rm --gpus 1 --init \
  --volume /var/lib/buildkite/builds/Obliex-1/ekkam/test-pipeline:/workdir \
  --workdir /workdir \
  --env BUILDKITE_JOB_ID --env BUILDKITE_BUILD_ID --env BUILDKITE_AGENT_ACCESS_TOKEN \
  --volume /usr/local/bin/buildkite-agent:/usr/bin/buildkite-agent \
  --label com.buildkite.job-id=2867eef0-13c0-4f29-ab43-a18cf31ca991 \
  nvcr.io/nvidia/pytorch:18.05-py3 \
  /bin/sh -e -c python\ mnist.py

Surprisingly, this pipeline works (without the plugin):

steps:
  - label: "Nvidia-Benchmark"
    agents:
    - "gpu=true"
    command:
      - docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark

My guess is that the issue is either some incompatibility with the way Nvidia selects the runtime via the hooks associated with the --gpus argument, or something is broken in the plugin.
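
To rule out the host setup, the flag can also be exercised directly, outside of the plugin (a quick sketch; nvidia/cuda:11.0-base is just a small CUDA image, the same one used in the working pipeline further below):

# With --gpus the Nvidia runtime hook kicks in and the GPU should be listed
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

# Without --gpus the same image has no GPU access; nvidia-smi is injected by
# the Nvidia runtime rather than shipped in the image, so this should fail
docker run --rm nvidia/cuda:11.0-base nvidia-smi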

Buildkite-Agent Launch Command:

docker run -d --name buildkite-agent \
  -v "/mnt/d/DockerDataStore/buildkite/config/buildkite-agent.cfg:/buildkite/buildkite-agent.cfg:ro" \
  -v "/var/lib/buildkite/builds:/var/lib/buildkite/builds" \
  -v "/var/run/docker.sock:/var/run/docker.sock" \
  -v "/mnt/d/DockerDataStore/buildkite/hooks:/buildkite/hooks:ro" \
  -v "/mnt/d/DockerDataStore/buildkite/secrets:/buildkite-secrets:ro" \
  -v "/mnt/d/DockerDataStore/buildkite/plugins:/buildkite/plugins" \
  buildkite/agent

Buildkite-Agent Config:

token="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
name="Obliex-%n"
tags="ci=true,docker=true,queue=default,gpu=true"
git-clean-flags="-ffdqx"
debug=true
build-path="/var/lib/buildkite/builds"
hooks-path="/buildkite/hooks"
plugins-path="/buildkite/plugins"

The same issue occurs with a containerized Buildkite agent and Docker on both Windows (WSL2) and Ubuntu 20.04.
OS: Windows 10 Enterprise 21H2 19044.1706
Docker Desktop 4.8.2 (79419)
Docker version 20.10.12, build 20.10.12-0ubuntu4

Nvidia-Container-Runtime

nvidia-container-runtime --version
runc version 1.1.0-0ubuntu1
spec: 1.0.2-dev
go: go1.17.3
libseccomp: 2.5.3

Nvidia-Container-Cli

cli-version: 1.9.0
lib-version: 1.9.0
build date: 2022-03-18T13:46+00:00
build revision: 5e135c17d6dbae861ec343e9a8d3a0d2af758a4f
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Questions:

  1. Am I using the GPU syntax incorrectly in the pipeline's steps file?
  2. Is this a bug, or is something else going on that can't be seen in the logs?
  3. Is there a dependency on a specific Docker version or Nvidia driver that may be causing this issue?

After going through a ton of documentation, I figured out a workaround.
Documenting it here for anyone who faces the same problem.

What I did
I fixed it by setting nvidia-container-runtime as the default runtime in the Docker daemon config. The default-runtime line is normally missing (and not needed) when the Nvidia Container Runtime is set up, because container creation relies on the --gpus flag or the NVIDIA_VISIBLE_DEVICES environment variable to switch that container to the nvidia runtime.

# cat /etc/docker/daemon.json
{
"default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
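
For the new default runtime to take effect, the Docker daemon has to be restarted (a sketch for a systemd-based host; on Docker Desktop, restart it from the UI instead):

# Restart dockerd so it picks up the new default-runtime setting
sudo systemctl restart docker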

What I understood

Nvidia-Container-Runtime is essentially a wrapper over runc: if the --gpus flag or the NVIDIA_VISIBLE_DEVICES environment variable is passed to docker run, a custom hook detects it [1][2] and switches that container to the Nvidia runtime; otherwise the default runc/containerd is used. [3][4]
This is a workaround because Nvidia might change the default behaviour of the runtime's custom hook, or replace the wrapped runtime with something dedicated, so it might break in the future. The mechanism is more or less experimental [5], but it should get standardized over time since it is based on the Open Container Initiative's specs.

I think that by making the Nvidia runtime the default, the GPUs are connected to every container created; but since a given container image may not have the drivers, the Nvidia runtime effectively behaves like plain runc/containerd for those containers. (Assumption on my end, could be wrong.)
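
As a per-container alternative to changing the default runtime, the same switch the hook performs can be requested explicitly (a sketch; "all" is just an example device selection, and the image is the CUDA base image used below):

# Select the nvidia runtime for one container and expose GPUs via the env var
# that the hook looks for, instead of relying on the --gpus flag
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:11.0-base nvidia-smi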

To check which runtimes are available to Docker, run the following command; it lists all the runtimes visible from inside the agent container.
docker info | grep -i runtime

Output
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: nvidia
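
The same information can also be queried without grep, using docker info's Go-template formatting (a sketch; the field names are assumed from the docker info output structure):

# Default runtime only
docker info --format '{{.DefaultRuntime}}'

# All registered runtimes as JSON
docker info --format '{{json .Runtimes}}'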

Once the nvidia runtime has been made the default, the pipelines work regardless of whether the --gpus flag or the NVIDIA_VISIBLE_DEVICES environment variable is passed.

E.g. Working Pipeline 1

steps:
  - label: "Testing Mnist on Docker from Containerized Agent"
    agents:
    - "gpu=true"
    command:
      - python mnist.py
    plugins:
      - docker#v3.3.0:
          image: "nvcr.io/nvidia/pytorch:18.05-py3"

Working Pipeline 2

steps:
  - label: "Nvidia-SMI"
    agents:
    - "gpu=true"
    command:
      - nvidia-smi
    plugins:
      - docker#v3.3.0:
          image: "nvidia/cuda:11.0-base"

This is a workaround and not a solution, hence I am not closing the issue.
I will wait for the team to figure out why the --gpus flag is missing from the docker run command the agent generates, even though the pipeline's steps file defines the gpus option.


Footnotes

  1. Open Container Initiative - Runtime Spec

  2. Open Container Initiative - Runtime Spec - CreateRuntime Hook

  3. Nvidia Container Runtime README

  4. Nvidia Container Runtime README - Docker CLI

  5. Nvidia Container Runtime README - Experimental Mode

toote commented

Hi @scapoor! Sorry about the delay in getting back to you, especially after all the time you appear to have dedicated to this. We really appreciate all the information you provided; it is extremely thorough 😍. Rest assured that I went over the whole thing several times to make sure I understood your environment and configuration.

The behaviour described is indeed extremely weird... especially the part where you mentioned you were able to make it work. I believe the missing --gpus option stems from the fact that the gpus option was added to the plugin in version 3.10.0, while your examples indicate that you are using version 3.3.0. That means the option is silently ignored and never added to the command.

If you think that is not it, please re-open this issue so that we can debug further!

@toote Thanks for the explanation. It was a "😲" moment, considering I completely missed checking the code for the --gpus option. The reason I was able to make it work is that I tweaked the Docker environment to use all available GPUs by default, which worked around the missing gpus option in the plugin. 😅 Did it the hard way, I guess. 🤣

Anyway, I'm really grateful for the explanation. I'll definitely try the newer plugin version after removing the workaround.