buildkite-plugins/docker-buildkite-plugin

Does not work when running buildkite agent from docker container

adragoset opened this issue · 20 comments

This pattern will never work when running the Buildkite agent from the official container.
Suppose the agent is run from the buildkite/agent image and the host's Docker socket is passed in via a host mount (/var/run/docker.sock:/var/run/docker.sock). In this situation the agent and the plugin are both using the host's Docker installation, but the workdir folder only exists in the agent container, not on the host. This causes the plugin's workdir bind mount to point at a nonexistent host folder. The only correct way to do this is to create a Docker volume and copy the agent's build context into it; that volume would then be mounted into the plugin's container. Additionally, the agent container would have to be responsible for cleaning up the volumes of failed or completed builds.
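To make the failure mode concrete, here's a minimal sketch of the setup described above (paths are illustrative):

```shell
# The agent container shares the host's Docker daemon via the socket:
docker run -d \
  -v /var/run/docker.sock:/var/run/docker.sock \
  buildkite/agent

# Later, a plugin inside that container effectively runs:
#   docker run -v "$PWD:/workdir" some-image ...
# But $PWD (e.g. /buildkite/builds/org/pipeline) only exists inside the
# agent container. The *host* daemon performs the bind mount, finds no
# such directory on the host, and the build container ends up with an
# empty workdir instead of the checkout.
```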

lox commented

This pattern will never work when running the Buildkite agent from the official container.

Indeed, it's something we've been trying to figure out a holistic solution to. We should mention that in those docs; apologies that you've run into it. As you say, the issue is that lots of our plugins mount $PWD, which will be /buildkite/builds/blah/blah, which won't exist on the host.

There are a few solutions. The simplest is to volume-mount the builds directory from the host into the agent container, so that /var/lib/buildkite-agent/builds from the host is mounted in the same place in the container. You will also need /usr/local/bin/buildkite-agent on the host to make it work. This approach unfortunately falls apart in Kubernetes or on Google ContainerOS, as the root partition is mounted read-only and no-exec.
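A sketch of that simplest approach, assuming the standard Linux agent install paths (the token variable is illustrative):

```shell
# Run the containerised agent with the builds directory and agent binary
# bind-mounted at the *same* paths they occupy on the host, so that $PWD
# inside a job resolves to a real host path.
docker run -d \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/buildkite-agent/builds:/var/lib/buildkite-agent/builds \
  -v /usr/local/bin/buildkite-agent:/usr/local/bin/buildkite-agent \
  -e BUILDKITE_AGENT_TOKEN="$BUILDKITE_AGENT_TOKEN" \
  buildkite/agent start --build-path /var/lib/buildkite-agent/builds
```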

The next, more complicated, option is to support an env var specifying which volumes to mount (see buildkite-plugins/docker-compose-buildkite-plugin#157 for more context). This would be something like BUILDKITE_DOCKER_DEFAULT_VOLUMES=buildkite-builds:/buildkite/builds or similar, and all plugins would have to respect this rather than just assuming $PWD can be mounted. This gets complicated. /cc @asford
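To illustrate the idea, here's a rough sketch of how a plugin might expand such an env var into `docker run -v` arguments (the delimiter and variable handling here are illustrative, not the plugins' actual implementation):

```shell
# A named volume shared by the agent and build containers:
BUILDKITE_DOCKER_DEFAULT_VOLUMES="buildkite-builds:/buildkite/builds"

# Split on ';' and turn each entry into a "-v" flag for docker run.
args=()
IFS=';' read -ra vols <<< "$BUILDKITE_DOCKER_DEFAULT_VOLUMES"
for v in "${vols[@]}"; do
  args+=("-v" "$v")
done

echo "${args[@]}"   # -v buildkite-builds:/buildkite/builds
```

A plugin honouring the variable would append these flags instead of (or alongside) the usual `$PWD` mount.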

The other approach is to use https://github.com/buildkite/sockguard or a similar socket proxy to rewrite paths, but that is also very complicated as $PWD might be a sub-dir of a volume, which makes it tricky to mount into a container.

The idea of the agent managing docker volumes for builds is actually a really interesting one that I haven't considered before 🤔 We would still need a way for docker-using plugins to figure out what volume to use to mount in the checkout dir.

We manage this in our environment by establishing a single buildkite external volume, covering the entire /buildkite directory. We then mount this into the agent and all docker-compose based build steps via BUILDKITE_DOCKER_DEFAULT_VOLUMES, so that the workdir is available in all contexts at the same path.
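A rough sketch of that setup (volume name and env values are illustrative, not taken from the linked repo):

```shell
# One external volume holds the whole /buildkite tree.
docker volume create buildkite

# The agent mounts it, and advertises it to docker-compose-based steps
# via BUILDKITE_DOCKER_DEFAULT_VOLUMES, so every container sees the
# checkout at the same /buildkite path.
docker run -d \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v buildkite:/buildkite \
  -e BUILDKITE_AGENT_TOKEN="$BUILDKITE_AGENT_TOKEN" \
  -e BUILDKITE_BUILD_PATH=/buildkite/builds \
  -e BUILDKITE_DOCKER_DEFAULT_VOLUMES="buildkite:/buildkite" \
  buildkite/agent
```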

We've an example at https://github.com/uw-ipd/buildkite-agent, which is unfortunately a little muddled due to our support for cuda & non-cuda build agents. This same tactic would probably work on a kube-based deployment, but I haven't verified that fact.

@adragoset If you're currently blocked I'd be happy to help add support for this configuration to this plugin.

@lox @asford Thanks for the quick response. I think I'll play around with the method mentioned by @asford: mount a master volume to the agent that holds the entire /buildkite directory, and then subsequently attempt mounting that into the docker plugin container. If @asford wants to update the config for this plugin that would be great, or I can submit a PR once I get it figured out. I'm still getting some of my environment set up and purchasing licenses and whatnot, so it may take me a couple of days to get around to updating the plugin myself. Right now I'm just mounting docker into the agent container and calling docker directly in build scripts to build containers for the test projects I converted to use Buildkite for testing agent deployment into my orchestration system.

@adragoset Sounds good. I'm happy to help you port that functionality to this plugin, but won't be able to take point on that change this week. Feel free to @ me on anything related to that work. buildkite-plugins/docker-compose-buildkite-plugin#157 should (hopefully) provide a good skeleton.

🤔 @lox Do you think it'd be worth creating an end-to-end description of this style of deployment somewhere in/around the Buildkite documentation? It seems like folks who gravitate toward the docker-based agent install are the most likely customers for docker-based build steps. The fact that these two related modes are barely compatible out of the box is a bit confusing, and having the solution scattered across PR and issue threads isn't super discoverable.

lox commented

@asford Yup, I agree we need to do a better job of clarifying it. We've been trying to figure out a good solution to the problem, but in the meantime we need to at least call it out in the documentation.

Is there any update on this issue? The documentation on the website doesn't seem to reflect this problem and I have spent the day trying to work out what I had done wrong.

lox commented

Sorry you've spent a day on this @peterbygrave! It's a difficult docker problem unfortunately, can you tell us a bit more about the specific issues you are facing? Did any of the suggestions in this thread help?

So I shifted to running the native buildkite agent to get it up and running. I will revisit this when I am sure I have my docker-plugin steps all working.

I would echo @asford's comment that an end-to-end description is needed before I would be comfortable acting on a solution.

@lox

This approach unfortunately falls apart in Kubernetes or on Google ContainerOS, as the root partition is mounted read-only and no-exec.

Do you know if this behaviour is documented anywhere in the Kubernetes docs? I'm wondering if it can be disabled/modified to make the simple volume-mount solution work for me for now. I'm using a Kubernetes cluster on Google Kubernetes Engine.

Oh my bad, not really solved, just prevents the error.

Wait no it is solved, at least for k8s.

Did you set it up manually @nhooyr, or did you use the Helm chart? Would be good to share your setup, for future travelers to this ticket.

It's not really solved, that was my bad. On Google Kubernetes Engine, instead of mounting /var/buildkite/builds inside the agent (see buildkite/agent#729 (comment)), if you mount /home/kubernetes/flexvolume/buildkite/builds instead, things should work. I'm not quite sure why this directory (/home/kubernetes/flexvolume) isn't documented anywhere. I know it's the Kubernetes FlexVolume plugin directory, but GKE doesn't document that it uses this over the standard one. This is also a very janky solution and probably not stable.
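A hedged sketch of that GKE workaround, keeping the host and container paths identical as the earlier same-path approach requires (the flexvolume path is undocumented, as noted, so treat this as fragile):

```shell
# Bind-mount a node path that is writable and exec on GKE nodes, at the
# same path inside the agent container, so bind mounts issued through
# the host's Docker daemon resolve correctly.
docker run -d \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /home/kubernetes/flexvolume/buildkite/builds:/home/kubernetes/flexvolume/buildkite/builds \
  buildkite/agent start --build-path /home/kubernetes/flexvolume/buildkite/builds
```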

Unfortunately you'll still see issues like this plugin trying to mount the buildkite-agent binary through to the container and failing, because it doesn't exist on the host at /usr/local/bin/buildkite-agent.

I don't seem to have any of the issues that are coming up here, but I do have a problem trying to use buildkite-agent. Whenever I try to upload artifacts from the docker container (buildkite-agent artifact upload ...), I get the following error: /bin/sh: buildkite-agent: Permission denied

Please take a look at buildkite/charts#54 as it might be a solution to the issue.

Hey folks! I believe this issue is fixed, but please let us know if you are still encountering it 🙏🏻 and we'll open the issue again. Thanks!

I think to fully resolve this @pzeballos we might need to add a section to https://buildkite.com/docs/agent/v3/docker about caveats/limitations/workarounds for using this plugin (and the Docker Compose plugin) on Docker-based agents, and then link to it from this issue and the readmes of both plugins.

Oh! ok, thanks @toolmantim! I'll create PRs to update those docs and link to this issue. Thanks!

toote commented

@toolmantim there is a warning in the official agent running in docker stating:

If your build jobs require Docker access, and you're passing through the Docker socket, you must ensure the build path is consistent between the Docker host and the agent container. See Allowing builds to use Docker for more details.

And the link goes to a separate section that describes in detail the issues with volume mounting.

As per your suggestion, I created #210 to add a warning in this plugin's readme file linking to those docs