nestybox/sysbox-ee

Bind mount of `/run` or `/var/run` into container causes host network to become unresponsive

michelkaeser opened this issue · 13 comments

Hello

I tried to make the sysbox runtime work for our Jenkins setup, but it ends up crashing the whole Docker daemon.

System information

  • AWS EC2 Instance
  • Ubuntu 18.04 Server (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200112 (ami-0b418580298265d5c))
  • Upgraded kernel as documented via backports (uname -r = 5.3.0-40-generic)

Jenkins Volumes (run directly on host using regular Docker runtime)

  • /var/run/docker.sock:/var/run/docker.sock
  • jenkins-workspace:/var/jenkins_home
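
For reference, the master is started roughly like this (the image name is illustrative, not necessarily the exact one we use):

# image name is illustrative
docker run -d \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v jenkins-workspace:/var/jenkins_home \
  jenkins/jenkins:lts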

Jenkins Pipeline

pipeline {
    ...
    agent {
        docker {
            image 'censored:latest'
            args  '--runtime=sysbox-runc -v jenkins-maven-repository:/home/jenkins/.m2'
        }
    }
    ...
}

For completeness, this is what Jenkins tries to run:

docker run -t -d -u 1000:1000 --runtime=sysbox-runc -v jenkins-maven-repository:/home/jenkins/.m2 -w /var/jenkins_home/jobs/maven-pipeline-ng/workspace --volumes-from 9f65cddc6f6cf5951e78cc7bbd67f64040b5877919c8860d0d550a1c6a3905cf -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** censored:latest cat

So basically, the Jenkins container (which is a regular one) should start pipeline agents with the sysbox runtime -> the agents will be siblings of Jenkins but able to have an isolated Docker inside.


This leads to the following error however:

Failed to run image 'censored:latest'. Error: docker: Error response from daemon: OCI runtime create failed: container_linux.go:364: starting container process caused "process_linux.go:474: container init caused \"process_linux.go:441: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: time=\\\\\\\"2020-03-06T13:50:22Z\\\\\\\" level=fatal msg=\\\\\\\"dial unix /var/run/docker/libnetwork/c0d26181f7af.sock: connect: connection refused\\\\\\\"\\\\n\\\"\"": unknown.

The whole server (not only Docker, literally everything) starts to behave strangely afterwards, and the only solution is to perform a reboot via the AWS console (which doesn't really work either). For instance, running reboot -h now results in:

Failed to connect to bus: Connection refused
Failed to open /dev/initctl: No such device or address
Failed to talk to init daemon.

Manually creating censored:latest containers within the Jenkins container works, and Docker within such a container works as well when done without the -v jenkins-maven-repository:/home/jenkins/.m2 -w /var/jenkins_home/jobs/maven-pipeline-ng/workspace --volumes-from 9f65cddc6f6cf5951e78cc7bbd67f64040b5877919c8860d0d550a1c6a3905cf -e flags - so the problem must be with one of these.

Currently, access to the server is not possible (it's stuck and cannot be stopped), so I cannot further narrow down which of the flags alone is the problem (maybe --volumes-from?). I will try once I regain access to the server, but hopefully the provided information is enough anyway.

Thanks.

Hi Michel,

Thanks for trying Nestybox and for filing the issue with the detailed write-up.

I'll take a look at the problem a bit later today. We've not tried the configuration you are using (Jenkins master using a regular Docker container + Jenkins agents using Nestybox system containers), but it should work. The docker run --runtime=sysbox-runc command that Jenkins is issuing looks fine.

By the way, we have a blog article describing a different way to set up Jenkins using Docker + Nestybox, one which embeds the Jenkins master, Docker daemon, and Jenkins agents all within a system container.
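
For reference, launching such an all-in-one system container looks roughly like this (the image name and port are illustrative; the blog article has the exact steps):

# image name and port mapping are placeholders
docker run --runtime=sysbox-runc -d -p 8080:8080 --name jenkins-syscont some-jenkins-syscont-image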

In any case, the configuration you are using should work too, so I'll take a look to see what's happening.

Thanks!

Starting to investigate this issue.

The first thing I tried is the --volumes-from Docker option, and I don't see any issues with it:

I created a Docker volume and mounted it into a regular Docker container. Then I created a system container with Docker + Sysbox using the --volumes-from option. It works without problem:

cesar@disco1:$ docker volume ls
DRIVER              VOLUME NAME
local               testvol

cesar@disco1:$ docker run -d -v testvol:/mnt/testvol alpine tail -f /dev/null
d22ff7593da5b7da877ffb48532b88bd393aa33850db23f0dce9709032e2f3b6

cesar@disco1:$ docker ps
CONTAINER ID        IMAGE               COMMAND               CREATED             STATUS              PORTS               NAMES
d22ff7593da5        alpine              "tail -f /dev/null"   19 seconds ago      Up 19 seconds                           goofy_murdock

cesar@disco1:$ docker run --runtime=sysbox-runc -it --volumes-from goofy_murdock alpine 
/ # ls -l /mnt
total 4
drwxr-xr-x    2 root     root          4096 Mar  6 16:30 testvol
/ # ls -l /mnt/testvol
total 0

I suspect the problem is somewhere else. Will try to repro it using the same Jenkins setup reported by Michel.

For sanity, I verified that the -v, -w, and -e options to docker run also work as expected when using the sysbox runtime:

chino@disco1:$ docker run --runtime=sysbox-runc -it -v testvol2:/mnt/testvol2 --volumes-from goofy_murdock -w /mnt/testvol -e TESTVAR=someval alpine 

/mnt/testvol # tree /mnt
/mnt
├── testvol
└── testvol2

2 directories, 0 files

/mnt/testvol # findmnt | grep testvol
├─/mnt/testvol                        /mnt/sdb/docker/volumes/testvol/_data                                                                     shiftfs  rw,relatime
├─/mnt/testvol2                       /mnt/sdb/docker/volumes/testvol2/_data                                                                    shiftfs  rw,relatime

/mnt/testvol # echo $TESTVAR
someval

I was able to reproduce the problem locally following Michel's instructions.

Once the error occurs, the network subsystem of the machine is gone (there is no network connection, no connection to dockerd, no connection to systemd, etc.)

I looked at the config.json that the Sysbox runtime is receiving from Docker. I suspect the problem is related to the following prestart hook.

{
    "hooks": {
        "prestart": [
            {
                "args": [
                    "libnetwork-setkey",
                    "-exec-root=/var/run/docker",
                    "7c4d941b7d22243e9d99d8a3f0515ebc3f91e8953326f0e6c59149d8c922a74a",
                    "2c47a7115c6c08cd1e41ad56b84a0895fe182b92d068bdfbfcb13a327e7f6957"
                ],
                "path": "/proc/7138/exe"
            }
        ]
    },

I'll investigate whether this is the culprit and why it causes the machine's network subsystem to go down so badly.

But I think there is another more basic issue at play, and it has to do with the setup of the Jenkins master: since the master is running inside a container, it's generating Docker commands that make sense within the container's context. But since the Docker daemon runs at host level, those Docker commands are executing at host level; the commands don't always make sense within that context. I'll confirm this too. The blog post I mentioned earlier sets up things in a way that solves the context related problems.

Following up on this:

But I think there is another more basic issue at play, and it has to do with the setup of the Jenkins master: since the master is running inside a container, it's generating Docker commands that make sense within the container's context. But since the Docker daemon runs at host level, those Docker commands are executing at host level; the commands don't always make sense within that context. I'll confirm this too.

For example, I can see that Docker is telling the sysbox runtime to mount the host's docker socket (/var/run/docker.sock) into the jenkins agent container:

        {
            "destination": "/var/run/docker.sock",
            "options": [
                "rbind",
                "rprivate"
            ],
            "source": "/var/run/docker.sock",
            "type": "bind"
        }

This mount occurs because the Jenkins agent container inherits mounts from the Jenkins master (i.e., docker run --volumes-from option), and the host's docker socket is mounted on the Jenkins master.

But this is not what you want on the Jenkins agent, because in this case the Jenkins agent container is a system container with a dedicated Docker daemon in it. Thus, mounting the host's Docker socket into the Jenkins agent would defeat the purpose.
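
You can verify which mounts the agent will inherit by inspecting the Jenkins master container with something along these lines (the container name is just a placeholder):

# <jenkins-master-container> is a placeholder for the actual container name or ID
docker inspect --format '{{ json .Mounts }}' <jenkins-master-container>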

@michelkaeser: I suggest you set up things as explained in this Nestybox blog article, as it avoids the problem described above and is a cleaner solution in my view.

Having said this, I'll continue to investigate why the failure was so drastic that it caused the machine's network subsystem to go down.

Following up on:

I suspect the problem is related to the following prestart hook.

I did an experiment where I disabled the prestart hook processing in the sysbox runtime, but the problem still reproduces. This means the prestart hook is not the culprit.

I'll continue to investigate why the failure was so drastic that it caused the machine's network subsystem to go down.

Found the culprit:

The Jenkins agent container is run with sysbox-runc. This agent inherits its volume mounts from the Jenkins master (i.e., docker run --runtime=sysbox-runc --volumes-from ...). The Jenkins master has a mount from the host's /var/run/docker.sock into the same directory inside the container. Thus, the Jenkins agent will also have such a mount.

When sysbox-runc is asked to bind-mount directories from the host into the container, it mounts the Ubuntu shiftfs filesystem onto said directories. In this case, sysbox-runc is mounting shiftfs onto /var/run on the host. But this mount has the side effect of making the directory non-executable, which pretty much kills the host since many of its services are listening via sockets inside /var/run.

It's a bug in sysbox-runc: it has a list of host directories on which it should never mount shiftfs, and /var/run (as well as /run) are currently not on that list. They should be.
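
On an affected host, the stray shiftfs mount should be visible with something like:

# on an affected host, /run or /var/run shows up in this list
findmnt -t shiftfs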

The reason we didn't stumble into this problem in the past is that we never mount the host's /var/run/docker.sock (or any other file under /var/run/) into a system container in our tests. The reasoning is that a system container acts as a virtual host, so it should have its own Docker daemon and /var/run, totally isolated from the host's.

In Michel's case however, the Jenkins setup is such that it causes an implicit bind-mount of the host's /var/run/docker.sock into the Jenkins agent system container. But this setup is prone to the context-related problems I mentioned in a prior comment.

I think the way to overcome the problem is:

  1. Modify the Jenkins setup by placing the Jenkins master, Docker daemon, and agents all inside a system container as shown in this blog article. This way you completely avoid this issue, as well as other context-related problems that arise from having the Jenkins master running in a different context than the containers it creates.

  2. Nestybox will make a fix in sysbox-runc to ensure it never mounts shiftfs over /run or /var/run on the host, to avoid rendering the host unusable as reported in this issue. This change will come in the next Sysbox release (in a few weeks from now).

  3. Avoid bind-mounting /var/run or /run into a system container (until the fix for (2) is available). Such mounts should normally not be required as described above.

If for whatever reason you hit the problem and the host network becomes unresponsive, you can bring it back by unmounting shiftfs from the /var/run or /run directory.
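
For example, something like the following should restore the host (use findmnt to check which of the two directories has the stray shiftfs mount, then unmount that one):

# unmount shiftfs from whichever of the two directories it ended up on
sudo umount /var/run
sudo umount /run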

Modified the title of the issue now that we better understand it.

Nestybox will make a fix in sysbox-runc to ensure it never mounts shiftfs over /run or /var/run on the host, to avoid rendering the host unusable as reported in this issue. This change will come in the next Sysbox release (in a few weeks from now).

FYI: the fix was committed to our internal repo; will be present in the next Sysbox release.

Let's keep the issue open until the next release occurs, at which point we can close it.

Hi Cesar. Glad you were able to track down the problem with the information I provided. That sounds like an interesting track you had to go down to finally spot the issue :)

Regarding the Jenkins blog post you mentioned: I was aware of it before, but I think the setup described there does not fit our needs. The goal is to have pipelines with their own dedicated Docker setup so they don't conflict with others. When the agents use the system container's Docker, two pipelines still share the same stack.

I will try - but maybe you already know - would the desired setup be possible if Jenkins is also a system container, so that the pipelines are nested system containers rather than siblings?

Hi Michel,

Yes, glad we found it; thanks again for reporting it, much appreciated.

The goal is to have pipelines having their own dedicated Docker setup so they don't conflict with others.

I see, thanks for the clarification.

would the desired setup be possible if Jenkins is also a system container so the pipeline's are nested system containers rather than siblings?

This won't work unfortunately, as we don't yet support nesting of system containers. That's because our sys containers require the sysbox runtime, and the sysbox runtime requires true root privileges on the host, so we can't run the runtime inside a system container at this time.

Having said that, we are working on adding support for Nestybox sys containers to run privileged containers inside, which would mean that you can do "docker-in-docker" inside the system container.

I think this would allow you to accomplish your goal of giving each pipeline a dedicated Docker daemon: the Jenkins master and Docker would run in a sys container, and the Jenkins agents would be privileged containers deployed inside the sys container. Each of those privileged containers can run Docker inside, meaning that each pipeline gets a dedicated Docker daemon.
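
For illustration, once that support is in place, the agents would be launched from inside the sys container roughly like this (the agent image name is a placeholder):

# run from a shell inside the sys container; the agent image name is a placeholder
docker run -d --privileged --name pipeline-agent-1 some-jenkins-agent-with-docker-image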

With a bit of luck, our upcoming release will have this. By the way, this upcoming release will have preliminary support for running K8s inside sys containers (in case that's interesting to you).

Thanks again for giving Nestybox a shot!

Another solution that comes to mind is to run the Jenkins master at host level (i.e., not inside a docker container) and run the Jenkins agents inside system containers.

Though I've not tried it, I think this will work because I suspect the docker run --runtime=sysbox-runc command that the Jenkins master generates won't have the --volumes-from or -u 1000:1000 flags in it. This means that the Jenkins agent sys containers will run fine and each pipeline step gets a dedicated sys container.
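
For illustration, based on the command from the original report, the generated invocation would then look roughly like this (image name and paths as in Michel's example, environment flags omitted):

docker run -t -d --runtime=sysbox-runc \
  -v jenkins-maven-repository:/home/jenkins/.m2 \
  -w /var/jenkins_home/jobs/maven-pipeline-ng/workspace \
  censored:latest cat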

I guess whether to use this approach or the one described in the prior comment depends on what parts of the setup you want to isolate with sys containers. If you want to isolate just the workloads inside the Jenkins agents, the approach in this comment works well. If you want to isolate the entire Jenkins setup from the host (master & agents), the approach in the prior comment seems better.

When sysbox-runc is asked to bind-mount directories from the host into the container, it mounts the Ubuntu shiftfs filesystem onto said directories. In this case, sysbox-runc is mounting shiftfs onto /var/run on the host. But this mount has the side effect of making the directory non-executable, which pretty much kills the host since many of its services are listening via sockets inside /var/run.

It's a bug in sysbox-runc: it has a list of host directories on which it should never mount shiftfs, and /var/run (as well as /run) are currently not on that list. They should be.

Sysbox v0.2.0 fixes this problem (adds /var/run and /run to its shiftfs blacklist).

Closing.