lucaslorentz/caddy-docker-proxy

Container id extraction logic breaks on docker with cgroup2

HeavenVolkoff opened this issue · 7 comments

Docker 20.10 gained support for cgroup v2.

However, when running this Caddy plugin in a container on Docker with cgroup v2, the current logic for container ID extraction fails.

As far as I can tell, the problem seems to be in this part of the code:

// GetCurrentContainerID returns the id of the container running this application
func (wrapper *dockerUtils) GetCurrentContainerID() (string, error) {
    if runtime.GOOS == "windows" {
        return os.Hostname()
    }

    bytes, err := ioutil.ReadFile("/proc/self/cgroup")
    if err != nil {
        return "", err
    }
    if len(bytes) == 0 {
        return "", errors.New("Cannot read /proc/self/cgroup")
    }

    return wrapper.ExtractContainerID(string(bytes))
}

func (wrapper *dockerUtils) ExtractContainerID(cgroups string) (string, error) {
    idRegex := regexp.MustCompile(`(?i):[^:]*\bcpu\b[^:]*:[^/]*/.*([[:alnum:]]{64}).*`)
    matches := idRegex.FindStringSubmatch(cgroups)
    if len(matches) == 0 {
        return "", fmt.Errorf("Cannot find container id in cgroups: %v", cgroups)
    }

    return matches[len(matches)-1], nil
}

This seems to be based on a known hack for getting container information from within the container itself.

The problem is that with cgroup v2, the file /proc/self/cgroup doesn't contain the previously available information.

Below is the content of the file in a container running on a Docker with cgroup v2:

$> cat /proc/self/cgroup
0::/
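
With that content, the regex in ExtractContainerID above has nothing to match: there is no cpu controller entry and no 64-character id. A standalone check (illustrative only, not part of the plugin) confirms it:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    idRegex := regexp.MustCompile(`(?i):[^:]*\bcpu\b[^:]*:[^/]*/.*([[:alnum:]]{64}).*`)
    // /proc/self/cgroup content seen inside a container on a cgroup v2 host.
    fmt.Println(idRegex.FindStringSubmatch("0::/")) // prints [], i.e. no match
}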

Digging through the other files in /proc/self, there seems to be another (hacky) way of retrieving the container id.

The file /proc/self/mountinfo contains the id in the mount paths for the /etc/resolv.conf, /etc/hostname, /etc/hosts files:

$> cat /proc/self/mountinfo
1582 1433 0:29 /@/var/lib/docker/btrfs/subvolumes/3388737b99462f26ff76d17105b237d2c2f312338f3c6585af52d8745add9af1 / rw master:1 - btrfs /dev/sda2 rw,lazytime,compress=zstd:3,ssd,discard=async,space_cache,subvolid=4409,subvol=/@/var/lib/docker/btrfs/subvolumes/3388737b99462f26ff76d17105b237d2c2f312338f3c6585af52d8745add9af1
1583 1582 0:189 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
1584 1582 0:190 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64
1585 1584 0:191 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666
1586 1582 0:192 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs ro
1587 1586 0:25 / /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - cgroup2 cgroup rw,nsdelegate,memory_recursiveprot
1588 1584 0:188 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
1589 1584 0:193 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,size=65536k,inode64
1590 1582 0:29 /@/usr/bin/docker-init /sbin/docker-init ro master:1 - btrfs /dev/sda2 rw,lazytime,compress=zstd:3,ssd,discard=async,space_cache,subvolid=256,subvol=/@
1591 1582 0:187 / /tmp rw,nosuid,nodev,noexec,relatime master:645 - tmpfs tmpfs rw,size=262144k,mode=777,inode64
1592 1582 0:75 / /config rw,relatime master:666 - nfs4 :/caddy/config rw,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.0.2,local_lock=none,addr=10.0.0.2
1593 1582 0:75 / /data rw,relatime master:673 - nfs4 :/caddy/data rw,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.0.2,local_lock=none,addr=10.0.0.2
1594 1582 0:75 / /var/log rw,relatime master:659 - nfs4 :/logs/caddy rw,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.0.2,local_lock=none,addr=10.0.0.2
1595 1582 0:29 /@/var/lib/docker/containers/fcc626d54fda7061cc76bbdb0a0d992bc7ea4864b88340336857c43fd37b3df0/resolv.conf /etc/resolv.conf rw - btrfs /dev/sda2 rw,lazytime,compress=zstd:3,ssd,discard=async,space_cache,subvolid=256,subvol=/@
1596 1582 0:29 /@/var/lib/docker/containers/fcc626d54fda7061cc76bbdb0a0d992bc7ea4864b88340336857c43fd37b3df0/hostname /etc/hostname rw - btrfs /dev/sda2 rw,lazytime,compress=zstd:3,ssd,discard=async,space_cache,subvolid=256,subvol=/@
1597 1582 0:29 /@/var/lib/docker/containers/fcc626d54fda7061cc76bbdb0a0d992bc7ea4864b88340336857c43fd37b3df0/hosts /etc/hosts rw - btrfs /dev/sda2 rw,lazytime,compress=zstd:3,ssd,discard=async,space_cache,subvolid=256,subvol=/@
1598 1582 0:75 / /usr/share/caddy rw,relatime master:652 - nfs4 :/caddy/static rw,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.0.2,local_lock=none,addr=10.0.0.2
1434 1583 0:189 /bus /proc/bus ro,relatime - proc proc rw
1453 1583 0:189 /fs /proc/fs ro,relatime - proc proc rw
1496 1583 0:189 /irq /proc/irq ro,relatime - proc proc rw
1497 1583 0:189 /sys /proc/sys ro,relatime - proc proc rw
1498 1583 0:189 /sysrq-trigger /proc/sysrq-trigger ro,relatime - proc proc rw
1499 1583 0:194 / /proc/asound ro,relatime - tmpfs tmpfs ro,inode64
1500 1583 0:195 / /proc/acpi ro,relatime - tmpfs tmpfs ro,inode64
1501 1583 0:190 /null /proc/kcore rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64
1502 1583 0:190 /null /proc/keys rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64
1503 1583 0:190 /null /proc/latency_stats rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64
1504 1583 0:190 /null /proc/timer_list rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64
1505 1583 0:190 /null /proc/sched_debug rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64
1506 1583 0:196 / /proc/scsi ro,relatime - tmpfs tmpfs ro,inode64
1507 1586 0:197 / /sys/firmware ro,relatime - tmpfs tmpfs ro,inode64

The logic for extracting the container id could be extended to parse this file when the information isn't available in /proc/self/cgroup. However, I don't know how reliable this solution would be, as I haven't tested enough to be certain these mount paths are always available regardless of the options a container is run with.
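
For reference, a minimal sketch of what such a fallback could look like (the function and variable names here are made up for illustration, not the plugin's code); it depends entirely on those bind-mount entries being present:

package main

import (
    "fmt"
    "os"
    "regexp"
)

// mountInfoIDRegex looks for the 64-character hex id in paths like
// /var/lib/docker/containers/<id>/resolv.conf that Docker bind-mounts
// over /etc/resolv.conf, /etc/hostname and /etc/hosts.
var mountInfoIDRegex = regexp.MustCompile(`/var/lib/docker/containers/([[:xdigit:]]{64})/`)

// extractContainerIDFromMountInfo returns the first container id found in the
// given mountinfo contents, or an error if none is present.
func extractContainerIDFromMountInfo(mountinfo string) (string, error) {
    matches := mountInfoIDRegex.FindStringSubmatch(mountinfo)
    if len(matches) < 2 {
        return "", fmt.Errorf("cannot find container id in mountinfo")
    }
    return matches[1], nil
}

func main() {
    data, err := os.ReadFile("/proc/self/mountinfo")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    id, err := extractContainerIDFromMountInfo(string(data))
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println(id)
}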

This also breaks the {{upstreams}} token in labels of containers using ingress networks. That is because the container id is used to resolve which ingress networks the caddy container is connected to. So, anyone trying to use this with cgroups v2 should avoid using {{upstreams}} for now.

pwFoo commented

Oh, I plan a docker update. So following here...

The /proc/self/mountinfo file won't work as a reliable solution to this problem. After a couple more tests, I was able to completely remove all container id references from it:

$> docker run --rm -it -v "$(pwd)/empty":/etc/resolv.conf -v "$(pwd)/empty":/etc/hostname -v "$(pwd)/empty":/etc/hosts -v "$(pwd)":/data -v "$(pwd)":/config --entrypoint cat lucaslorentz/caddy-docker-proxy:2.3-alpine /proc/self/mountinfo
1629 1448 0:29 /@/var/lib/docker/btrfs/subvolumes/a8c73e312a758780b3dafd09d3064f578756afb2910e9a564c5f4ebbeccc825a / rw master:1 - btrfs /dev/sda2 rw,lazytime,compress=zstd:3,ssd,discard=async,space_cache,subvolid=4442,subvol=/@/var/lib/docker/btrfs/subvolumes/a8c73e312a758780b3dafd09d3064f578756afb2910e9a564c5f4ebbeccc825a
1630 1629 0:206 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
1631 1629 0:207 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64
1632 1631 0:208 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666
1633 1629 0:209 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs ro
1634 1633 0:25 / /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - cgroup2 cgroup rw,nsdelegate,memory_recursiveprot
1635 1631 0:205 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
1636 1631 0:210 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,size=65536k,inode64
1637 1629 0:29 /@home/admin /config rw - btrfs /dev/sda2 rw,lazytime,compress=zstd:3,ssd,discard=async,space_cache,subvolid=265,subvol=/@home
1638 1629 0:29 /@home/admin /data rw - btrfs /dev/sda2 rw,lazytime,compress=zstd:3,ssd,discard=async,space_cache,subvolid=265,subvol=/@home
1639 1629 0:29 /@home/admin/empty /etc/resolv.conf rw - btrfs /dev/sda2 rw,lazytime,compress=zstd:3,ssd,discard=async,space_cache,subvolid=265,subvol=/@home
1640 1629 0:29 /@home/admin/empty /etc/hostname rw - btrfs /dev/sda2 rw,lazytime,compress=zstd:3,ssd,discard=async,space_cache,subvolid=265,subvol=/@home
1641 1629 0:29 /@home/admin/empty /etc/hosts rw - btrfs /dev/sda2 rw,lazytime,compress=zstd:3,ssd,discard=async,space_cache,subvolid=265,subvol=/@home
1449 1631 0:208 /0 /dev/console rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666
1450 1630 0:206 /bus /proc/bus ro,relatime - proc proc rw
1451 1630 0:206 /fs /proc/fs ro,relatime - proc proc rw
1452 1630 0:206 /irq /proc/irq ro,relatime - proc proc rw
1453 1630 0:206 /sys /proc/sys ro,relatime - proc proc rw
1454 1630 0:206 /sysrq-trigger /proc/sysrq-trigger ro,relatime - proc proc rw
1455 1630 0:211 / /proc/asound ro,relatime - tmpfs tmpfs ro,inode64
1456 1630 0:212 / /proc/acpi ro,relatime - tmpfs tmpfs ro,inode64
1457 1630 0:207 /null /proc/kcore rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64
1458 1630 0:207 /null /proc/keys rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64
1459 1630 0:207 /null /proc/latency_stats rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64
1460 1630 0:207 /null /proc/timer_list rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64
1461 1630 0:207 /null /proc/sched_debug rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64
1462 1630 0:213 / /proc/scsi ro,relatime - tmpfs tmpfs ro,inode64
1463 1633 0:214 / /sys/firmware ro,relatime - tmpfs tmpfs ro,inode64

But maybe it's possible to use the Docker API, to which we already have access anyway, to determine the current container id. Something along the lines of doing a ContainerList, then using the container's own hostname and ip addresses to filter the results. Finally, use ContainerArchiveInfo on each remaining candidate, with a randomly named empty file generated in one of the known writable volumes (/config or /data), to pick the correct one.
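
A rough sketch of that idea, assuming the Docker Go SDK (github.com/docker/docker/client) as it was at the time; the function name and probe-file approach are illustrative only, not an existing API of this plugin. ContainerStatPath is the Go client's counterpart of the ContainerArchiveInfo HTTP endpoint:

package main

import (
    "context"
    "fmt"
    "os"
    "path/filepath"

    "github.com/docker/docker/api/types"
    "github.com/docker/docker/client"
)

// findSelfContainerID lists running containers, narrows them down by hostname,
// then drops a uniquely named probe file into a writable volume (e.g. /config)
// and checks which candidate can see it at the same path.
func findSelfContainerID(ctx context.Context, cli *client.Client, probeDir string) (string, error) {
    hostname, err := os.Hostname()
    if err != nil {
        return "", err
    }

    probe, err := os.CreateTemp(probeDir, "self-probe-")
    if err != nil {
        return "", err
    }
    probe.Close()
    defer os.Remove(probe.Name())
    probePath := filepath.Join(probeDir, filepath.Base(probe.Name()))

    containers, err := cli.ContainerList(ctx, types.ContainerListOptions{})
    if err != nil {
        return "", err
    }

    for _, c := range containers {
        inspect, err := cli.ContainerInspect(ctx, c.ID)
        if err != nil || inspect.Config == nil || inspect.Config.Hostname != hostname {
            continue
        }
        // ContainerStatPath maps to the HEAD /containers/{id}/archive endpoint
        // ("ContainerArchiveInfo" in the HTTP API reference). Only candidates
        // that share our hostname get probed, which limits false positives
        // when the volume is mounted into several containers.
        if _, err := cli.ContainerStatPath(ctx, c.ID, probePath); err == nil {
            return c.ID, nil
        }
    }
    return "", fmt.Errorf("could not identify the current container")
}

func main() {
    cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    id, err := findSelfContainerID(context.Background(), cli, "/config")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println(id)
}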

@HeavenVolkoff Thanks for the great investigation so far.

I just had a look at the code again.
The container id retrieval is part of the logic that determines which networks caddy and the target container have in common.
Thankfully, that code is already quite isolated, and you can skip it entirely by specifying the ingress networks yourself.
You can do that with the env variable CADDY_INGRESS_NETWORKS or the arg ingress-networks.

-ingress-networks string
        Comma separated name of ingress networks connecting caddy servers to containers.
        When not defined, networks attached to controller container are considered ingress networks
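
For example, with the docker CLI directly (the network name caddy below is just a placeholder for your own ingress network):

$> docker run -d \
       --network caddy \
       -e CADDY_INGRESS_NETWORKS=caddy \
       -v /var/run/docker.sock:/var/run/docker.sock \
       lucaslorentz/caddy-docker-proxy:2.3-alpine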

Can you please give it a try? It would be nice to know that caddy-proxy is already usable with cgroup2.

If we find a reliable way of retrieving the container ID we can update that fallback logic as well.

Can you please give it a try? It would be nice to know that caddy-proxy is already usable with cgroup2.

@lucaslorentz As far as I tested yesterday and today, the core functionality of the plugin seems to be working fine under Docker with cgroups v2. Only the container id extraction logic and the related features are not working correctly.

You can do that with the env variable CADDY_INGRESS_NETWORKS or the arg ingress-networks.

CADDY_INGRESS_NETWORKS works and allows the functionality related to ingress network identification, including {{upstreams}}, to function correctly. However, the usability is not great on Docker Swarm, which is what I am running, due to its default behavior of prefixing all assets created during a stack deploy with the stack name. As such, the network name needed for CADDY_INGRESS_NETWORKS differs from what is defined in the compose file.
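
To illustrate with made-up names: deploying a compose file that defines a network named caddy via

$> docker stack deploy -c docker-compose.yml mystack

creates that network as mystack_caddy, so CADDY_INGRESS_NETWORKS has to be set to mystack_caddy rather than the name written in the compose file.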

Maybe it would be interesting to add a small warning to the README for users of Docker 20.10+ with cgroups v2.

@HeavenVolkoff
Readme updated!

ulope commented

CADDY_INGRESS_NETWORKS certainly works, but as @HeavenVolkoff mentioned, it is not a proper solution in more complex cases where the network name either isn't known in advance or contains dynamic parts only knowable at runtime.