moby/moby

Unable to run systemd in docker with ro /sys/fs/cgroup after systemd 248 host upgrade

fthiery opened this issue · 37 comments


BUG REPORT INFORMATION

I used to run docker containers with systemd as CMD without having to expose /sys/fs/cgroup as rw; this worked until systemd 248 on the host. Now it fails with:

Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

I opened a related issue on the systemd github repo: systemd/systemd#19245

Workarounds

  • boot host with systemd.unified_cgroup_hierarchy=0
  • remove the ro flag from the docker run argument -v /sys/fs/cgroup:/sys/fs/cgroup:ro, but this contaminates the host cgroup hierarchy, causing e.g. docker top to get confused:
docker top debian-systemd
Error response from daemon: runc did not terminate successfully: container_linux.go:186: getting all container pids from cgroups caused: lstat /sys/fs/cgroup/system.slice/docker-817dfec3facbeb10c64d7b0fae478804b1177ae949e695e111b7c693569dd21a.scope: no such file or directory
: unknown

Steps to reproduce the issue:

Dockerfile:

FROM debian:buster-slim

ENV container docker
ENV LC_ALL C
ENV DEBIAN_FRONTEND noninteractive

USER root
WORKDIR /root

RUN set -x

RUN apt-get update -y \
    && apt-get install --no-install-recommends -y systemd \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
    && rm -f /var/run/nologin

RUN rm -f /lib/systemd/system/multi-user.target.wants/* \
    /etc/systemd/system/*.wants/* \
    /lib/systemd/system/local-fs.target.wants/* \
    /lib/systemd/system/sockets.target.wants/*udev* \
    /lib/systemd/system/sockets.target.wants/*initctl* \
    /lib/systemd/system/sysinit.target.wants/systemd-tmpfiles-setup* \
    /lib/systemd/system/systemd-update-utmp*

VOLUME [ "/sys/fs/cgroup" ]

CMD ["/lib/systemd/systemd"]

Expected behaviour

systemd 247 (247.4-2-arch)
+PAM +AUDIT -SELINUX -IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid
$ docker build -t debian-systemd .
$ docker run -t --tmpfs /run --tmpfs /run/lock --tmpfs /tmp -v /sys/fs/cgroup:/sys/fs/cgroup:ro debian-systemd
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Debian GNU/Linux 10 (buster)!

Set hostname to <bf431002c7c1>.
Couldn't move remaining userspace processes, ignoring: Input/output error
File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[  OK  ] Listening on Journal Socket.
...
[  OK  ] Reached target Graphical Interface.

Actual behaviour

Since systemd v248

$ /lib/systemd/systemd --version
systemd 248 (248-3-arch)
+PAM +AUDIT -SELINUX -APPARMOR -IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified

$ docker build -t debian-systemd .
$ docker run -t --tmpfs /run --tmpfs /run/lock --tmpfs /tmp -v /sys/fs/cgroup:/sys/fs/cgroup:ro debian-systemd
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Debian GNU/Linux 10 (buster)!

Set hostname to <fbb4fc19cb95>.
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

Output of docker version:

$ docker version
Client:
 Version:           20.10.5
 API version:       1.41
 Go version:        go1.16
 Git commit:        55c4c88966
 Built:             Wed Mar  3 16:51:54 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.5
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16
  Git commit:       363e9a88a1
  Built:            Wed Mar  3 16:51:28 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e.m
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Output of docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-tp-docker)

Server:
 Containers: 10
  Running: 1
  Paused: 0
  Stopped: 9
 Images: 61
 Server Version: 20.10.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e.m
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.11.11-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 7.712GiB
 Name: homepc
 ID: 67YO:62DZ:3NIF:TZT3:HTXP:BU6I:YBR3:XETA:7YCB:YGNN:MV6Q:QYN4
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://mirror.gcr.io/
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

x86_64 Intel hw, Arch Linux 5.11.11-arch1-1

Same here
It was working with 247

Is there already a fix for this?

Is there already a fix for this?

For reference, it is possible with namespace isolation. https://docs.docker.com/engine/security/userns-remap/
Or simply install podman.

remove the ro flag from the docker run argument -v /sys/fs/cgroup:/sys/fs/cgroup:ro

It didn't help. I'm running Ubuntu 21.10 (Impish Indri).

For reference, it is possible with namespace isolation.

@skast96, it didn't help either. I edited /etc/docker/daemon.json:

{"userns-remap": "default"}

Restarted docker. The dockremap user was created, as were the entries in /etc/sub{uid,gid}. The /var/lib/docker/100000.100000 dir was created. docker image ls produced no output. Then:

$ docker run -it --tmpfs /tmp --tmpfs /run --tmpfs /run/lock -v /sys/fs/cgroup:/sys/fs/cgroup jrei/systemd-ubuntu
systemd 245.4-4ubuntu3.16 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Ubuntu 20.04.4 LTS!

Set hostname to <1bdd4443336d>.
Failed to create /init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

So the only workaround is supposedly to switch to the cgroup v1 mode (systemd.unified_cgroup_hierarchy=0):

  • /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="systemd.unified_cgroup_hierarchy=0"
  • update-grub
  • reboot

Update: --cgroupns=host + -v /sys/fs/cgroup:/sys/fs/cgroup (without :ro) also works, e.g.:

$ docker run -it --cgroupns=host --tmpfs /tmp --tmpfs /run --tmpfs /run/lock \
    -v /sys/fs/cgroup:/sys/fs/cgroup jrei/systemd-ubuntu
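
After the reboot, a quick way to confirm the daemon is back on cgroup v1 (a hedged check; the format string assumes a Docker version recent enough to report CgroupVersion):

$ docker info --format 'cgroup v{{.CgroupVersion}}'
cgroup v1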

Under cgroups v2 the default for --cgroupns switches from 'host' to 'private'. When passing in the entirety of the host's /sys/fs/cgroup explicitly, it's completely expected for this to fail in combination with the container runtime trying to create a private cgroup namespace inside it, as the cgroup path inside the container won't match up with the cgroup namespace...

As you noted, passing --cgroupns=host can make this work. However, passing the host's /sys/fs/cgroup into the container as rw seems very inadvisable (might as well just use --privileged?), and the solution in https://serverfault.com/questions/1053187/systemd-fails-to-run-in-a-docker-container-when-using-cgroupv2-cgroupns-priva/1054414#1054414 involving creating a systemd slice on the host seems more suitable (although I haven't personally tried getting it to work).
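
A rough sketch of that slice-based approach (the unit name is illustrative and the daemon is assumed to use the systemd cgroup driver; this follows the general shape of the serverfault answer rather than a verified recipe):

# On the host: define a slice for containers that run systemd as PID 1
cat >/etc/systemd/system/docker-systemd.slice <<'EOF'
[Unit]
Description=Slice for containers running systemd
EOF
systemctl daemon-reload

# Start the container parented under that slice
docker run -t --cgroupns=private --cgroup-parent=docker-systemd.slice \
    --tmpfs /run --tmpfs /run/lock --tmpfs /tmp debian-systemd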

Aside from workarounds, it would be good to know the Docker community's official advice on the matter (fundamentally, we just want the container's /sys/fs/cgroup to be writable in a non-privileged container).

Related issue: #42040

@x-yuri the docker approach is not working that great, tbh. It does work with namespace isolation when creating an extra slice for docker and adding that slice to the docker run command like so:

docker run -it \
    --cgroup-parent=docker.slice \
    --cgroupns private \
    --tmpfs /tmp \
    --tmpfs /run \
    --tmpfs /run/lock \
    mySystemdImage:latest 

That kinda worked for me. However, our other containers stopped working with namespace isolation because they were not configured for it. That meant too much work just to run one container with systemd.

So I suggest you just install podman. I experienced no drawbacks on my Arch Linux machine with both docker and podman installed. Even the commands are the same. You would start your systemd container with podman like this:

podman run -it mySystemdImage:latest 

Actually for now I'm planning to employ the hybrid/legacy systemd mode (cgroup v1), which seems tolerable in my case. But podman sounds like an interesting option (haven't tried it).

@x-yuri sounds like a plan. My reason for not using v1 is that I needed cgroups v2 to work.

I have discovered two additional workarounds for this issue that effectively retain all features of unified cgroupv2 while maintaining security: no need for the --privileged flag and no access to the root of the cgroupv2 hierarchy:

  1. Use the --cgroupns host Docker option and a cgroupv2 sub-hierarchy volume binding for the container. Here is an example command:
# docker run --rm --name freeipa -it --read-only --security-opt seccomp=unconfined --hostname freeipa.corp --init=false --cgroupns host -v /sys/fs/cgroup/freeipa.scope:/sys/fs/cgroup:rw freeipa/freeipa-server:almalinux-9
systemd 252-13.el9_2 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS -FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization container-other.
Detected architecture x86-64.
Initializing machine ID from random generator.
Queued start job for default target Minimal target for containerized FreeIPA server.
[..]

Not perfect, next option is better IMO.

  2. Mount /sys/fs/cgroup on the host without the nsdelegate mount option. Although there isn't an explicit option to disable nsdelegate (the way nodiscard disables discard; see link 1, link 2 for more information), there is a workaround: simply run any container with Docker using the --cgroupns host option and without any cgroup volume bindings. For example:
# grep cgroup /proc/mounts 
cgroup2 /sys/fs/cgroup cgroup2 rw,seclabel,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
# docker run --rm --cgroupns host ubuntu:latest echo done
done
# grep cgroup /proc/mounts 
cgroup2 /sys/fs/cgroup cgroup2 rw,seclabel,nosuid,nodev,noexec,relatime 0 0

After implementing these steps, you can run a container with Docker using the --cgroupns private flag and a volume binding of a cgroupv2 sub-hierarchy. For example:

# docker run --rm --name freeipa -it --read-only --security-opt seccomp=unconfined --hostname freeipa.corp --init=false --cgroupns private -v /sys/fs/cgroup/freeipa.scope:/sys/fs/cgroup:rw freeipa/freeipa-server:almalinux-9
systemd 252-13.el9_2 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS -FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization container-other.
Detected architecture x86-64.
Initializing machine ID from random generator.
Queued start job for default target Minimal target for containerized FreeIPA server.
[..]

Please note that the information provided above applies specifically to CentOS Stream release 9 with kernel-ml-6.3.7-1.el9.elrepo, systemd-252.4-598.13.hs.el9 (Hyperscale SIG) and docker-ce-24.0.2-1 (systemd cgroup driver), although it may help in a wide range of other scenarios.
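
For what it's worth, the same effect can presumably be obtained with a direct remount from the host's init namespace, since nsdelegate can only be changed by a remount from there (my own assumption, not verified in this thread):

# Remount cgroup2 without the nsdelegate option (run as root on the host),
# then confirm the flag is gone
mount -o remount /sys/fs/cgroup
grep cgroup2 /proc/mounts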

aki-k commented
  1. Use the --cgroupns host Docker option and a cgroupv2 sub-hierarchy volume binding for the container. Here is an example command:

docker run --rm --name freeipa -it --read-only --security-opt seccomp=unconfined --hostname freeipa.corp --init=false --cgroupns host -v /sys/fs/cgroup/freeipa.scope:/sys/fs/cgroup:rw freeipa/freeipa-server:almalinux-9

Thanks. This helped me too in starting a docker container with systemd inside (Fedora 37 host with cgroupv2).

I needed to add to daemon.json (and create the dockuser user on the docker host):

{
    "userns-remap": "dockuser"
}
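
Creating that user also needs subordinate ID ranges in /etc/subuid and /etc/subgid; a sketch of what I mean (the exact ID range is an assumption):

# Create the remap user and give it a subordinate UID/GID range
useradd --system --no-create-home --shell /usr/sbin/nologin dockuser
echo 'dockuser:100000:65536' >> /etc/subuid
echo 'dockuser:100000:65536' >> /etc/subgid
systemctl restart docker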

I left out some of the options you used though:

docker run \
-it \
--rm \
--name ubuntu_systemd_local \
--tmpfs /tmp \
--tmpfs /run \
--tmpfs /run/lock \
--cgroupns private \
ubuntu_systemd:local

I used this Dockerfile (I created encrypted_password with mkpasswd -m sha512crypt 'password'):

FROM ubuntu:22.04
ENV DEBIAN_FRONTEND noninteractive
RUN yes | unminimize && \
echo 'root:_encrypted_password_' | chpasswd -e && \
sed -i -e 's/archive.ubuntu/fi.archive.ubuntu/g' /etc/apt/sources.list && \
apt-get -y update && \
apt-get -y install apt-utils && \
apt-get -y install dialog && \
apt-get -y install iputils-ping bind9-host iproute2 netcat-openbsd && \
apt-get -y install systemd dbus dbus-user-session dbus-x11 dconf-cli && \
apt-get -y install vim less nmon glances iptraf-ng \
cifs-utils elinks elinks-data \
irssi lftp mc mc-data unrar nmap ctorrent iotop powertop \
w3m radvd caca-utils httpie jq firejail curl nmap stress-ng \
cksfv mtr htop smem gddrescue oidentd ntpdate sysfsutils \
cpulimit expect stress-ng pavucontrol rtorrent screen telnet \
cabextract youtube-dl sshuttle emacs nethogs alien \
exfatprogs p7zip mosh keepassxc virt-what fdisk && \
curl -s https://packagecloud.io/install/repositories/ookla/speedtest-cli/script.deb.sh | bash && \
apt-get -y install speedtest
STOPSIGNAL SIGRTMIN+3
CMD [ "/sbin/init" ]

According to the docs one can also use "userns-remap": "default" in daemon.json to let the Docker daemon handle the user/group creation.

Nevertheless, it didn't work for me:

$ docker run --name ubuntu_systemd_local -it --rm --tmpfs /run --tmpfs /run/lock --tmpfs /tmp --cgroupns private ubuntu_systemd:local
Failed to mount cgroup at /sys/fs/cgroup/systemd: Permission denied
[!!!!!!] Failed to mount API filesystems.
Exiting PID 1...
aki-k commented

@darkdragon-001 I'm not sure how our setups differ. I just tried that again on Fedora 37/systemd 251.14/docker-ce 24.0.5, and the container starts up and runs systemd. I don't see a /sys/fs/cgroup/systemd mount on the host or in the container, but I don't get that error message either.

$ docker run \
-it \
--rm \
--name ubuntu_systemd_local \
--tmpfs /tmp \
--tmpfs /run \
--tmpfs /run/lock \
--cgroupns private \
ubuntu_systemd:local
systemd 249.11-0ubuntu3.9 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK 
+PCRE2 -PWQUALITY -P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization docker.                                                                          
Detected architecture x86-64.                                                                            
                                                                                                         
Welcome to Ubuntu 22.04.3 LTS!                                                                           

Queued start job for default target Graphical Interface.                                                 
[  OK  ] Created slice Slice /system/getty.                                                              
[  OK  ] Created slice …e /system/modprobe.                                                              
[  OK  ] Created slice … and Session Slice.                                                              
[  OK  ] Started Dispat…le Directory Watch.                                                              
[  OK  ] Started Forwar…ll Directory Watch.

etc.

I ran these commands in the ubuntu_systemd_local container:

# findmnt | grep cgroup
| `-/sys/fs/cgroup      cgroup      cgroup2  rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot
# findmnt | grep systemd
(empty output)

My docker host (Fedora 37) /proc/cmdline is

BOOT_IMAGE=(hd0,msdos1)/fedora_boot/vmlinuz-6.4.12-100.fc37.x86_64 root=UUID=3d47da0b-7a0a-4852-b3d7-68707b78c669 ro rootflags=subvol=fedora_root resume=UUID=1bed99c8-b9d6-43ee-8f4e-3bcf23dfe693 net.ifnames=0
aki-k commented

@darkdragon-001 I see a similar error message here for lxc:

lxc/lxc#4072

Considering that podman supports running systemd inside containers flawlessly, I think it's time for docker to provide a definitive solution to this problem, which has been around for years.

The problematic part is related to cgroup-v2 hosts. The only reliable solution I found comes from:
mviereck/x11docker#349 (comment)
But it requires starting the container and then entering its mount namespace to remount /sys/fs/cgroup in rw mode. It's also tricky to implement that in k8s pod manifests.

It seems that, to support cgroup-v2 hosts, the only missing piece is mounting /sys/fs/cgroup in rw mode, which perhaps should be allowed as an option for unprivileged containers as well.
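
A hedged sketch of that enter-the-mount-namespace trick (the container name is illustrative; it mirrors the x11docker comment rather than anything official):

# Run as root on the host once the container is up
pid=$(docker inspect --format '{{.State.Pid}}' my-systemd-container)
nsenter --target "$pid" --mount -- mount -o remount,rw /sys/fs/cgroup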

aki-k commented

@marco-a-itl This config works on Fedora 37/systemd 251.14/docker-ce 24.0.5:

#42275 (comment)

@marco-a-itl This config works on Fedora 37/systemd 251.14/docker-ce 24.0.5:

#42275 (comment)

But it requires changes to the docker daemon configuration, which isn't always possible. For example, in k8s, how would you implement such a config?

aki-k commented

@marco-a-itl Well, you're commenting in the moby (docker) repository. We all know k8s doesn't care about docker, except for the docker images.

Yes, maybe this is not the best place to discuss this issue in a general form.
However, k8s commonly uses containerd. Where exactly is the mount of /sys/fs/cgroup handled?

aki-k commented

@marco-a-itl I just built the Ubuntu docker container with systemd that I mentioned in #42275 (comment). I can run any commands you'd like me to run in it to see that info.

Thank you, but actually I meant that somewhere in the docker engine stack there is some code that always mounts the /sys/fs/cgroup directory as ro. I'm no expert, so I'm not sure whether this happens in containerd or in other parts.

And it seems to me that this is what prevents systemd from starting inside an unprivileged container that was configured with --cgroupns private (i.e. the default and most secure option) on a cgroup-v2 host.

aki-k commented

@marco-a-itl

And it seems to me that this is what prevents systemd from starting inside an unprivileged container that was configured with --cgroupns private

I don't use the --privileged option in this setup.

aki-k commented

@marco-a-itl

This is the mount shown in the container:

# findmnt | grep cgroup
│ └─/sys/fs/cgroup      cgroup      cgroup2  rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot

@marco-a-itl

This is the mount shown in the container:

# findmnt | grep cgroup
│ └─/sys/fs/cgroup      cgroup      cgroup2  rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot

This is ok (the mode is rw). However, I assume that you obtained this result with userns-remapping.

I think it should be possible to have the same result without that daemon option, with the proper modifications to the docker engine, like podman does.

aki-k commented

@marco-a-itl

However I assume that you obtained this result with userns-remapping.

That's correct

For everyone lurking around:

As the discussion seems to continue and people are unable to find it, a workaround (even without the use of --privileged) is already mentioned ->HERE<-

Spent some time looking at this today, trying to run a systemd container under rootless Docker. The docker daemon is running under a --user systemd unit. I am on WSL with cgroups v2 enabled.

docker run -it --rm --tmpfs /tmp --tmpfs /run registry.access.redhat.com/ubi8/ubi-init:8.8
Failed to create /init.scope control group: Read-only file system
Docker creates the cgroup mount in the container, mounted read-only.

docker run -it --rm --tmpfs /tmp --tmpfs /run -v /sys/fs/cgroup:/sys/fs/cgroup registry.access.redhat.com/ubi8/ubi-init:8.8
Failed to create /init.scope control group: Permission denied
The "fake root" inside the container doesn't have permission to modify the cgroup that I mounted.

docker run -it --rm --tmpfs /tmp --tmpfs /run --cgroupns=host registry.access.redhat.com/ubi8/ubi-init:8.8
Failed to create /user.slice/user-1000.slice/user@1000.service/user.slice/..../init.scope control group: Read-only file system
Docker mounts the host cgroup namespace and systemd tries to create a cgroup at the appropriate level, but docker still mounted the filesystem read-only.

docker run -it --rm --tmpfs /tmp --tmpfs /run -v /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service:/sys/fs/cgroup/user.slice/user@1000.service --cgroupns=host registry.access.redhat.com/ubi8/ubi-init:8.8
This works. It correctly creates the cgroup under the docker slice under the user slice. (You can mount the whole cgroup hierarchy rw, but it wasn't necessary.)

docker run -it --rm --tmpfs /tmp --tmpfs /run -v /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service:/sys/fs/cgroup/user.slice/user@1000.service registry.access.redhat.com/ubi8/ubi-init:8.8
error mounting "/sys/fs...": read-only file system
I can't mount that folder rw into the container with private cgroupns mode, presumably because docker set up the fake mount read-only.

I cannot find any documentation at all of what the expected behavior of cgroupns=private is supposed to be. Should it be a transparent mapping to a parent context? If so, it should probably be mounted rw rather than ro. Also, the systemd docs https://systemd.io/CONTAINER_INTERFACE/ seem to imply that's not a best practice anyway.

It seems to me that the best approach for my situation is to just set the default cgroupns back to 'host' to get this working properly.
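
That default can be set in the daemon configuration; for rootless Docker the file lives in the user's home (a sketch, assuming ~/.config/docker/daemon.json and the rootless user-level service name):

{
  "default-cgroupns-mode": "host"
}

Then restart the rootless daemon, e.g. systemctl --user restart docker.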

I cannot find any documentation at all of what the expected behavior of cgroupns=private is supposed to be. Should it be a transparent mapping to a parent context? If so, it should probably be mounted rw rather than ro. Also, the systemd docs https://systemd.io/CONTAINER_INTERFACE/ seem to imply that's not a best practice anyway.

--cgroupns=private means that the container runtime will create a cgroup namespace for the container (podman's docs are more explicit about this).

You may find my blog post informative: https://lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/

I also have some tests that exercise different container setup modes for running systemd: https://github.com/LewisGaul/systemd-containers

lubo commented

We need something like what's described in containers/podman#14322 (reply in thread) (--security-opt unmask=/sys/fs/cgroup).
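
For comparison, a rough sketch of what that looks like with podman today (the image name is illustrative; docker currently has no equivalent option):

podman run -it --security-opt unmask=/sys/fs/cgroup \
    --tmpfs /run --tmpfs /run/lock --tmpfs /tmp mySystemdImage:latest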

I've just tested it; it seems to work flawlessly with:
docker run --rm -it -v /sys/fs/cgroup/warewulf.scope:/sys/fs/cgroup:rw --tmpfs /run --tmpfs /run/lock warewulf-1:latest /sbin/init

docker version: 26.1.4
Host runs: Arch, Kernel 6.9.4, systemd 255.7
Container runs: Debian 12.5 with systemd 252.22
Also works with a container running Rocky Linux 9.3 with systemd 252.32

Note that if the host is running the older cgroup v1, /sys/fs/cgroup on the host is a tmpfs mounted ro, and as such a lot of the solutions here won't work.

I'm having the same problem with Docker Desktop on macOS (M1), and I'm wondering if anyone has a workaround already?

Works ideally:

docker run --rm --cgroupns=private --name freeipa-server-almalinux9 -ti \
    -h ipa.hwdomain.lan --read-only --sysctl net.ipv6.conf.all.disable_ipv6=0 \
    -v /sys/fs/cgroup/warewulf.scope:/sys/fs/cgroup/warewulf.scope:ro \
    -v ~/freeipa-data:/data:Z freeipa-almalinux9

I just faced the same issue and running the container with sysbox-runc runtime helped. With docker run -it --rm --runtime=sysbox-runc my-image the container started.
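
For reference, sysbox is registered as an extra runtime in /etc/docker/daemon.json; a minimal sketch (the binary path is an assumption, and the sysbox package normally configures this for you):

{
  "runtimes": {
    "sysbox-runc": {
      "path": "/usr/bin/sysbox-runc"
    }
  }
}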

I've just tested it, it seems to work flawlessly with docker run --rm -it -v /sys/fs/cgroup/warewulf.scope:/sys/fs/cgroup:rw --tmpfs /run --tmpfs /run/lock warewulf-1:latest /sbin/init

docker version: 26.1.4
Host runs: Arch, Kernel 6.9.4, systemd 255.7
Container runs: Debian 12.5 with systemd 252.22
Also works with a container running Rocky Linux 9.3 with systemd 252.32

How did you create /sys/fs/cgroup/warewulf.scope?

How did you create /sys/fs/cgroup/warewulf.scope?

The scope is an ordinary directory, so it can be created by Docker itself during the volume mount, but this does not work for me: you have a scope, but systemd is not running within the mounted cgroup.
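
One way to get a scope the host systemd actually tracks is to create a transient, delegated unit (a sketch under that assumption; the unit name is illustrative, and the path ends up under the invoking slice rather than directly under /sys/fs/cgroup):

# On the host, as root: create a delegated transient scope and keep it alive
systemd-run --scope --unit=warewulf -p Delegate=yes sleep infinity &

# The delegated sub-hierarchy appears under the parent slice, e.g.:
ls -d /sys/fs/cgroup/system.slice/warewulf.scope

# ...and that is the path to bind into the container:
docker run --rm -it -v /sys/fs/cgroup/system.slice/warewulf.scope:/sys/fs/cgroup:rw \
    --tmpfs /run --tmpfs /run/lock warewulf-1:latest /sbin/init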