Unable to run systemd in docker with ro /sys/fs/cgroup after systemd 248 host upgrade
fthiery opened this issue · 37 comments
BUG REPORT INFORMATION
I used to run docker containers with systemd as CMD without having to expose /sys/fs/cgroup as rw; this worked until systemd 248 on the host. Now it fails with
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
I opened a related issue on the systemd github repo: systemd/systemd#19245
Workarounds
- boot host with systemd.unified_cgroup_hierarchy=0
- remove ro flag from docker run arg -v /sys/fs/cgroup:/sys/fs/cgroup:ro but this contaminates the host cgroup, causing e.g. docker top to get confused:
docker top debian-systemd
Error response from daemon: runc did not terminate successfully: container_linux.go:186: getting all container pids from cgroups caused: lstat /sys/fs/cgroup/system.slice/docker-817dfec3facbeb10c64d7b0fae478804b1177ae949e695e111b7c693569dd21a.scope: no such file or directory
: unknown
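Before choosing between these workarounds, it can help to confirm which cgroup layout the host is actually using. The following is a minimal sketch rather than anything from the original report: `cgroup_mode` is a hypothetical helper name, and it assumes the usual layout where a unified (cgroup v2) host mounts a `cgroup2` filesystem directly at /sys/fs/cgroup, while legacy and hybrid hosts mount `cgroup` (v1) hierarchies below it.

```shell
# Sketch: classify the cgroup setup described by a mounts table.
# cgroup_mode is a hypothetical helper; it reads /proc/mounts unless a file is given.
cgroup_mode() {
    # In a mounts table, field 2 is the mount point and field 3 the filesystem type.
    awk '
        $2 == "/sys/fs/cgroup" && $3 == "cgroup2" { unified = 1 }
        $3 == "cgroup"                             { v1 = 1 }
        END {
            if (unified)  print "unified"
            else if (v1)  print "legacy-or-hybrid"
            else          print "unknown"
        }
    ' "${1:-/proc/mounts}"
}
```

Called without an argument it reads /proc/mounts; a saved mounts file can be passed explicitly instead.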
Steps to reproduce the issue:
Dockerfile:
FROM debian:buster-slim
ENV container docker
ENV LC_ALL C
ENV DEBIAN_FRONTEND noninteractive
USER root
WORKDIR /root
RUN set -x
RUN apt-get update -y \
&& apt-get install --no-install-recommends -y systemd \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
&& rm -f /var/run/nologin
RUN rm -f /lib/systemd/system/multi-user.target.wants/* \
/etc/systemd/system/*.wants/* \
/lib/systemd/system/local-fs.target.wants/* \
/lib/systemd/system/sockets.target.wants/*udev* \
/lib/systemd/system/sockets.target.wants/*initctl* \
/lib/systemd/system/sysinit.target.wants/systemd-tmpfiles-setup* \
/lib/systemd/system/systemd-update-utmp*
VOLUME [ "/sys/fs/cgroup" ]
CMD ["/lib/systemd/systemd"]
Expected behaviour
$ /lib/systemd/systemd --version
systemd 247 (247.4-2-arch)
+PAM +AUDIT -SELINUX -IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid
$ docker build -t debian-systemd .
$ docker run -t --tmpfs /run --tmpfs /run/lock --tmpfs /tmp -v /sys/fs/cgroup:/sys/fs/cgroup:ro debian-systemd
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.
Welcome to Debian GNU/Linux 10 (buster)!
Set hostname to <bf431002c7c1>.
Couldn't move remaining userspace processes, ignoring: Input/output error
File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[ OK ] Listening on Journal Socket.
...
[ OK ] Reached target Graphical Interface.
Actual behaviour
Since systemd v248
$ /lib/systemd/systemd --version
systemd 248 (248-3-arch)
+PAM +AUDIT -SELINUX -APPARMOR -IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified
$ docker build -t debian-systemd .
$ docker run -t --tmpfs /run --tmpfs /run/lock --tmpfs /tmp -v /sys/fs/cgroup:/sys/fs/cgroup:ro debian-systemd
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.
Welcome to Debian GNU/Linux 10 (buster)!
Set hostname to <fbb4fc19cb95>.
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
Output of docker version:
$ docker version
Client:
Version: 20.10.5
API version: 1.41
Go version: go1.16
Git commit: 55c4c88966
Built: Wed Mar 3 16:51:54 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server:
Engine:
Version: 20.10.5
API version: 1.41 (minimum version 1.12)
Go version: go1.16
Git commit: 363e9a88a1
Built: Wed Mar 3 16:51:28 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.4.4
GitCommit: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e.m
runc:
Version: 1.0.0-rc93
GitCommit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Output of docker info:
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Build with BuildKit (Docker Inc., v0.5.1-tp-docker)
Server:
Containers: 10
Running: 1
Paused: 0
Stopped: 9
Images: 61
Server Version: 20.10.5
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e.m
runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
init version: de40ad0
Security Options:
seccomp
Profile: default
Kernel Version: 5.11.11-arch1-1
Operating System: Arch Linux
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 7.712GiB
Name: homepc
ID: 67YO:62DZ:3NIF:TZT3:HTXP:BU6I:YBR3:XETA:7YCB:YGNN:MV6Q:QYN4
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
https://mirror.gcr.io/
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.):
x86_64 Intel hw, Arch Linux 5.11.11-arch1-1
Same here
It was working with 247
Is there already a fix for this?
Is there already a fix for this?
For reference, it is possible with namespace isolation. https://docs.docker.com/engine/security/userns-remap/
Or simply install podman.
remove ro flag from docker run arg -v /sys/fs/cgroup:/sys/fs/cgroup:ro
It didn't help. I'm running Ubuntu 21.10 (Impish Indri).
For reference, it is possible with namespace isolation.
@skast96, it didn't help either. I edited /etc/docker/daemon.json:
{"userns-remap": "default"}
Restarted docker. The dockremap user was created, as were the entries in /etc/sub{uid,gid}. The /var/lib/docker/100000.100000 dir was created. docker image ls produced no output. Then:
$ docker run -it --tmpfs /tmp --tmpfs /run --tmpfs /run/lock -v /sys/fs/cgroup:/sys/fs/cgroup jrei/systemd-ubuntu
systemd 245.4-4ubuntu3.16 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.
Welcome to Ubuntu 20.04.4 LTS!
Set hostname to <1bdd4443336d>.
Failed to create /init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
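When userns-remap is enabled, the remapped user needs subordinate ID ranges, which is why the /etc/sub{uid,gid} entries are mentioned above. A minimal sketch for checking that such an entry exists follows; `has_subid_entry` is a hypothetical helper name, and it assumes the `name:start:count` line format used by /etc/subuid and /etc/subgid.

```shell
# Sketch: check that a user has a subordinate ID range in a subuid/subgid file.
# has_subid_entry is a hypothetical helper.
#   $1: user name (e.g. dockremap)
#   $2: file to inspect (e.g. /etc/subuid or /etc/subgid)
has_subid_entry() {
    # Each line has the form name:first_id:count.
    grep -q "^$1:[0-9][0-9]*:[0-9][0-9]*\$" "$2"
}

# Example call (not executed here):
#   has_subid_entry dockremap /etc/subuid && has_subid_entry dockremap /etc/subgid
```

A passing check only shows that the mapping is configured; as the output above shows, it does not by itself make the cgroup mount writable.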
So the only workaround is supposedly to switch to the cgroup v1 mode (systemd.unified_cgroup_hierarchy=0):
- /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="systemd.unified_cgroup_hierarchy=0"
- update-grub
- reboot
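After the reboot, one way to confirm that the kernel was really booted with the parameter is to look at the kernel command line. This is a minimal sketch, assuming the parameter appears verbatim as a space-separated token in /proc/cmdline; `has_cgroup_v1_param` is a hypothetical helper name.

```shell
# Sketch: succeed if the kernel command line requests the non-unified cgroup layout.
# has_cgroup_v1_param is a hypothetical helper; it reads /proc/cmdline unless a file is given.
has_cgroup_v1_param() {
    # Put each parameter on its own line, then look for an exact, literal match.
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -qxF 'systemd.unified_cgroup_hierarchy=0'
}

# Example call (not executed here):
#   has_cgroup_v1_param && echo "booted with cgroup v1 / hybrid requested"
```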
UPD And --cgroupns=host + -v /sys/fs/cgroup:/sys/fs/cgroup (w/o :ro), e.g.:
$ docker run -it --cgroupns=host --tmpfs /tmp --tmpfs /run --tmpfs /run/lock \
-v /sys/fs/cgroup:/sys/fs/cgroup jrei/systemd-ubuntu
Under cgroups v2 the default for --cgroupns switches from 'host' to 'private'. When passing in the entirety of the host's /sys/fs/cgroup explicitly then it's completely expected for this to fail in combination with the container runtime trying to create a private cgroup namespace inside it, as the cgroup path inside the container won't match up with the cgroup namespace...
As you noted, passing --cgroupns=host can make this work. However, passing the host's /sys/fs/cgroup into the container as rw seems very unadvisable (might as well just use --privileged?), and the solution in https://serverfault.com/questions/1053187/systemd-fails-to-run-in-a-docker-container-when-using-cgroupv2-cgroupns-priva/1054414#1054414 involving creating a systemd slice on the host seems more suitable (although I haven't tried getting it to work personally).
Aside from workarounds, it would be good to know what the Docker community's official advice on the matter is (where fundamentally we just want the container's /sys/fs/cgroup to be writable in a non-privileged container).
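The path mismatch described above can be inspected from /proc/self/cgroup, which reports the cgroup path a process sees relative to its cgroup namespace root. A minimal sketch, assuming a cgroup v2 host where the file holds a single `0::<path>` entry; `cgroup_v2_path` is a hypothetical helper name.

```shell
# Sketch: print the cgroup v2 path of a process as seen from its own cgroup namespace.
# cgroup_v2_path is a hypothetical helper; it reads /proc/self/cgroup unless a file is given.
cgroup_v2_path() {
    # cgroup v2 entries use hierarchy ID 0 and an empty controller list: "0::<path>".
    sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
}

# Example call inside a container (not executed here):
#   cgroup_v2_path
```

Inside a private cgroup namespace this is expected to print `/` for a process at the namespace root, whereas with --cgroupns=host it should print the full host-side path, which is the path systemd then tries to use below /sys/fs/cgroup.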
@x-yuri the docker approach is not working that great tbh. It is working with namespace isolation when creating an extra slice for docker and adding this slice to the docker run command like so:
docker run -it \
--cgroup-parent=docker.slice \
--cgroupns private \
--tmpfs /tmp \
--tmpfs /run \
--tmpfs /run/lock \
mySystemdImage:latest
That kinda worked for me. However, our other containers stopped working with namespace isolation because they were not configured for that. That meant too much work in order to run one container with systemd.
So I suggest you just install podman. I experienced no drawbacks on my Arch Linux when having both docker and podman installed. Even the commands are the same. You would start your systemd container with podman like this:
podman run -it mySystemdImage:latest
Actually for now I'm planning to employ the hybrid/legacy systemd mode (cgroup v1), which seems tolerable in my case. But podman sounds like an interesting option (haven't tried it).
@x-yuri sounds like a plan. My reason for not using v1 is that I needed cgroups v2 to work.
I have discovered two additional workarounds for this issue that effectively retain all features of unified cgroupv2 while maintaining security - no need for the --privileged flag and no access to the root of cgroupv2 hierarchy:
- Use the --cgroupns host Docker option and a cgroupv2 sub-hierarchy volume binding for the container. Here is an example command:
# docker run --rm --name freeipa -it --read-only --security-opt seccomp=unconfined --hostname freeipa.corp --init=false --cgroupns host -v /sys/fs/cgroup/freeipa.scope:/sys/fs/cgroup:rw freeipa/freeipa-server:almalinux-9
systemd 252-13.el9_2 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS -FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization container-other.
Detected architecture x86-64.
Initializing machine ID from random generator.
Queued start job for default target Minimal target for containerized FreeIPA server.
[..]
Not perfect, next option is better IMO.
- Mount /sys/fs/cgroup on the host without the nsdelegate mount option. Although there isn't an explicit option to disable nsdelegate like nodiscard for discard (see link 1, link 2 for more information), there is a workaround. Simply run any container using Docker with the --cgroupns host option and without any cgroup volume bindings. For example:
# grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,seclabel,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
# docker run --rm --cgroupns host ubuntu:latest echo done
done
# grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,seclabel,nosuid,nodev,noexec,relatime 0 0
After implementing these steps, you can run a container with Docker using --cgroupns private flag and volume binding of cgroupv2 sub-hierarchy. For example:
# docker run --rm --name freeipa -it --read-only --security-opt seccomp=unconfined --hostname freeipa.corp --init=false --cgroupns private -v /sys/fs/cgroup/freeipa.scope:/sys/fs/cgroup:rw freeipa/freeipa-server:almalinux-9
systemd 252-13.el9_2 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS -FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization container-other.
Detected architecture x86-64.
Initializing machine ID from random generator.
Queued start job for default target Minimal target for containerized FreeIPA server.
[..]
Please note that the information provided above applies specifically to CentOS Stream release 9 with kernel-ml-6.3.7-1.el9.elrepo, systemd-252.4-598.13.hs.el9 (Hyperscale SIG) and docker-ce-24.0.2-1 (systemd cgroup driver), although it may help with a wide range of different scenarios.
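The before/after comparison of /proc/mounts in the second workaround can be scripted. This is a minimal sketch, assuming the cgroup2 filesystem is mounted at /sys/fs/cgroup as in the output above; `cgroup2_has_nsdelegate` is a hypothetical helper name.

```shell
# Sketch: succeed if the cgroup2 mount at /sys/fs/cgroup carries the nsdelegate option.
# cgroup2_has_nsdelegate is a hypothetical helper; it reads /proc/mounts unless a file is given.
cgroup2_has_nsdelegate() {
    awk '
        $2 == "/sys/fs/cgroup" && $3 == "cgroup2" {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "nsdelegate") found = 1
        }
        END { exit !found }
    ' "${1:-/proc/mounts}"
}

# Example call (not executed here):
#   cgroup2_has_nsdelegate && echo "nsdelegate is still set"
```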
- Use the --cgroupns host Docker option and a cgroupv2 sub-hierarchy volume binding for the container. Here is an example command:
docker run --rm --name freeipa -it --read-only --security-opt seccomp=unconfined --hostname freeipa.corp --init=false --cgroupns host -v /sys/fs/cgroup/freeipa.scope:/sys/fs/cgroup:rw freeipa/freeipa-server:almalinux-9
Thanks. This helped me too in starting a docker container with systemd inside (Fedora 37 host with cgroupv2).
I needed to add to daemon.json (and create the dockuser user on the docker host):
{
"userns-remap": "dockuser"
}
I left out some of the options you used though:
docker run \
-it \
--rm \
--name ubuntu_systemd_local \
--tmpfs /tmp \
--tmpfs /run \
--tmpfs /run/lock \
--cgroupns private \
ubuntu_systemd:local
I used this Dockerfile: (I created encrypted_password with mkpasswd -m sha512crypt 'password')
FROM ubuntu:22.04
ENV DEBIAN_FRONTEND noninteractive
RUN yes | unminimize && \
echo 'root:_encrypted_password_' | chpasswd -e && \
sed -i -e 's/archive.ubuntu/fi.archive.ubuntu/g' /etc/apt/sources.list && \
apt-get -y update && \
apt-get -y install apt-utils && \
apt-get -y install dialog && \
apt-get -y install iputils-ping bind9-host iproute2 netcat-openbsd && \
apt-get -y install systemd dbus dbus-user-session dbus-x11 dconf-cli && \
apt-get -y install vim less nmon glances iptraf-ng \
cifs-utils elinks elinks-data \
irssi lftp mc mc-data unrar nmap ctorrent iotop powertop \
w3m radvd caca-utils httpie jq firejail curl nmap stress-ng \
cksfv mtr htop smem gddrescue oidentd ntpdate sysfsutils \
cpulimit expect stress-ng pavucontrol rtorrent screen telnet \
cabextract youtube-dl sshuttle emacs nethogs alien \
exfatprogs p7zip mosh keepassxc virt-what fdisk && \
curl -s https://packagecloud.io/install/repositories/ookla/speedtest-cli/script.deb.sh | bash && \
apt-get -y install speedtest
STOPSIGNAL SIGRTMIN+3
CMD [ "/sbin/init" ]
According to the docs one can also use "userns-remap": "default" in daemon.json to let the Docker daemon handle the user/group creation.
Nevertheless, it didn't work for me:
$ docker run --name ubuntu_systemd_local -it --rm --tmpfs /run --tmpfs /run/lock --tmpfs /tmp --cgroupns private ubuntu_systemd:local
Failed to mount cgroup at /sys/fs/cgroup/systemd: Permission denied
[!!!!!!] Failed to mount API filesystems.
Exiting PID 1...
@darkdragon-001 I'm not sure how our setups differ. I just tried that again on Fedora 37/systemd 251.14/docker-ce 24.0.5 and the container starts up and runs systemd. I don't see that mount /sys/fs/cgroup/systemd on the host or in the container, but I don't get that error message either.
$ docker run \
-it \
--rm \
--name ubuntu_systemd_local \
--tmpfs /tmp \
--tmpfs /run \
--tmpfs /run/lock \
--cgroupns private \
ubuntu_systemd:local
systemd 249.11-0ubuntu3.9 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY -P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization docker.
Detected architecture x86-64.
Welcome to Ubuntu 22.04.3 LTS!
Queued start job for default target Graphical Interface.
[ OK ] Created slice Slice /system/getty.
[ OK ] Created slice …e /system/modprobe.
[ OK ] Created slice … and Session Slice.
[ OK ] Started Dispat…le Directory Watch.
[ OK ] Started Forwar…ll Directory Watch.
etc.
I ran these commands in the ubuntu_systemd_local container:
# findmnt | grep cgroup
| `-/sys/fs/cgroup cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot
# findmnt | grep systemd
(empty output)
My docker host (Fedora 37) /proc/cmdline is
BOOT_IMAGE=(hd0,msdos1)/fedora_boot/vmlinuz-6.4.12-100.fc37.x86_64 root=UUID=3d47da0b-7a0a-4852-b3d7-68707b78c669 ro rootflags=subvol=fedora_root resume=UUID=1bed99c8-b9d6-43ee-8f4e-3bcf23dfe693 net.ifnames=0
@darkdragon-001 I see a similar error message here for lxc:
Considering that podman supports running systemd inside containers flawlessly, I think it's time for docker to provide a definitive solution to this problem, which has been around for years.
The problematic part is related to cgroup-v2 hosts. The only reliable solution I found comes from:
mviereck/x11docker#349 (comment)
But it requires starting the container and then entering its mount namespace to remount /sys/fs/cgroup in rw mode. It's also tricky to implement that in k8s pod manifests.
It seems that, to support cgroup-v2 hosts, the only missing part is the mount of /sys/fs/cgroup in rw mode, which maybe should be allowed as an option also for unprivileged containers.
@marco-a-itl This config works on Fedora 37/systemd 251.14/docker-ce 24.0.5:
@marco-a-itl This config works on Fedora 37/systemd 251.14/docker-ce 24.0.5:
But it requires changes to the docker daemon configuration. It's not always possible to do that. For example, in k8s how would you implement such config ?
@marco-a-itl Well, you're interacting in the moby (docker) repository. We all know k8s doesn't care about docker, except for the docker images.
Yes, maybe this is not the best place to discuss this issue in a general form.
However k8s commonly uses containerd. Where is the mount of /sys/fs/cgroup handled precisely?
@marco-a-itl I just built the Ubuntu docker container with systemd I mentioned in #42275 (comment) I can run any commands that you would like for me to run in it to see that info.
Thank you, but actually I meant that somewhere in the docker engine stack there is some code that always mounts the directory /sys/fs/cgroup as ro. I'm no expert, so I'm not sure if this happens in containerd, or in other parts.
And it seems to me that this is what prevents systemd from starting inside an unprivileged container that was configured with --cgroupns private (i.e. the default and most secure option) on a cgroup-v2 host.
And it seems to me that this is what prevents systemd from starting inside an unprivileged container that was configured with --cgroupns private
I don't use the --privileged option in this setup.
This is the mount shown in the container:
# findmnt | grep cgroup
│ └─/sys/fs/cgroup cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot
This is the mount shown in the container:
# findmnt | grep cgroup
│ └─/sys/fs/cgroup cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot
This is ok (mode is rw). However I assume that you obtained this result with userns-remapping.
I think that it should be possible to have the same result without such daemon option, with the proper modifications on the docker engine, like podman does.
For everyone lurking around
As the discussion seems to continue and people are not able to find the stuff...
Workaround (even without the usage of --privileged) already mentioned ->HERE<-
Spent some time looking at this today trying to run a systemd container under rootless docker. The docker daemon is running under a --user systemd unit. I am on wsl with cgroups v2 enabled.
docker run -it --rm --tmpfs /tmp --tmpfs /run registry.access.redhat.com/ubi8/ubi-init:8.8
Failed to create /init.scope control group: Read-only file system.
docker creates the cgroup mount in the container, mounted readonly
docker run -it --rm --tmpfs /tmp --tmpfs /run -v /sys/fs/cgroup:/sys/fs/cgroup registry.access.redhat.com/ubi8/ubi-init:8.8
Failed to create /init.scope control group: Permission denied.
The "fake root" inside the container doesn't have permission to modify the cgroup that I mounted
docker run -it --rm --tmpfs /tmp --tmpfs /run --cgroupns=host registry.access.redhat.com/ubi8/ubi-init:8.8
Failed to create /user.slice/user-1000.slice/user@1000.service/user.slice/..../init.scope control group: Read-only file system
Docker mounts the host cgroupns and systemd tries to create a cgroup at the appropriate level, but docker mounted the filesystem readonly still
docker run -it --rm --tmpfs /tmp --tmpfs /run -v /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service:/sys/fs/cgroup/user.slice/user@1000.service --cgroupns=host registry.access.redhat.com/ubi8/ubi-init:8.8
This works. It correctly creates the cgroup under the docker slice under the user slice (you can mount the whole cgroupns rw, but it wasn't necessary).
docker run -it --rm --tmpfs /tmp --tmpfs /run -v /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service:/sys/fs/cgroup/user.slice/user@1000.service registry.access.redhat.com/ubi8/ubi-init:8.8
error mounting "/sys/fs...": read-only file system
I can't mount that folder r/w into the container with private cgroupns mode, presumably because docker set up the fake mount readonly.
I cannot find any documentation at all of what the expected behavior of cgroupns=private is supposed to be. Should it be a transparent mapping to a parent context? If so, it should probably be mounted rw rather than ro. Also, systemd docs https://systemd.io/CONTAINER_INTERFACE/ seem to imply that's not a best practice anyway.
It seems to me that the best approach for my situation is to just set the default cgroupns back to 'host' to get this working properly.
I cannot find any documentation at all of what the expected behavior of cgroupns=private is supposed to be. Should it be a transparent mapping to a parent context? If so, it should probably be mounted rw rather than ro. Also, systemd docs https://systemd.io/CONTAINER_INTERFACE/ seem to imply that's not a best practice anyway.
--cgroupns=private means that the container runtime will create a cgroup namespace for the container (podman's docs are more explicit about this).
You may find my blog post informative: https://lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/
I also have some tests that exercise different container setup modes for running systemd: https://github.com/LewisGaul/systemd-containers
We need something like what's described in containers/podman#14322 (reply in thread) (--security-opt unmask=/sys/fs/cgroup).
I've just tested it, it seems to work flawlessly with
docker run --rm -it -v /sys/fs/cgroup/warewulf.scope:/sys/fs/cgroup:rw --tmpfs /run --tmpfs /run/lock warewulf-1:latest /sbin/init
docker version: 26.1.4
Host runs: Arch, Kernel 6.9.4, systemd 255.7
Container runs: Debian 12.5 with systemd 252.22
Also works with a container running: Rockylinux 9.3, with systemd 252.32
Note that if the host is running the older cgroupv1, the /sys/fs/cgroup on the host is a tmpfs that's mounted as ro, and as such a lot of solutions from here won't work.
I'm having the same problem with Docker Desktop on macOS (M1), and I'm wondering if anyone has a workaround already?
Works ideally
docker run --rm --cgroupns=private --name freeipa-server-almalinux9 -ti \
-h ipa.hwdomain.lan --read-only --sysctl net.ipv6.conf.all.disable_ipv6=0 \
-v /sys/fs/cgroup/warewulf.scope:/sys/fs/cgroup/warewulf.scope:ro \
-v ~/freeipa-data:/data:Z freeipa-almalinux9
I just faced the same issue and running the container with sysbox-runc runtime helped. With docker run -it --rm --runtime=sysbox-runc my-image the container started.
I've just tested it, it seems to work flawlessly with
docker run --rm -it -v /sys/fs/cgroup/warewulf.scope:/sys/fs/cgroup:rw --tmpfs /run --tmpfs /run/lock warewulf-1:latest /sbin/init
docker version: 26.1.4
Host runs: Arch, Kernel 6.9.4, systemd 255.7
Container runs: Debian 12.5 with systemd 252.22
Also works with a container running: Rockylinux 9.3, with systemd 252.32
How did you create /sys/fs/cgroup/warewulf.scope?
How did you create /sys/fs/cgroup/warewulf.scope?
The scope is an ordinary folder, so it can be created by Docker itself during volume mount, but this does not work for me - you have a scope, but systemd is not running within the mounted cgroup.
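One way to test the observation above (a scope directory exists, but systemd is not running inside it) is to look for the container's PID 1 in the cgroup.procs files under that directory. This is a minimal sketch: `pid_in_cgroup_tree` is a hypothetical helper name, and it assumes a cgroup v2 host where every cgroup directory exposes a cgroup.procs file listing its member PIDs one per line.

```shell
# Sketch: succeed if a PID is a member of a cgroup or of any cgroup nested below it.
# pid_in_cgroup_tree is a hypothetical helper.
#   $1: cgroup directory (e.g. /sys/fs/cgroup/warewulf.scope)
#   $2: PID to look for, as seen from the namespace the check runs in
pid_in_cgroup_tree() {
    # Assumes cgroup paths without whitespace, which is the common case.
    for procs_file in $(find "$1" -name cgroup.procs 2>/dev/null); do
        # cgroup.procs holds one PID per line; require an exact line match.
        if grep -qx "$2" "$procs_file"; then
            return 0
        fi
    done
    return 1
}

# Example call (not executed here):
#   pid_in_cgroup_tree /sys/fs/cgroup/warewulf.scope "$container_pid"
```

On the host, the PID to look for would be the container's init process as the host sees it; `docker inspect --format '{{.State.Pid}}' <container>` is one way to obtain it.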