When --cpuset-cpus argument is used, processes inspecting CPU configuration in the container see all cores
benjamincburns opened this issue · 31 comments
Output of `docker version`:
Client:
Version: 1.10.2
API version: 1.22
Go version: go1.5.3
Git commit: c3959b1
Built: Mon Feb 22 16:16:33 2016
OS/Arch: linux/amd64
Server:
Version: 1.10.2
API version: 1.22
Go version: go1.5.3
Git commit: c3959b1
Built: Mon Feb 22 16:16:33 2016
OS/Arch: linux/amd64
Output of `docker info`:
sudo docker info
Containers: 66
Running: 55
Paused: 0
Stopped: 11
Images: 110
Server Version: 1.10.2
Storage Driver: devicemapper
Pool Name: docker-253:0-73188844-pool
Pool Blocksize: 65.54 kB
Base Device Size: 10.74 GB
Backing Filesystem: ext4
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 5.769 GB
Data Space Total: 107.4 GB
Data Space Available: 22.45 GB
Metadata Space Used: 13.09 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.134 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.107-RHEL7 (2015-12-01)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
Volume: local
Network: bridge null host
Kernel Version: 3.10.0-229.14.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 251.6 GiB
Name: [redacted]
ID: [redacted]
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Provide additional environment details (AWS, VirtualBox, physical, etc.):
Physical machine
List the steps to reproduce the issue:
- Run something like
docker run -it --cpuset-cpus=0 centos:centos7
- In the container's console, run
grep processor /proc/cpuinfo | wc -l
Describe the results you received:
Output: 32
Describe the results you expected:
Output: 1
Provide additional info you think is important:
Per the title, it appears that docker 1.10.2 isn't respecting the `--cpuset-cpus` argument. We have a number of containers for applications that use thread pools sized according to the number of cores available. Since updating to 1.10.2 (from an assortment of versions starting somewhere in 1.3.x), the thread counts on our docker hosts are through the roof. [Edit: this wasn't actually linked to the update, but rather we'd deployed a few new containers which ran on mono at around the same time. This is still an issue, however.]
OS version info:
user@host ~ $ cat /etc/*release*
CentOS Linux release 7.1.1503 (Core)
Derived from Red Hat Enterprise Linux 7.1 (Source)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
CentOS Linux release 7.1.1503 (Core)
CentOS Linux release 7.1.1503 (Core)
cpe:/o:centos:centos:7
On the surface this issue looks to be similar to what's described as Ubuntu bug ID 1435571, though I can see how this behaviour might manifest from some other root cause. However in this case it may have been a kernel bug, as they've fixed it with these two kernel patches.
Knowing very little about cgroups myself, I'd also wonder if CentOS7 issue 9078 isn't related.
Either way, I raised the issue here on the chance that either this is an issue specific to docker and not the host OS, or that docker would be improved by including a workaround to this issue.
@benjamincburns can you try running the `check-config.sh` script? It's possible this is not supported or enabled in your kernel; https://github.com/docker/docker/blob/master/contrib/check-config.sh
Thanks @thaJeztah.
Before seeing your comment I fired up a fresh install of CentOS 7 and made sure it was up to date. I then installed docker according to the official installation instructions. This issue does not occur in that configuration.
I will run the check-config script in both locations and compare the output.
If it turns out that this was an issue with this feature not being supported by the kernel, I'd suggest that this script be converted into runtime checks within docker itself so that the docker CLI can fail with an appropriate error message when trying to create a container which would use kernel features that aren't supported.
I have run the `check-config.sh` script on the test VM (where things work properly), and on my actual docker host. Full output for the known-good machine is at local-vm-check-config-output.txt.
Their diff:
user@hostname:~$ diff -u docker-host-check-config-output.txt local-vm-check-config-output.txt
--- docker-host-check-config-output.txt 2016-03-01 15:01:08.238722606 +1300
+++ local-vm-check-config-output.txt 2016-03-01 15:01:26.494242760 +1300
@@ -1,5 +1,5 @@
warning: /proc/config.gz does not exist, searching other paths for kernel config ...
-info: reading kernel config from /boot/config-3.10.0-229.14.1.el7.x86_64 ...
+info: reading kernel config from /boot/config-3.10.0-327.10.1.el7.x86_64 ...
Generally Necessary:
- cgroup hierarchy: properly mounted [/sys/fs/cgroup]
Note of course that the last line is not a deletion, but the hyphen is part of the script output.
I'll see if I can't review patches which have been applied between 3.10.0-229.14.1 and 3.10.0-327.10.1.
Actually, I think the patch review is unnecessary, as this issue occurs on a different docker host in our prod environment which is already running 3.10.0-327.10.1, and the latest userspace, CentOS 7.2.1511. To avoid (or inadvertently create) confusion, I refer to this host as `host-with-latest-userspace-and-kernel` below.
Copy & pasted repro output, modified slightly to change hostname:
user@host-with-latest-userspace-and-kernel ~ $ uname -a
Linux host-with-latest-userspace-and-kernel 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
user@host-with-latest-userspace-and-kernel ~ $ docker run -it --cpuset-cpus=0 centos:centos7
[root@82cac19350b2 /]# grep processor /proc/cpuinfo | wc -l
12
The output of `check-config.sh` run on this host is identical to that of my test VM.
This also suggests that the exact CentOS version may not matter much, as both my test VM and `host-with-latest-userspace-and-kernel` are CentOS 7.2.1511, while the machine on which I originally reported the issue is CentOS 7.1.1503.
Just for completeness, below you will find the same info requested in the issue template, but for `host-with-latest-userspace-and-kernel`.
Output of `docker version`:
Client:
Version: 1.10.2
API version: 1.22
Go version: go1.5.3
Git commit: c3959b1
Built: Mon Feb 22 16:16:33 2016
OS/Arch: linux/amd64
Server:
Version: 1.10.2
API version: 1.22
Go version: go1.5.3
Git commit: c3959b1
Built: Mon Feb 22 16:16:33 2016
OS/Arch: linux/amd64
Output of `docker info`:
Containers: 48
Running: 44
Paused: 0
Stopped: 4
Images: 9
Server Version: 1.10.2
Storage Driver: devicemapper
Pool Name: docker-253:3-134434010-pool
Pool Blocksize: 65.54 kB
Base Device Size: 107.4 GB
Backing Filesystem: ext4
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 4.418 GB
Data Space Total: 107.4 GB
Data Space Available: 10.34 GB
Metadata Space Used: 9.925 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.138 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.107-RHEL7 (2015-12-01)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
Volume: local
Network: bridge null host
Kernel Version: 3.10.0-327.10.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.39 GiB
Name: redacted
ID: redacted
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
And for good measure, OS release specifics:
user@host-with-latest-userspace-and-kernel:~ $ cat /etc/*release*
CentOS Linux release 7.2.1511 (Core)
Derived from Red Hat Enterprise Linux 7.2 (Source)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
CentOS Linux release 7.2.1511 (Core)
CentOS Linux release 7.2.1511 (Core)
cpe:/o:centos:centos:7
To see if I could spot a pattern of some sort, I've tested for the presence of this on the 10 docker hosts to which I have access. The only machine on which I have not observed this issue is the clean VM I set up specifically to test this issue. Below are the configurations of the machines in question (hosts discussed above are included).
Except for the test VM, which is excluded from the machine counts in the table below, all machines tested are bare metal.
Number of Machines | Docker Version | OS | OS Version | Kernel Version |
---|---|---|---|---|
1 | 1.10.1, build 9e83765 | Ubuntu | 15.10 | 4.2.0-25-generic |
1 | 1.9.1, build a34a1d5 | Ubuntu | 15.10 | 4.2.0-30-generic |
1 | 1.7.1, build 3043001/1.7.1 | CentOS | 7.1.1503 | 3.10.0-229.11.1.el7.x86_64 |
4 | 1.8.2-el7.centos, build a01dc02/1.8.2 | CentOS | 7.2.1511 | 3.10.0-327.3.1.el7.x86_64 |
1 | 1.10.2, build c3959b1 | CentOS | 7.2.1511 | 3.10.0-327.10.1.el7.x86_64 |
2 | 1.10.2, build c3959b1 | CentOS | 7.1.1503 | 3.10.0-229.14.1.el7.x86_64 |
On the off chance that there's some difference in behaviour between `--cpuset` and `--cpuset-cpus`, I also tested `--cpuset` on one of the 4 machines running the el7 build of Docker 1.8.2. No change in behaviour.
Argh... forget everything I said about the test VM working correctly. It turns out I'd forgotten that I'd only provisioned one vcpu for the vm. Now that I've switched it to 4 vcpus, the problem occurs there, too.
I see that the proper value is being set in `cpuset.cpus` on my test VM, leading me full circle back to thinking it's a kernel issue.
[bburns@localhost ~]$ cat /sys/fs/cgroup/cpuset/docker/e047d1596aac8375c6cf711c3c241c44d2404a5203e79f36469709e131ddee49/cpuset.cpus
0
And after using `--cpuset-cpus=0,1` I see:
[bburns@localhost ~]$ cat /sys/fs/cgroup/cpuset/docker/731bf72f01f8c3305f3bbca1a1af4b5bc5fb8b0b752e78720528abc1c773fe2f/cpuset.cpus
0-1
I don't fully understand the patches I linked in my first comment, but I have verified that nothing like them has been applied to the CentOS kernel. In fact, there is no `effective_cpus` member in the `cpuset` struct in kernel 3.10.0.
So it's looking like `--cpuset-cpus` does assign processor affinity correctly; however, code which inspects the machine configuration still thinks it has access to the full core count of the machine.
To determine this I created two containers, one with `--cpuset-cpus=0` and the other with no `--cpuset-cpus` argument. In the container console I then backgrounded 4 bash `while true` loops, and checked process affinity with `ps -o pid,cpuid,comm`. On the container which had the `--cpuset-cpus=0` arg, all `cpuid` values were `0`, while on the other container multiple `cpuid` values were listed.
Question: Is solving this issue in scope for docker, or is this a kernel-level problem?
Console session:
user@host ~ $ sudo docker run -it --cpuset-cpus=0 --cpuset-mems=0 centos:centos7
[root@f887dac642a6 /]# while true; do echo blah; done > /dev/null &
[1] 14
[root@f887dac642a6 /]# while true; do echo blah; done > /dev/null &
[2] 15
[root@f887dac642a6 /]# while true; do echo blah; done > /dev/null &
[3] 16
[root@f887dac642a6 /]# while true; do echo blah; done > /dev/null &
[4] 17
[root@f887dac642a6 /]# ps -o pid,cpuid,comm
PID CPUID COMMAND
1 0 bash
14 0 bash
15 0 bash
16 0 bash
17 0 bash
18 0 ps
[root@f887dac642a6 /]# exit
user@host:~$ docker run -it centos:centos7
[root@9612d2e4c7dd /]# while true; do echo blah; done > /dev/null &
[1] 14
[root@9612d2e4c7dd /]# while true; do echo blah; done > /dev/null &
[2] 15
[root@9612d2e4c7dd /]# while true; do echo blah; done > /dev/null &
[3] 16
[root@9612d2e4c7dd /]# while true; do echo blah; done > /dev/null &
[4] 17
[root@9612d2e4c7dd /]# ps -o pid,cpuid,comm
PID CPUID COMMAND
1 0 bash
16 0 bash
17 1 bash
18 2 bash
19 3 bash
20 2 ps
[root@9612d2e4c7dd /]# exit
exit
From the Ubuntu bug report in my first comment, it looks like docker can work around this issue by creating its cgroup with `cgroup.clone_children` set to `0`.
Whoops, didn't mean to close.
hm, interesting, let me ping @LK4D4 and @anusha-ragunathan, perhaps they have some thoughts on that
Eh, that might be a red herring. I've tried doing this manually to no effect. Also it appears that `cgroup.clone_children` is only defaulting to `1` on my Ubuntu boxes. On my CentOS hosts `/sys/fs/cgroup/cpuset/docker/cgroup.clone_children` was already set to `0`.
What do you get inside the container? i.e.
docker run --rm --cpuset-cpus=0,1 ubuntu sh -c "cat /sys/fs/cgroup/cpuset/cpuset.cpus"
That command works correctly, which is good news: for the applications we control, we can inspect this file. However, for applications running in VMs like mono, this will present some pain. It'd be much simpler overall if the process didn't need to be aware that it was running within a cgroup.
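For applications that can inspect the file themselves, a minimal Python sketch of turning the kernel's CPU list format (for example `0-1`, as seen earlier in this thread) into a set of CPU ids might look like the following. The helper names `parse_cpu_list` and `cpuset_cpus` are hypothetical, the default path is the cgroup v1 location used in the command above and may differ on other setups, and strided ranges are not handled.

```python
from pathlib import Path
from typing import Optional, Set


def parse_cpu_list(text: str) -> Set[int]:
    """Parse a CPU list such as "0", "0-1" or "0,2-3" into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file lists no CPUs
        first, _, last = part.partition("-")
        # A single id ("3") has no "-", so treat it as the range 3-3.
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def cpuset_cpus(path: str = "/sys/fs/cgroup/cpuset/cpuset.cpus") -> Optional[Set[int]]:
    """Return the CPUs listed in the cpuset file, or None if it cannot be read."""
    try:
        return parse_cpu_list(Path(path).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(cpuset_cpus())
```

A thread pool could then be sized from `len(cpuset_cpus() or [])`, falling back to another source when the file is unavailable.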
To add a bit of supporting info to my last statement, I grepped mono's source quickly and found that on systems with a proper glibc, mono detects the core count via `sysconf(_SC_NPROCESSORS_ONLN)`. So, I wrote a quick and dirty C program to call this and print the result, copied it into a container started with `--cpuset-cpus=0`, and it returns the core count of the full machine.
This can be seen in the mono source at
libgc/pthread_support.c
mono/io-layer/system.c
mono/profiler/proflog.c
mono/utils/mono-proclib.c
support/map.c
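To make the gap concrete, here is a small Python sketch (not mono's code, and not the C program mentioned above) that prints both numbers side by side. It assumes Linux, where `os.sysconf("SC_NPROCESSORS_ONLN")` wraps the same `sysconf` call and `os.sched_getaffinity(0)` returns the CPUs the calling process may actually be scheduled on. The function names are hypothetical.

```python
import os


def configured_cpus() -> int:
    """Processors online on the machine, as sysconf(_SC_NPROCESSORS_ONLN) reports."""
    return os.sysconf("SC_NPROCESSORS_ONLN")


def usable_cpus() -> int:
    """Number of CPUs in this process's affinity mask (Linux only)."""
    return len(os.sched_getaffinity(0))


if __name__ == "__main__":
    # Based on the behaviour reported in this thread, a container started with
    # --cpuset-cpus=0 would be expected to show the host's full count for
    # "online" and 1 for "usable".
    print("online:", configured_cpus(), "usable:", usable_cpus())
```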
This sounds similar to #20688. There is also a nice article describing the situation: http://fabiokung.com/2014/03/13/memory-inside-linux-containers/
Yes, it certainly does. Digging into the mono source a bit further, it's also parsing `/proc/stat` in places.
I'll likely open an issue with mono to make the VM cgroup aware, however I agree with @thechile's last comment on #20688 that the container community ought to be working with kernel maintainers to sort out a solution to this problem.
Linus has a pretty famous rule that the kernel shouldn't break userspace. I'd think that the container shouldn't break userspace, either. You might argue that it's not the container, it's cgroups, but if the choice to use cgroups forces containerized processes to become cgroup aware, then from the perspective of the user it's the same result.
It's pain enough for native processes where I control thread pooling and resource allocation, but when you've got a full platform stack that you're trying to drop into a container it gets quite expensive quite quick.
I've raised a mono issue with the hope that they'll pick it up and at least work around this problem. That said, I'd rather not need to also raise issues for go, python, ruby, java, and so on.
@benjamincburns how did you end up working around this? As of Linux 5.1 this still occurs, which is a real pain when doing CPU pinning; inside the container you can still see all the CPUs, but only the ones assigned with `--cpuset-cpus` can be pinned to; the rest will error on the syscall.
> @benjamincburns how did you end up working around this?
@qlyoung as far as I can remember, we didn't.
So what's the situation with this issue? I have some code that is deciding how many processes to fork based on CPU count and it's getting the wrong number of processors.
@jdmarshall based on some additional research it seems the appropriate fix for this will ultimately be, as with all things, a kernel namespace for whatever this resource class is. If you want to know what CPUs you can actually get, you can loop through each "available" core and try to bind to it with `sched_setaffinity`. If it works then it's available, if not then it's not available to the container. I did this for AFL, if you want an example, patch is here. So maybe for your case fork off N = # cpus processes, try `sched_setaffinity` in each of them, and simply exit if it fails, then you should be left with the appropriate number of processes.
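The probing idea above can be sketched in Python without forking, assuming Linux, where `os.sched_setaffinity` wraps the same syscall. This is an illustration rather than the AFL patch; `probe_usable_cpus` is a hypothetical name, and the loop assumes CPU ids run from 0 up to `os.cpu_count() - 1`.

```python
import os
from typing import Set


def probe_usable_cpus() -> Set[int]:
    """Try to pin the calling thread to each CPU in turn; keep those that work."""
    original = os.sched_getaffinity(0)
    usable: Set[int] = set()
    try:
        for cpu in range(os.cpu_count() or 1):
            try:
                os.sched_setaffinity(0, {cpu})
            except OSError:
                continue  # the kernel refused, so this CPU is not available to us
            usable.add(cpu)
    finally:
        # Put the affinity mask back the way we found it.
        os.sched_setaffinity(0, original)
    return usable


if __name__ == "__main__":
    print(sorted(probe_usable_cpus()))
```

Where only the count is needed and the process has not narrowed its own affinity, `len(os.sched_getaffinity(0))` is likely to give the same answer with less work.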
Brendan Gregg touches on this a bit in this talk https://www.youtube.com/watch?v=bK9A5ODIgac, although it's in the context of `perf` events iirc.
How the JVM handles the active processor count may help you.
FYR https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/91a5bf6cd78c7c073f1e8851217ae3a2241d9441/hotspot/src/os/linux/vm/osContainer_linux.cpp#L517
On Intel machines I can work around this by reading `/sys/fs/cgroup/cpu/cpu.cfs_quota_us` and `/sys/fs/cgroup/cpu/cpu.cfs_period_us` and dividing the quota by the period to get the number of CPU cores the container is allowed to use.
However, this does not seem to work on AArch64 Linux machines: the `/sys/fs/cgroup/cpu` folder doesn't even exist. I've found that `/sys/fs/cgroup/cpu.max` contains two numbers separated by whitespace which resemble `cfs_quota_us` and `cfs_period_us` on Intel.
Any idea why there is this discrepancy between Intel and AArch64?
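For what it's worth, `cpu.cfs_quota_us` and `cpu.cfs_period_us` are the cgroup v1 interface files, while `cpu.max` is the cgroup v2 (unified hierarchy) equivalent, so which one exists depends on the cgroup version the host mounts rather than on the CPU architecture. A minimal Python sketch of the quota/period arithmetic for both layouts follows. The function names are hypothetical and the paths are the ones quoted above. Note that a CFS quota limits CPU time and is separate from `--cpuset-cpus`, so this only yields a number when a quota has actually been set.

```python
from pathlib import Path
from typing import Optional


def cpus_from_v1(quota_us: str, period_us: str) -> Optional[float]:
    """cgroup v1: cpu.cfs_quota_us / cpu.cfs_period_us; a quota of -1 means no limit."""
    quota, period = int(quota_us), int(period_us)
    if quota <= 0 or period <= 0:
        return None
    return quota / period


def cpus_from_v2(cpu_max: str) -> Optional[float]:
    """cgroup v2: cpu.max holds "<quota> <period>", or "max <period>" for no limit."""
    quota, period = cpu_max.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def cpu_limit() -> Optional[float]:
    """Try the v2 file first, then the v1 pair; None if no quota can be found."""
    try:
        return cpus_from_v2(Path("/sys/fs/cgroup/cpu.max").read_text())
    except OSError:
        pass
    try:
        base = Path("/sys/fs/cgroup/cpu")
        return cpus_from_v1(
            (base / "cpu.cfs_quota_us").read_text(),
            (base / "cpu.cfs_period_us").read_text(),
        )
    except OSError:
        return None
```

A quota of 150000 with a period of 100000 works out to 1.5 CPUs, so callers sizing a worker pool would probably want to round up and clamp to at least 1.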
I've been using 'nproc' on linux to get better behavior, and 'sysctl -n hw.logicalcpu' on OS X. I found this somewhere on stackoverflow.
Since I only really need this data at startup I just eat the child process overhead. I think standard lib writers are getting wise to this though. I think Node introduced a fix for this in the previous major version.
@thaJeztah how are we feeling about this issue these days?
Initially I'd hoped that there would be some way that docker could be made to work with legacy software that was written prior to cgroups existing, as well as software that was written to erroneously assume that it could make use of all cores on the host.
Ultimately it would seem that the path to achieving this goal is rooted in how cgroups restrictions are exposed by the kernel to the user space processes that are subject to those restrictions. As a result of that, I'm not sure that there's anything for the container engine to do here. I'm also no longer sure that the goal as I just stated it is even desirable, let alone achievable. That is, there's a distinct difference between "the set of CPUs available to the host" and "the set of CPUs that a process can access," and that's true in a wide variety of scenarios that have nothing to do with containerization.
With that in mind, I think this is a discussion for the kernel mailing lists, if it's even a discussion worth having. Unfortunately I don't really have the time or motivation right now to champion that conversation, but I'd encourage anyone who finds this issue to be important to take it up there.
In the meantime, I think it's probably best to close this issue. @thaJeztah if you or any other maintainers feel otherwise, please feel free to reopen.
PS:
- lxcfs does exactly this and it can be used to "give" docker this ability
- This feature is considered for Sysbox, an alternative container runtime for docker, owned by Docker Inc:
Thanks for the additional context, @felipecrs. I wish that `lxcfs` existed (or that I was aware of its existence) back when I was having this issue in 2016!
Just for clarity, are you advocating for this issue to remain open, in light of the tooling you posted?
I just worry that making this behaviour a default in moby could be problematic. For example, I think it's not uncommon for k8s clusters to set affinity for privileged cluster management & host monitoring jobs to a set of reserved CPUs that aren't used for other workloads (guarantees liveness, minimises the impact of monitoring on latency-sensitive workloads, etc).
If it were something that wasn't on by default, but could be optionally set on a container-by-container basis, that could still add utility, however.
I would love for this feature to be baked into docker, rather than having to rely on external tools that are (very) convoluted to set up.
Being able to specify it on a container-by-container basis would be the ideal, like `docker run --mask-procfs`.
Then, making it the default would be a whole different conversation that can start once such a feature exists. My limited, personal gut feeling is that it would be nicer as the default behavior. But I do not want to argue about it.
> Just for clarity, are you advocating for this issue to remain open, in light of the tooling you posted?
To be honest I'm not advocating for this issue to remain open as I have zero hope that Docker would ever implement it.