apptainer/singularity

Singularity, cgroup memory.limits, mmapped strangeness

pja237 opened this issue · 8 comments

Apology

I'm opening this issue even though it seems to be related to, or may be the root cause of, these similar ones:

#5041
#5800

Perhaps these findings will help you help us understand what is going on and how we could mitigate these situations.

Version of Singularity:

Reproduced successfully with two versions:

singularity version 3.6.4-1.el7
singularity version 3.7.1-1.el7

OS:

CentOS Linux release 7.9.2009 (Core)
3.10.0-1127.19.1.el7.x86_64

Expected behavior

When Singularity processes running in a cgroup reach memory.limit_in_bytes, they should be killed by the OOM killer.
In some cases this happens, but we noticed that in the specific case below it does not, and it causes several quite severe effects on the nodes where it fails.

Actual behavior

When the running processes in question use the following mmap/memset pattern (though probably not limited to it):

res=(char *) mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0);
memset(res, 1, size);

...this somehow blocks the cgroup/OOM mechanism from kicking in (details below in "Steps to reproduce"), leaves processes in the cgroup stuck in uninterruptible sleep, produces sudden IO load on the node, and in some cases renders the node completely unusable.

In real workloads we have experienced:

  • complete unresponsiveness of the node
  • ssh timeouts
  • sssd timeouts
  • /proc/PID/stack hanging on:
[<ffffffffc068b725>] squashfs_cache_get+0x105/0x3c0 [squashfs]
[<ffffffffc068bff1>] squashfs_get_datablock+0x21/0x30 [squashfs]
[<ffffffffc068d272>] squashfs_readpage+0x8a2/0xc30 [squashfs]
[<ffffffffaa7cb748>] __do_page_cache_readahead+0x248/0x260
[<ffffffffaa7cbd01>] ra_submit+0x21/0x30
[<ffffffffaa7c0e75>] filemap_fault+0x105/0x420
[<ffffffffaa7edf6a>] __do_fault.isra.61+0x8a/0x100
[<ffffffffaa7ee51c>] do_read_fault.isra.63+0x4c/0x1b0
[<ffffffffaa7f5d80>] handle_mm_fault+0xa20/0xfb0
[<ffffffffaad8d653>] __do_page_fault+0x213/0x500
[<ffffffffaad8da26>] trace_do_page_fault+0x56/0x150
[<ffffffffaad8cfa2>] do_async_page_fault+0x22/0xf0
[<ffffffffaad897a8>] async_page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

Note: this code pattern is the one tested as a PoC, but it may well be that file-backed mmap and/or other mem*() functions have the same or a similar effect.
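
To make experimenting with such variations (file-backed mappings, other mem*() calls) easier, here is the fragment above as a minimal, complete program. This is only an illustrative sketch, not the PoC from the gist used in the steps below:

/* Minimal sketch of the trigger pattern above; not the mempoc.c PoC. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
    size_t size;
    char *res;

    if (argc != 2) {
        printf("Syntax: %s SIZE_IN_GB\n", argv[0]);
        return 1;
    }
    size = (size_t)atol(argv[1]) * 1024 * 1024 * 1024;

    /* shared anonymous mapping, as in the fragment above */
    res = (char *) mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (res == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* writing the whole mapping in one go is the step that, under
       singularity on the 3.10 kernel, appears to wedge the cgroup OOM path */
    memset(res, 1, size);

    return 0;
}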

Steps to reproduce this behavior

Download and compile mempoc.c (https://gist.github.com/pja237/b0e9a49be64a20ad1af905305487d41a).

NOTE: touching memory in the commented-out for-loop instead:

https://gist.github.com/pja237/b0e9a49be64a20ad1af905305487d41a#file-mempoc-c-L41

DOES NOT PRODUCE THE ISSUE; the OOM killer handles that case perfectly!

gcc -Wall -o mempoc mempoc.c
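
In case the gist is unreachable, the following is a rough approximation of what mempoc does, reconstructed from its usage line and the output shown below; details differ from the real gist, which should be preferred:

/* Rough approximation of mempoc.c (see the gist for the real PoC).
 * Usage: ./mempoc SIZE_IN_GB NUM_CHILDREN */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    size_t size;
    int children, i;

    if (argc != 3) {
        printf("Syntax %s SIZE_IN_GB NUM_CHILDREN\n", argv[0]);
        return 1;
    }
    size = (size_t)atol(argv[1]) * 1024 * 1024 * 1024;
    children = atoi(argv[2]);

    for (i = 0; i < children; i++) {
        if (fork() == 0) {
            char *res;

            printf("Calling mmap %s from child %d with %zu bytes...\n",
                   argv[1], i, size);
            res = (char *) mmap(NULL, size, PROT_READ | PROT_WRITE,
                                MAP_ANONYMOUS | MAP_SHARED, -1, 0);
            if (res == MAP_FAILED) {
                perror("mmap");
                _exit(1);
            }

            printf("Waiting 10 seconds...\n");
            sleep(10);

            printf("Filling pages with memset...\n");
            memset(res, 1, size);   /* the step that triggers the issue */

            /* The gist also has a commented-out for-loop (line 41) that
             * touches the memory with plain stores instead of memset;
             * as noted above, that variant gets OOM-killed cleanly. */

            pause();                /* keep the mapping alive until killed */
            _exit(0);
        }
    }

    while (wait(NULL) > 0)
        ;
    return 0;
}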

Versions:

[root@stg-c2-1 ~]# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
[root@stg-c2-1 ~]# uname -r
3.10.0-1127.19.1.el7.x86_64
[root@stg-c2-1 ~]# singularity --version
singularity version 3.7.1-1.el7

Setup cgroup:

(e.g. 5 GB for this case)

[root@stg-c2-1 ~]# cd /sys/fs/cgroup/memory/
[root@stg-c2-1 memory]# mkdir test
[root@stg-c2-1 memory]# cd test
[root@stg-c2-1 test]# cat memory.limit_in_bytes 
9223372036854771712
[root@stg-c2-1 test]# free -g
              total        used        free      shared  buff/cache   available
Mem:            169           2         166           0           0         165
Swap:             0           0           0
[root@stg-c2-1 test]# echo $((5*1024*1024*1024)) > memory.limit_in_bytes
[root@stg-c2-1 test]# cat memory.limit_in_bytes 
5368709120
[root@stg-c2-1 test]# cat cgroup.procs 
[root@stg-c2-1 test]# echo $$ > cgroup.procs 
[root@stg-c2-1 test]# cat cgroup.procs 
8593
10051
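
(Aside: the same setup can also be scripted. Below is a minimal sketch of a hypothetical cgsetup.c helper that performs the equivalent of the commands above, using the same cgroup v1 paths; error handling is mostly omitted. Running ./cgsetup /tmp/mempoc 2 10 would then be roughly equivalent to the shell steps above followed by the run below.)

/* cgsetup.c - hypothetical helper, equivalent to the shell steps above:
 * create the "test" memory cgroup, set a 5 GB limit, join it, then exec
 * the given command inside the cgroup. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return; }
    fputs(val, f);
    fclose(f);
}

int main(int argc, char **argv)
{
    char pid[32];

    mkdir("/sys/fs/cgroup/memory/test", 0755);                     /* mkdir test */
    write_str("/sys/fs/cgroup/memory/test/memory.limit_in_bytes",
              "5368709120");                                       /* 5 GB */
    snprintf(pid, sizeof(pid), "%d", (int)getpid());
    write_str("/sys/fs/cgroup/memory/test/cgroup.procs", pid);     /* join it */

    if (argc > 1)
        execvp(argv[1], &argv[1]);   /* exec keeps the PID, so the command
                                        stays inside the cgroup */
    return 0;
}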

Run mempoc without singularity

Spin up 10 children, each mmapping and memsetting 2 GB (20 GB total).
Expected result: 8 children get OOM-killed; the remaining 2 (2 x 2 GB = 4 GB, which fits under the 5 GB limit) stay alive.

[root@stg-c2-1 test]# /tmp/mempoc
Syntax ./mempoc SIZE_IN_GB NUM_CHILDREN
[root@stg-c2-1 test]# /tmp/mempoc 2 10
Calling mmap 2 from child 0 with 2147483648 bytes...
Waiting 10 seconds...
...

Observing OOM kills, we see the expected behaviour (8 killed):

[root@stg-c2-1 ~]# /usr/share/bcc/tools/oomkill 
Tracing OOM kills... Ctrl-C to stop.
13:14:26 Triggered by PID 10695 ("mempoc"), OOM kill of PID 10697 ("mempoc"), 1310720 pages, loadavg: 0.05 0.30 0.31 3/792 10706
13:14:26 Triggered by PID 10699 ("mempoc"), OOM kill of PID 10696 ("mempoc"), 1310720 pages, loadavg: 0.05 0.30 0.31 2/791 10706
13:14:27 Triggered by PID 10702 ("mempoc"), OOM kill of PID 10695 ("mempoc"), 1310720 pages, loadavg: 0.05 0.30 0.31 3/790 10706
13:14:27 Triggered by PID 10700 ("mempoc"), OOM kill of PID 10702 ("mempoc"), 1310720 pages, loadavg: 0.05 0.30 0.31 2/789 10706
13:14:27 Triggered by PID 10699 ("mempoc"), OOM kill of PID 10703 ("mempoc"), 1310720 pages, loadavg: 0.05 0.30 0.31 2/787 10706
13:14:28 Triggered by PID 10701 ("mempoc"), OOM kill of PID 10700 ("mempoc"), 1310720 pages, loadavg: 0.05 0.30 0.31 2/786 10706
13:14:28 Triggered by PID 10698 ("mempoc"), OOM kill of PID 10699 ("mempoc"), 1310720 pages, loadavg: 0.05 0.30 0.31 3/785 10706
13:14:28 Triggered by PID 10698 ("mempoc"), OOM kill of PID 10698 ("mempoc"), 1310720 pages, loadavg: 0.05 0.30 0.31 2/784 10706

And two remain (they fit in the 5 GB cgroup):

[root@stg-c2-1 ~]# ps fa
  PID TTY      STAT   TIME COMMAND
 9092 pts/2    Ss     0:00 -bash
10693 pts/2    S+     0:00  \_ /usr/bin/python /usr/share/bcc/tools/oomkill
 8593 pts/1    Ss     0:00 -bash
10824 pts/1    S+     0:00  \_ /tmp/mempoc 2 10
10829 pts/1    S+     0:03      \_ /tmp/mempoc 2 10
10834 pts/1    S+     0:02      \_ /tmp/mempoc 2 10
 8548 pts/0    Ss     0:00 -bash
11221 pts/0    R+     0:00  \_ ps fa

Run mempoc with singularity

Singularity image built from:

[root@stg-c2-1 test]# cat /tmp/test.sing
Bootstrap: docker
From: centos:7

%post
    yum install -y epel-release
    yum install -y stress-ng strace

%environment
    export LC_ALL=C
    export PATH=/usr/games:$PATH

%files
    a.out /usr/local/bin

%runscript
    exec $@

Run mempoc with singularity and wait a bit...

[root@stg-c2-1 test]# singularity run /tmp/test.sif /tmp/mempoc 2 10
Calling mmap 2 from child 0 with 2147483648 bytes...
Waiting 10 seconds...
...

Processes now sit in uninterruptible sleep instead of being killed; in some cases the OOM killer did fire, but with no effect.

[root@stg-c2-1 ~]# ps fa
  PID TTY      STAT   TIME COMMAND
 9092 pts/2    Ss     0:00 -bash
11532 pts/2    S+     0:01  \_ /usr/bin/python /usr/share/bcc/tools/oomkill
 8593 pts/1    Ss     0:00 -bash
11533 pts/1    Sl+    0:00  \_ Singularity runtime parent
11550 pts/1    S+     0:00      \_ /tmp/mempoc 2 10
11570 pts/1    D+     0:00          \_ /tmp/mempoc 2 10
11571 pts/1    D+     0:00          \_ /tmp/mempoc 2 10
11572 pts/1    D+     0:00          \_ /tmp/mempoc 2 10
11573 pts/1    D+     0:00          \_ /tmp/mempoc 2 10
11574 pts/1    D+     0:00          \_ /tmp/mempoc 2 10
11575 pts/1    D+     0:00          \_ /tmp/mempoc 2 10
11576 pts/1    D+     0:00          \_ /tmp/mempoc 2 10
11577 pts/1    D+     0:00          \_ /tmp/mempoc 2 10
11578 pts/1    D+     0:01          \_ /tmp/mempoc 2 10
11579 pts/1    D+     0:00          \_ /tmp/mempoc 2 10
 8548 pts/0    Ss     0:00 -bash
11582 pts/0    R+     0:00  \_ ps fa

iostat shows IO load that previously did not exist on the machine.

[root@stg-c2-1 ~]# iostat -m 1
Linux 3.10.0-1127.19.1.el7.x86_64 (stg-c2-1.cbe.staging.clip.vbc.ac.at)         02/24/2021      _x86_64_        (76 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          27.78    0.00    0.02    0.00    0.01   72.19

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
vda               2.42         0.01         0.02      16857      26788
scd0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.00    0.49    0.60    0.00   98.90

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
vda            2123.53        76.41         0.01         77          0
scd0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.48    1.07    0.00   97.45

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
vda            7891.92       112.57         0.01        111          0
scd0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.00    1.35    0.29    0.00   98.34

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
vda            1600.99        67.01         0.01         67          0
scd0              0.00         0.00         0.00          0          0

Another way to reproduce

Install the stress-ng package from EPEL (or similar) and run:

stress-ng --vm 10 --vm-bytes 5G --vm-keep &

How did you install Singularity

RPM from EPEL:

[root@stg-c2-1 test]# rpm -qi singularity
Name        : singularity
Version     : 3.7.1
Release     : 1.el7
Architecture: x86_64
Install Date: Wed 10 Feb 2021 11:41:15 AM CET
Group       : Unspecified
Size        : 124581035
License     : BSD-3-Clause-LBNL
Signature   : RSA/SHA256, Wed 13 Jan 2021 04:26:22 PM CET, Key ID 6a2faea2352c64e5
Source RPM  : singularity-3.7.1-1.el7.src.rpm
Build Date  : Tue 12 Jan 2021 09:22:59 PM CET
Build Host  : buildhw-x86-16.iad2.fedoraproject.org
Relocations : (not relocatable)
Packager    : Fedora Project
Vendor      : Fedora Project
URL         : https://www.sylabs.io/singularity/
Bug URL     : https://bugz.fedoraproject.org/singularity
Summary     : Application and environment virtualization
Description :
Singularity provides functionality to make portable
containers that can be used across host environments.

UPDATE:

We tested this on a RHEL 8 box with good results: under memory pressure, even with Singularity, the OOM killer was doing its job.

test box:

root@test:/sys/fs/cgroup/memory#cat /etc/redhat-release
Red Hat Enterprise Linux release 8.3 (Ootpa)
root@test:/sys/fs/cgroup/memory#uname -r
4.18.0-240.10.1.el8_3.x86_64
root@test:/sys/fs/cgroup/memory#singularity --version
singularity version 3.7.1

setup:

root@test:/sys/fs/cgroup/memory#mkdir test
root@test:/sys/fs/cgroup/memory#cd test/
root@test:/sys/fs/cgroup/memory/test#echo $((5*1024*1024*1024)) > memory.limit_in_bytes
root@test:/sys/fs/cgroup/memory/test#echo 0 > memory.swappiness

Same test: 5 GB cgroup, 10 x 2 GB stressors, expecting to see 2 left unkilled, but...

root@test:/sys/fs/cgroup/memory/test#singularity run ~/src/memtest/test.sif ~/src/memtest/mempoc 2 10
Calling mmap 2 from child 0 with 2147483648 bytes...
Calling mmap 2 from child 1 with 2147483648 bytes...
Waiting 10 seconds...
Waiting 10 seconds...
Calling mmap 2 from child 2 with 2147483648 bytes...
Waiting 10 seconds...
Calling mmap 2 from child 3 with 2147483648 bytes...
Waiting 10 seconds...
Calling mmap 2 from child 4 with 2147483648 bytes...
Calling mmap 2 from child 5 with 2147483648 bytes...
Waiting 10 seconds...
Waiting 10 seconds...
Calling mmap 2 from child 6 with 2147483648 bytes...
Waiting 10 seconds...
Calling mmap 2 from child 7 with 2147483648 bytes...
Waiting 10 seconds...
Calling mmap 2 from child 8 with 2147483648 bytes...
Waiting 10 seconds...
Calling mmap 2 from child 9 with 2147483648 bytes...
Waiting 10 seconds...
Filling pages with memset...
Filling pages with memset...
Filling pages with memset...
Filling pages with memset...
Filling pages with memset...
Filling pages with memset...
Filling pages with memset...
Filling pages with memset...
Filling pages with memset...
Filling pages with memset...
root@test:/sys/fs/cgroup/memory/test#

Interesting to note: here the OOM killer kills ALL of the processes inside for some reason, and does not leave 2 running as it does on RHEL 7:

14:52:22 Triggered by PID 2336201 ("mempoc"), OOM kill of PID 2336195 ("mempoc"), 1310720 pages, loadavg: 0.27 0.52 0.32 6/354 2336211
14:52:22 Triggered by PID 2336199 ("mempoc"), OOM kill of PID 2336200 ("mempoc"), 1310720 pages, loadavg: 0.27 0.52 0.32 10/354 2336211
14:52:22 Triggered by PID 2336202 ("mempoc"), OOM kill of PID 2336199 ("mempoc"), 1310720 pages, loadavg: 0.27 0.52 0.32 9/354 2336211
14:52:22 Triggered by PID 2336194 ("mempoc"), OOM kill of PID 2336194 ("mempoc"), 1310720 pages, loadavg: 0.27 0.52 0.32 13/354 2336211
14:52:22 Triggered by PID 2336198 ("mempoc"), OOM kill of PID 2336198 ("mempoc"), 1310720 pages, loadavg: 0.27 0.52 0.32 11/354 2336211
14:52:22 Triggered by PID 2336196 ("mempoc"), OOM kill of PID 2336202 ("mempoc"), 1310720 pages, loadavg: 0.27 0.52 0.32 11/354 2336211
14:52:22 Triggered by PID 2336196 ("mempoc"), OOM kill of PID 2336197 ("mempoc"), 1310720 pages, loadavg: 0.27 0.52 0.32 5/348 2336211
14:52:22 Triggered by PID 2336193 ("mempoc"), OOM kill of PID 2336196 ("mempoc"), 1310720 pages, loadavg: 0.27 0.52 0.32 4/348 2336211
14:52:22 Triggered by PID 2336201 ("mempoc"), OOM kill of PID 2336201 ("mempoc"), 1310720 pages, loadavg: 0.27 0.52 0.32 6/348 2336211
14:52:22 Triggered by PID 2336193 ("mempoc"), OOM kill of PID 2336193 ("mempoc"), 1310720 pages, loadavg: 0.27 0.52 0.32 6/348 2336211

The behavior where OOM killing works on RHEL 8 but not on RHEL 7 is what we've observed previously. As best we can tell at present, this is a kernel-version-specific issue related to cgroup memory handling when namespaces are in use, and not a Singularity issue.

I am able to reproduce this reliably by other means (a different test program) on RHEL 7, but not on RHEL 8 or Fedora 33. It also does not reproduce if I manually install a mainline kernel on RHEL 7.

However, there is conflicting info in #5800 where a new kernel on Ubuntu 18.04 did not address the problem.

Hello,

This is a templated response that is being sent out to all open issues. We are working hard on 'rebuilding' the Singularity community, and a major task on the agenda is finding out what issues are still outstanding.

Please consider the following:

  1. Is this issue a duplicate, or has it been fixed/implemented since being added?
  2. Is the issue still relevant to the current state of Singularity's functionality?
  3. Would you like to continue discussing this issue or feature request?

Thanks,
Carter

Hello Carter,

Sorry for the late reply; here is a summary of the situation regarding this issue.

As David confirmed above, it seems this issue is the result of the interplay between the 3.x kernel's cgroup memory controller in CentOS 7 and Singularity under memory pressure.
Whether Singularity can work around this issue we do not know, and we have not received any feedback on that so far.

Since this issue has gone somewhat stale here, we will eventually upgrade our cluster to CentOS/RHEL 8 to get the kernel to at least 4.x, where the OOM subsystem has been reworked and this issue no longer exists.

Until then, we're also experimenting with an alternative workaround we developed here: https://github.com/pja237/kp_oom

If there is any news on this topic from your side, we'd be happy to discuss it further; otherwise I guess you can close the ticket.

Best,
Petar

stale commented

This issue has been automatically marked as stale because it has not had activity in over 60 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale commented

This issue has been automatically closed because no response was provided within 7 days.

nlvw commented

Until then, we're also experimenting with an alternative workaround we developed here: https://github.com/pja237/kp_oom

@pja237 Have you rolled that fix out to production, or is there another workaround for this issue? We ran into this the hard way on our HPC cluster after rolling out Singularity and have yet to determine a fix. Upgrading our cluster from RHEL 7 to RHEL 8 is not possible, at least for the next 6 months.

Hey @nlvw,

In the absence of better (or any) solutions, and needing to postpone the cluster's upgrade to RHEL 8 to 2022 (expected Q2-Q3),
we rolled the kp_oom module out to our production cluster on 21.5.2021.
The rollout was done in stages: we started with a test subset of nodes, monitoring for reboots (Prometheus node-exporter),
and after a successful test period went all-in by June.

The biggest risk of this method is that if something goes wrong there is no gentle way to recover: the kernel panics and the node reboots. We accepted this since a hard reboot was the worst-case manifestation of this issue anyway.
During our test, and basically ever since, we have not yet caught a single panic on our regular partitions/software (* see one possible exception under "Experiences" below).

Since May 2021 we've upgraded the kernel, Slurm and Singularity multiple times, so at the moment we're running:

cluster: 205 nodes (8 different hardware configurations, with and without GPUs)
jobs: average 35k jobs daily
dist: CentOS Linux release 7.9.2009 (Core)
kernel: 3.10.0-1160.53.1.el7.x86_64
singularity: 3.8.3-1.el7
slurm: 20.11.5-1

Experiences so far:

  1. Besides no. 3 in this list, it has been pretty uneventful so far; we only had to adapt it a couple of times to cater to some specific use cases, like "running singularity inside tmux in an interactive srun on the node", but we had no other issues with it.

To give you a feel for how often we hit it, I just ran dmesg | grep -E "KP_OOM.*SIGKILL" | wc -l and summed it across the cluster: we have 24547 SIGKILLs delivered by kp_oom.
(We're in the middle of rebooting the nodes for a kernel upgrade, so this number is normally much higher.)

  2. Nextflow pipelines caught under memory pressure and killed by kp_oom can't recover, due to the Nextflow "monitoring architecture", which traps and tracks job status on the nodes in files.

    • in the case of SIGKILL this can't work anymore: the user's job gets killed and the node survives (as expected), but Nextflow does not register this and does not resubmit the job, because it received nothing in those files (SIGKILL)
  3. For the first time, last week, on our CERN grid partition running CVMFS software, we noticed one particular piece of software that kp_oom cannot terminate with SIGKILL (probably because of heavy IO or some other blocking/D-state activity); those nodes eventually end up in a kernel panic and reboot (presumably something we would have had to do manually anyway).
    Since we can ask the grid operators to increase the default memory request for these jobs, and this is so far the first such case, we will not follow it up.

Hope this helps you.