cloudfoundry-attic/warden

warden oom kills app before memory threshold crossed


From @gberche-orange on the vcap-dev thread titled:
"Understanding out of memory errors in crashed apps"

There is a corresponding story in the Pivotal Tracker for the runtime team:
"warden oom kills app before memory threshold crossed"


I completed the diagnostic script [19], which collects process state from within the buildpack, sampled every second or so, and posts it to a gist. According to these measurements, the JVM appears to be killed (by the oom killer?) even though the expected 512MB memory limit does not appear to be reached. This forces me to oversize the instances to 1024MB or more for the app to start without failure. I suspect either something swap-related or a bug in warden/cgroups. Any analysis and help is welcome. Full details follow.

The diagnostic script collects process state using the command "vmstat;ps -AFH --cols=2000;free".
An external script stops the app once all 3 instances are reported as running, which also stops the sampling at that point.
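For reference, a minimal sketch of the sampling loop (the actual diagnostics.rb also posts the samples to a gist; the log path here is only an example):

    # Minimal sketch of the per-second sampling loop; the real diagnostics.rb
    # differs and uploads its samples to a gist instead of a local file.
    COMMANDS = "vmstat; ps -AFH --cols=2000; free"

    File.open("/tmp/diagnostics.log", "a") do |log|
      loop do
        log.puts "=== sample at #{Time.now.utc} ==="
        log.puts `#{COMMANDS}`   # run inside the container
        log.flush
        sleep 1
      end
    end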

The gist names contain the instance index, the instance id, and the start time.

In this run on run.pivotal.io, 3 instances of 512MB are started. Both instance 0 and instance 1 experience OOMs [20] while "cf stats" still reports low memory usage. The latest revision of each of the following gists shows the last sample of these instances, while the revision history preserves the older samples:

diagnostics-Index0-Id7a194f92d1796d75400f46eb0e49f98c-Start2013-09-12 17:37:17 +0000
diagnostics-Index1-Id7a28ea1ef09f4a8802c43149aedc217d-Start2013-09-12 17:38:07 +0000
diagnostics-Index2-Id779a9d9408251c182bb0bbde2eeb6c15-Start2013-09-12 17:38:07 +0000

Both instance 0 and instance 1 are later restarted as distinct instances (presumably at the health manager's request):
diagnostics-Index0-Id85d1dd33c9f7452d56d63bcca8f0f01f-Start2013-09-12 17:41:39 +0000
diagnostics-Index1-Id1822f81aaa2902b3a00d69445a5281cc-Start2013-09-12 17:38:31 +0000

For the first instance 0 (Id 7a194), we see that the JVM (PID 137) gets killed at sample #7, i.e. approximately 8 seconds after startup:
https://gist.github.com/gberche-orange/8360aebb72d0421861a4/revisions
In sample #6, just before being killed, the JVM was using 233MB RSS, and all processes together summed to 252MB RSS.
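Roughly, that per-sample total is obtained by summing the RSS column (in KB) of the ps output; a sketch of the calculation, assuming the lucid64 ps -AFH column layout:

    # Rough per-sample RSS total: sum the RSS column (KB) of `ps -AFH` output.
    # Assumed column layout: UID PID PPID C SZ RSS PSR STIME TTY TIME CMD.
    lines = `ps -AFH --cols=2000`.lines.drop(1)   # skip the header row
    total_kb = lines.inject(0) { |sum, line| sum + line.split[5].to_i }
    puts "total rss: #{(total_kb / 1024.0).round(1)} MB"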

The warden limits are set on memory.memsw.limit_in_bytes and memory.limit_in_bytes, which include memory+swap according to [22]. I'm having a hard time interpreting the ps man page [21] to properly measure the swap usage of my processes; maybe [23] could help. Let me know if you have suggestions. The metrics reported by free and vmstat don't seem to be scoped to the current cgroup, but they do show steady swap usage.
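For what it's worth, a sketch of how per-process swap could be measured from the VmSwap field of /proc/<pid>/status (an assumption here is that the lucid64 kernel exposes VmSwap, which I have not verified):

    # Sketch: per-process swap usage via the VmSwap field of /proc/<pid>/status.
    # rss + swap is roughly what memory.memsw.limit_in_bytes is charged for,
    # ignoring page-cache accounting.
    Dir.glob("/proc/[0-9]*/status").each do |path|
      status  = File.read(path) rescue next      # process may have exited
      name    = status[/^Name:\s+(.+)$/, 1]
      swap_kb = status[/^VmSwap:\s+(\d+) kB/, 1]
      puts "pid #{File.basename(File.dirname(path))} #{name}: #{swap_kb} kB swapped" if swap_kb
    end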

[19] https://github.com/Orange-OpenSource/java-buildpack/blob/master/resources/jonas/diagnostics/diagnostics.rb

[20]

$ date;cf instances myapp; cf events myapp ;cf stats myapp
Thu Sep 12 17:42:05 UTC 2013

Warning: url is not a valid manifest attribute. Please remove this attribute from your manifest to get rid of this warning
Getting instances for myapp... OK

instance #0: running
started: 2013-09-12 05:42:02 PM

instance #1: running
started: 2013-09-12 05:38:48 PM

instance #2: running
started: 2013-09-12 05:38:31 PM

Warning: url is not a valid manifest attribute. Please remove this attribute from your manifest to get rid of this warning
Getting events for myapp... OK

time instance index description exit status
2013-09-12T17:37:33+00:00 0 out of memory Failure (137)
2013-09-12T17:38:30+00:00 1 out of memory Failure (137)

Warning: url is not a valid manifest attribute. Please remove this attribute from your manifest to get rid of this warning
Getting stats for myapp... OK

instance cpu memory disk
#0 0.8% 268.8M of 512M 297.4M of 1G
#1 0.1% 318.3M of 512M 297.5M of 1G
#2 0.1% 302.5M of 512M 297.5M of 1G

[21] ps man page extracts on lucid64, preceded by the related note on the 'rss' counter from the cgroup memory documentation [22]:

Note (from [22]):
Only anonymous and swap cache memory is listed as part of 'rss' stat.
This should not be confused with the true 'resident set size' or the
amount of physical memory used by the cgroup.
'rss + file_mapped' will give you resident set size of cgroup.
(Note: file and shmem may be shared among other cgroups. In that case,
file_mapped is accounted only when the memory cgroup is owner of page
cache.)

  rss       RSS    resident set size, the non-swapped physical memory that a
                   task has used (in kiloBytes). (alias rssize, rsz).

  sz        SZ     size in physical pages of the core image of the process.
                   This includes text, data, and stack space. Device mappings
                   are currently excluded; this is subject to change. See vsz
                   and rss.

  size      SZ     approximate amount of swap space that would be required if
                   the process were to dirty all writable pages and then be
                   swapped out. This number is very rough!

  vsz       VSZ    virtual memory size of the process in KiB
                   (1024-byte units). Device mappings are currently excluded;
                   this is subject to change. (alias vsize).

[22] https://www.kernel.org/doc/Documentation/cgroups/memory.txt
[23] http://www.cyberciti.biz/faq/linux-which-process-is-using-swap/

Thanks James.

Another note in https://www.kernel.org/doc/Documentation/cgroups/memory.txt could potentially explain the discrepancy between the OOM kill and the theoretical limit:

A successful write to this file does not guarantee a successful setting of
this limit to the value written into the file. This can be due to a
number of factors, such as rounding up to page boundaries or the total
availability of memory on the system. The user is required to re-read
this file after a write to guarantee the value committed by the kernel.

I'm not yet quite sure, but reading https://github.com/cloudfoundry/warden/blob/98f5e786f096e667e97b931d4416bcc1b76b5153/warden/lib/warden/container/features/mem_limit.rb#L110 it seems the file is not read back after the write. If total memory is insufficient (e.g. in the case of container overprovisioning), it could be that the requested limit is not honored and ends up lower than requested.
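For illustration, a sketch of the read-back the kernel documentation recommends (this is not warden's actual code):

    # Sketch: write the requested limit, then re-read the file to see what the
    # kernel actually committed, as Documentation/cgroups/memory.txt suggests.
    def set_and_verify_limit(cgroup_path, requested_bytes)
      limit_file = File.join(cgroup_path, "memory.limit_in_bytes")
      File.open(limit_file, "w") { |f| f.write(requested_bytes.to_s) }
      committed = File.read(limit_file).to_i
      if committed != requested_bytes
        warn "requested #{requested_bytes} bytes, kernel committed #{committed} bytes"
      end
      committed
    end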

Hi @gberche-orange,

Starting with 'cf stats', the data that's reported by that command can be up to 10 seconds old. As such, the info you get back can be quite stale - especially given you've indicated the processes die within 8-10 seconds.

Moving on to the script that you ran in the containers, it turns out that vmstat and free both run against the /proc file system. In the warden containers, the /proc file system actually provides data about the host environment - not the container; that means the only information that's accurate is from ps.
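For illustration, a minimal sketch of a check that makes this visible (the value it prints describes the DEA host, not the container):

    # Sketch: /proc/meminfo inside a warden container reports the host's
    # MemTotal, far above the container's 512M limit, which is why free and
    # vmstat (both of which read /proc) describe the host rather than the app.
    mem_total_kb = File.read("/proc/meminfo")[/^MemTotal:\s+(\d+) kB/, 1].to_i
    puts "MemTotal seen inside the container: #{mem_total_kb / 1024} MB"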

When we looked at the output of the ps command just before the termination, it looked like your RSS for Java alone was about 230MB. While that's far less than 512MB, it's quite possible that the 512MB limit was crossed during the interval between samples 6 and 7. That makes sense given your Java options allocate a 384MB heap and a 50MB permgen area. When we do the math, that leaves about 75MB of headroom - and that's not a lot.
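To spell out that arithmetic (the JVM option values are the ones cited above; the exact settings are assumptions):

    limit_mb   = 512   # container memory limit
    heap_mb    = 384   # -Xmx384m (assumed from the options cited above)
    permgen_mb = 50    # 50MB permgen cap (assumed)
    puts "headroom for stacks, code cache and native memory: #{limit_mb - heap_mb - permgen_mb} MB"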

Based on our investigation and experiments, we really think this is working as expected.

We are going to close this issue. If you still think it is a problem, you can try sampling at a higher rate (than 1s), and you may want to consider dumping the memory map from /proc/${pid}/maps, as that would help show the allocations.
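For example, a faster sampler that also captures the maps file might look like the following sketch (the pgrep-based pid discovery and the output paths are only illustrative):

    # Sketch: sample /proc/<pid>/maps for the JVM at a sub-second interval.
    pid = `pgrep -f java`.split.first
    abort "no java process found" unless pid

    50.times do |i|
      maps = File.read("/proc/#{pid}/maps") rescue break   # the process may die mid-run
      File.open("/tmp/maps-sample-#{i}.txt", "w") { |f| f.write(maps) }
      sleep 0.2                                            # sample well under the 1s interval
    end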

Would it be possible to just log the details of the memory limit breach before the container gets killed? We're experiencing exactly the same scenario that @gberche-orange described 3 years ago.

There is no additional work being done on warden, and there are no further details to log. We normally see one of two behaviors: either the container is killed before we are told it was due to an OOM situation, or we log that an OOM situation occurred, which is all we are told.

Thanks