Memory usage keeps increasing until OOM
ciprian2k opened this issue · 15 comments
Describe the bug
Falco memory usage keeps increasing until OOM
How to reproduce it
Create a custom rule "command_args.yaml"
- rule: Suspicious Command Args Detected
  desc: Detects suspicious commands
  condition: >
    proc.args contains "--lua-exec"
  enabled: true
  output: >
    Suspicious command detected (user=%user.name command=%proc.cmdline)
  priority: WARNING
  tags: [host, data, mitre_discovery]
Run echo multiple times and see memory increase until OOM
watch -n 0.1 "echo --lua-exec"
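For reference, a sketch of how the custom rule file above can be loaded (the path /etc/falco/command_args.yaml is just illustrative; the config key is rules_files on recent Falco versions, rules_file on older ones):
rules_files:
  - /etc/falco/falco_rules.yaml
  # hypothetical location for the custom rule file shown above
  - /etc/falco/command_args.yaml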
Environment
- Falco version:
Tue Jul 2 07:22:49 2024: Falco version: 0.37.1 (x86_64)
Tue Jul 2 07:22:49 2024: Falco initialized with configuration file: /etc/falco/falco.yaml
Tue Jul 2 07:22:49 2024: System info: Linux version 5.14.0-362.18.1.el9_3.x86_64 (mockbuild@x64-builder02.almalinux.org) (gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), GNU ld version 2.35.2-42.el9) #1 SMP PREEMPT_DYNAMIC Mon Jan 29 07:05:48 EST 2024
{"default_driver_version":"7.0.0+driver","driver_api_version":"8.0.0","driver_schema_version":"2.0.0","engine_version":"31","engine_version_semver":"0.31.0","falco_version":"0.37.1","libs_version":"0.14.3","plugin_api_version":"3.2.0"}
- System info:
Tue Jul 2 07:23:18 2024: Falco version: 0.37.1 (x86_64)
Tue Jul 2 07:23:18 2024: Falco initialized with configuration file: /etc/falco/falco.yaml
Tue Jul 2 07:23:18 2024: System info: Linux version 5.14.0-362.18.1.el9_3.x86_64 (mockbuild@x64-builder02.almalinux.org) (gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), GNU ld version 2.35.2-42.el9) #1 SMP PREEMPT_DYNAMIC Mon Jan 29 07:05:48 EST 2024
Tue Jul 2 07:23:18 2024: Loading rules from file /etc/falco/falco_rules.yaml
{
  "machine": "x86_64",
  "nodename": "gen-alma923770-all-dev.mwp-nightswatch.e5.c.emag.network",
  "release": "5.14.0-362.18.1.el9_3.x86_64",
  "sysname": "Linux",
  "version": "#1 SMP PREEMPT_DYNAMIC Mon Jan 29 07:05:48 EST 2024"
}
- Cloud provider or hardware configuration:
- OS:
NAME="AlmNAME="AlmaLinux"
VERSION="9.3 (Shamrock Pampas Cat)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.3 (Shamrock Pampas Cat)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:9::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"
ALMALINUX_MANTISBT_PROJECT="AlmaLinux-9"
ALMALINUX_MANTISBT_PROJECT_VERSION="9.3"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.3"
aLinux"
VERSION="9.3 (Shamrock Pampas Cat)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.3 (Shamrock Pampas Cat)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:9::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"
ALMALINUX_MANTISBT_PROJECT="AlmaLinux-9"
ALMALINUX_MANTISBT_PROJECT_VERSION="9.3"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.3"
- Kernel:
Linux gen-alma923770-all-dev.mwp-nightswatch.e5.c.emag.network 5.14.0-362.18.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Jan 29 07:05:48 EST 2024 x86_64 x86_64 x86_64 GNU/Linux
- Installation method:
Installation from source
Hi! Thanks for opening this issue! So, it seems there might be a memleak when the rule triggers.
Can you test the same with the latest Falco, 0.38.1? Thank you very much for reporting!
Also, in case it is still present, can you share the configuration too? Or are you using the default one?
So, after
Events detected: 7921227
Rule counts by severity:
WARNING: 7921227
Triggered rules by rule name:
Suspicious Command Args Detected: 7921227
I see a +8M increase in resident memory:
160604 root 20 0 2471164 214944 193440 S 26,2 0,3 0:11.68 falco
160604 root 20 0 2479436 222784 193440 S 30,8 0,3 11:38.71 falco
We got a problem, Houston. But not that big, at least here.
EDIT: going to run with valgrind massif tool to check if we can easily spot the leak!
Ok, on second thought, considering that I am running
watch -n 0.1 "echo --lua-exec"
I'd expect around 10 events per second, which means ~36k events per hour. How could I reach 8 million events in about 30 minutes? 🤣
Hi @FedeDP,
Thanks for investigating my problem. I've now tested on Falco 0.38.1 and it has the same issue.
Digging more into the problem, I found out that the memory leak occurs because I have http_output enabled.
http_output:
  enabled: true
  url: http://samplemywebsite.com/api/falco
This is the only difference in configuration vs the default one.
I confirm I can reproduce the memory leak. I used the exact rule and a pod running with `while true; do echo "--lua-exec"; done`.
The memory usage increases until an OOM:
- containerID: containerd://bc51e480adba8a724c297ca9481c6d463c2f0cf556bf61bc37e1af77cf7d6686
  image: docker.io/falcosecurity/falco-no-driver:0.38.1
  imageID: docker.io/falcosecurity/falco-no-driver@sha256:a59cadbaf556c05296dfc8f522786b2138404814797ffbc9ee3b26b336d06903
  lastState:
    terminated:
      containerID: containerd://9e7c69c0b51f9c8a014a35a1b2adfa11277fc3a188e65f04e0f09ef4c2238b9e
      exitCode: 137
      finishedAt: "2024-07-02T11:02:50Z"
      reason: OOMKilled
      startedAt: "2024-07-02T10:37:23Z"
I will test without http_output.enabled=true.
Thank you both very much! I will give it a look and report back :)
Out of curiosity, which libcurl version are you using? The bundled one or the system one?
EDIT: Anyway, I am able to reproduce it by enabling http output.
So, it seems like there is something wrong with the curl_easy_perform call here: https://github.com/falcosecurity/falco/blob/master/userspace/falco/outputs_http.cpp#L118
Commenting it out fixes the issue (though http output then does nothing, of course). I am still digging!
So, I tried to repro this with a minimal libcurl-only example but couldn't.
Then I remembered that our outputs queue is unbounded by default, which means it can grow indefinitely; the rule you provided does not specify any syscall, so it matches every syscall/action made by the process called with args --lua-exec, and that's why it generates so many output events.
TLDR: setting outputs_queue.capacity to e.g. 100 in the Falco config fixes the "issue".
But please mind that this is not an issue; it is by-design behavior, exacerbated by the very wide condition of the rule.
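For clarity, a minimal falco.yaml sketch of that workaround (the value 100 is just an example; to my understanding, capacity 0 means unbounded, which is the default):
outputs_queue:
  # Maximum number of buffered output events; once the queue is full, further
  # events are dropped instead of accumulating in memory. 0 = unbounded (default).
  capacity: 100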
Hi,
You are right, setting the outputs_queue capacity in the config resolves "my problem".
I didn't know why the memory was increasing; I really thought it was a memory leak.
Thank you again for your help, and sorry for the time spent on this matter.
No problem sir, thanks for asking!
/milestone 0.39.0
Hi @FedeDP, sorry for bringing this up again, but I hit the same issue on 0.38.0. May I ask: if I set outputs_queue.capacity to some fixed value, does it mean Falco will drop some events if the cap is met? If yes, do we have some other options to mitigate this OOM issue?
The difference in our environment is that we have a lot of incoming/outgoing network traffic.
does it mean Falco will drop some events if the cap is met
Yes, exactly.
If yes, do we have some other options to mitigate this OOM issue?
Unfortunately no; though if your system is generating that many output events, perhaps some rule is too noisy and should be made stricter.
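As an illustration (a sketch only, not a drop-in replacement for the rule in this report), pinning the condition to exec events keeps it from matching every syscall made by the process:
- rule: Suspicious Command Args Detected
  desc: Detects suspicious commands
  condition: >
    evt.type in (execve, execveat) and evt.dir = < and proc.args contains "--lua-exec"
  enabled: true
  output: >
    Suspicious command detected (user=%user.name command=%proc.cmdline)
  priority: WARNING
  tags: [host, data, mitre_discovery]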
Got it, thanks for answering. Do we have any metrics we can use to monitor drops once a fixed capacity is chosen? I read https://falco.org/docs/metrics/falco-metrics/ but I'm having a hard time understanding what each metric actually means, e.g. falcosecurity_scap_n_retrieve_evts_drops_total vs falcosecurity_scap_n_store_evts_drops_total, the difference between them, etc.