falcosecurity/falco

Memory usage keeps increasing until OOM

ciprian2k opened this issue · 15 comments

Describe the bug

Falco memory usage keeps increasing until OOM

How to reproduce it

Create a custom rule "command_args.yaml"

- rule: Suspicious Command Args Detected
  desc: Detects suspicious commands
  condition: >
    proc.args contains "--lua-exec"
  enabled: true
  output: >
    Suspicious command detected (user=%user.name command=%proc.cmdline)
  priority: WARNING
  tags: [host, data, mitre_discovery]

Run echo multiple times and see memory increase until OOM

watch -n 0.1 "echo --lua-exec"

Screenshots
[screenshot: Falco memory usage climbing over time until OOM]

Environment

  • Falco version:

Tue Jul 2 07:22:49 2024: Falco version: 0.37.1 (x86_64)
Tue Jul 2 07:22:49 2024: Falco initialized with configuration file: /etc/falco/falco.yaml
Tue Jul 2 07:22:49 2024: System info: Linux version 5.14.0-362.18.1.el9_3.x86_64 (mockbuild@x64-builder02.almalinux.org) (gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), GNU ld version 2.35.2-42.el9) #1 SMP PREEMPT_DYNAMIC Mon Jan 29 07:05:48 EST 2024
{"default_driver_version":"7.0.0+driver","driver_api_version":"8.0.0","driver_schema_version":"2.0.0","engine_version":"31","engine_version_semver":"0.31.0","falco_version":"0.37.1","libs_version":"0.14.3","plugin_api_version":"3.2.0"}

  • System info:

Tue Jul 2 07:23:18 2024: Falco version: 0.37.1 (x86_64)
Tue Jul 2 07:23:18 2024: Falco initialized with configuration file: /etc/falco/falco.yaml
Tue Jul 2 07:23:18 2024: System info: Linux version 5.14.0-362.18.1.el9_3.x86_64 (mockbuild@x64-builder02.almalinux.org) (gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), GNU ld version 2.35.2-42.el9) #1 SMP PREEMPT_DYNAMIC Mon Jan 29 07:05:48 EST 2024
Tue Jul 2 07:23:18 2024: Loading rules from file /etc/falco/falco_rules.yaml
{
"machine": "x86_64",
"nodename": "gen-alma923770-all-dev.mwp-nightswatch.e5.c.emag.network",
"release": "5.14.0-362.18.1.el9_3.x86_64",
"sysname": "Linux",
"version": "#1 SMP PREEMPT_DYNAMIC Mon Jan 29 07:05:48 EST 2024"
}

  • Cloud provider or hardware configuration:
  • OS:

NAME="AlmNAME="AlmaLinux"
VERSION="9.3 (Shamrock Pampas Cat)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.3 (Shamrock Pampas Cat)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:9::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"
ALMALINUX_MANTISBT_PROJECT="AlmaLinux-9"
ALMALINUX_MANTISBT_PROJECT_VERSION="9.3"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.3"
aLinux"
VERSION="9.3 (Shamrock Pampas Cat)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.3 (Shamrock Pampas Cat)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:9::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"
ALMALINUX_MANTISBT_PROJECT="AlmaLinux-9"
ALMALINUX_MANTISBT_PROJECT_VERSION="9.3"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.3"

  • Kernel:

Linux gen-alma923770-all-dev.mwp-nightswatch.e5.c.emag.network 5.14.0-362.18.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Jan 29 07:05:48 EST 2024 x86_64 x86_64 x86_64 GNU/Linux

  • Installation method:

Installation from source

Hi! Thanks for opening this issue! So, it seems there might be a memleak when the rule triggers.
Can you test the same with the latest Falco 0.38.1? Thank you very much for reporting!

Also, in case it is still present, can you share your configuration too? Or are you using the default one?

So, after

Events detected: 7921227
Rule counts by severity:
WARNING: 7921227
Triggered rules by rule name:
Suspicious Command Args Detected: 7921227

I see a +8M increase in resident memory:

160604 root 20 0 2471164 214944 193440 S 26,2 0,3 0:11.68 falco
160604 root 20 0 2479436 222784 193440 S 30,8 0,3 11:38.71 falco

We got a problem, Houston. But not that big, at least here.

EDIT: going to run with the valgrind massif tool to check whether we can easily spot the leak!

Ok, on second thought, considering that I am running

watch -n 0.1 "echo --lua-exec"

I'd expect around 10 events per second, which means 36k events per hour. How could I reach 8 million events in like 30 minutes 🤣
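(Back-of-the-envelope: 7,921,227 events in roughly 30 minutes is about 4,400 events per second, i.e. roughly 440x what the watch loop alone should produce.)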

Hi @FedeDP,

Thanks for investigating my problem. I've now tested on Falco 0.38.1 and it has the same issue.

Digging more into the problem, I found out that the memory leak occurs because I have http_output enabled.

http_output:
  enabled: true
  url: http://samplemywebsite.com/api/falco

This is the only difference in configuration vs the default one.

I confirm I can reproduce the memory leak. I used the exact rule and a pod running `while true; do echo "--lua-exec"; done`.

The memory usage increases until an OOM:
[screenshot: container memory usage rising until OOMKilled]

  - containerID: containerd://bc51e480adba8a724c297ca9481c6d463c2f0cf556bf61bc37e1af77cf7d6686
    image: docker.io/falcosecurity/falco-no-driver:0.38.1
    imageID: docker.io/falcosecurity/falco-no-driver@sha256:a59cadbaf556c05296dfc8f522786b2138404814797ffbc9ee3b26b336d06903
    lastState:
      terminated:
        containerID: containerd://9e7c69c0b51f9c8a014a35a1b2adfa11277fc3a188e65f04e0f09ef4c2238b9e
        exitCode: 137
        finishedAt: "2024-07-02T11:02:50Z"
        reason: OOMKilled
        startedAt: "2024-07-02T10:37:23Z"

I will test without the http_output.enabled=true.

I confirm the leak disappears once the http_output is disabled:

[screenshot: memory usage flat with http_output disabled]

Thank you both very much! I will give it a look and report back :)

Out of curiosity, which libcurl version are you using? The bundled one or the system one?

EDIT: Anyway, I am able to reproduce by enabling http output.

So, it seems like there is something wrong in the curl_easy_perform call here: https://github.com/falcosecurity/falco/blob/master/userspace/falco/outputs_http.cpp#L118
Commenting it out fixes the issue (well, then http output does nothing). I am still digging!

So, I tried to reproduce this with a minimal libcurl-only example but couldn't.
Then I remembered that our outputs queue is unbounded by default, which means it can grow indefinitely. The rule you provided does not specify any event type, so it matches every syscall/action made by the process called with args --lua-exec; that's why it generates so many output events.

TLDR: setting outputs_queue.capacity to e.g. 100 in the Falco config fixes the "issue"; see the sketch below.
But please mind that this is not a bug: it is by-design behavior, exacerbated by the very wide condition of the rule.
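
For reference, a minimal falco.yaml sketch of that setting (the value 100 is just an example):

outputs_queue:
  # 0 (the default) means unbounded; any positive value bounds the queue,
  # and Falco drops alerts once it is full.
  capacity: 100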

Hi,
You are right, setting outputs_queue.capacity in the config resolves "my problem".
I didn't know why the memory was increasing; I really thought it was a memory leak.

Thank you again for your help and sorry for the time spent on this matter.

No problem sir, thanks for asking!
/milestone 0.39.0

Hi @FedeDP, sorry for bringing this up again: I hit the same issue on 0.38.0. May I ask, if I set outputs_queue.capacity to some fixed value, does that mean Falco will drop some events once the cap is reached? If yes, do we have other options to mitigate this OOM issue?
The difference in our environment is that we have a lot of incoming/outgoing network traffic.

does that mean Falco will drop some events once the cap is reached?

Yes, exactly.

If yes, do we have other options to mitigate this OOM issue?

Unfortunately no; though if your system is generating too many events, perhaps some rule is too noisy and should be made stricter. See the sketch below.
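
As an illustration (my sketch, not the reporter's config), the reproducer rule from above could be scoped to process spawns only, mirroring the spawned_process macro from the default ruleset, so it fires once per exec instead of on every syscall of the matching process:

- rule: Suspicious Command Args Detected
  desc: Detects suspicious commands (scoped to process spawns only)
  condition: >
    evt.type in (execve, execveat) and evt.dir = < and
    proc.args contains "--lua-exec"
  output: >
    Suspicious command detected (user=%user.name command=%proc.cmdline)
  priority: WARNING
  tags: [host, data, mitre_discovery]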

Got it, thanks for answering. Do we have any metrics we can use for monitoring once a fixed capacity is chosen? I read https://falco.org/docs/metrics/falco-metrics/ but I am having a hard time understanding what each metric actually means, e.g. falcosecurity_scap_n_retrieve_evts_drops_total vs. falcosecurity_scap_n_store_evts_drops_total, the difference between them, etc.
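
(For context, those counters only show up once metrics collection is enabled in falco.yaml. A minimal sketch, using the option names shipped in the 0.38 default config; the interval value is illustrative:)

metrics:
  enabled: true
  interval: 15m                        # how often to emit a stats snapshot
  output_rule: true                    # deliver the snapshot as an internal Falco alert
  resource_utilization_enabled: true   # CPU/RSS, useful for watching memory growth
  state_counters_enabled: true         # per the metrics docs linked above, includes the scap n_*_evts_drops counters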