Huge CPU usage on deleting/updating decisions
ne20002 opened this issue · 20 comments
I gave 0.0.27 a new try on my BPi-R3.
It seems as if updating the ipset or deleting IPs from it uses a relatively large amount of CPU.
root@BPiR3:/etc/config# cat /var/log/crowdsec-firewall-bouncer.log
time="16-08-2023 18:05:21" level=info msg="backend type : nftables"
time="16-08-2023 18:05:21" level=info msg="nftables initiated"
time="16-08-2023 18:05:21" level=info msg="Using API key auth"
time="16-08-2023 18:05:21" level=info msg="Processing new and deleted decisions . . ."
time="16-08-2023 18:05:30" level=info msg="1250 decisions deleted"
time="16-08-2023 18:05:30" level=info msg="36551 decisions added"
time="16-08-2023 18:11:17" level=info msg="2 decisions deleted"
time="16-08-2023 18:15:22" level=info msg="2 decisions deleted"
time="16-08-2023 18:15:46" level=info msg="2 decisions deleted"
time="16-08-2023 18:20:30" level=info msg="1 decision added"
time="16-08-2023 18:26:31" level=info msg="1 decision deleted"
time="16-08-2023 18:27:29" level=info msg="2 decisions deleted"
The first spikes are at 20:05 (18:05 in the log); the second spikes, at 20:10, relate to deleting just two IPs from the set. The rise in temperature confirms the CPU usage spikes.
System: BananaPi R3 with OpenWrt 23.05.0-rc2 and the OpenWrt crowdsec-firewall-bouncer package 0.0.27.
The system handles 20 Mbit/s of traffic (IPTV) with just 1% CPU over the displayed period. The increase in CPU load and temperature started when the bouncer was started.
Maybe there is some room for optimizations?
Adding bigger batches of IPs (like another 16033 decisions) creates a smaller spike. So I wonder whether two different code paths are used for adding small versus large batches of IPs.
Maybe there is some room for optimizations?
Do you have any recommendations? We already split into smaller batches and only fetch IP-based decisions. Any feedback you can provide we can explore.
EDIT: are you subscribed to any third-party blocklists via the console?
EDIT 2: I realised that the chunking optimisation is in version 0.0.28, and I saw your issue about ARMv8, so I'm guessing you can't test it without that version?
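For context, the splitting into smaller batches mentioned above (and the "failed to flush chunk ..., will retry each one" messages in the logs further down) follows a chunk-then-fallback pattern. Below is a minimal sketch of what such a pattern could look like, assuming the google/nftables Go API; the chunk size, the helper name deleteInChunks and the error handling are illustrative, not the bouncer's actual code.

package sketch

import (
	"log"

	"github.com/google/nftables"
)

// deleteInChunks queues deletions in fixed-size chunks. If a whole chunk
// fails to flush (for example because some elements already expired and
// no longer exist in the set), it retries each element individually.
func deleteInChunks(conn *nftables.Conn, set *nftables.Set, elems []nftables.SetElement, chunkSize int) {
	for start := 0; start < len(elems); start += chunkSize {
		end := start + chunkSize
		if end > len(elems) {
			end = len(elems)
		}
		chunk := elems[start:end]
		if err := conn.SetDeleteElements(set, chunk); err != nil {
			log.Printf("failed to queue chunk: %v", err)
			continue
		}
		if err := conn.Flush(); err != nil {
			// One missing element fails the whole batch, so fall back to
			// deleting one element at a time, best effort.
			log.Printf("failed to flush chunk of %d elements, will retry each one: %v", len(chunk), err)
			for _, e := range chunk {
				if err := conn.SetDeleteElements(set, []nftables.SetElement{e}); err != nil {
					continue
				}
				_ = conn.Flush() // a missing element now only fails its own delete
			}
		}
	}
}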
So I have been doing some testing for a potential bug in crowdsec, and at the same time thought I'd test nftables. I see that most of the CPU usage originates from the nft binary that is called when inserting/updating/deleting, so at this moment it might be the way nftables handles the update that causes the spike.
Update: turning off Prometheus seems to solve the CPU usage for me.
Update 2: it seems the metrics code inspects the chains, which can produce a massive JSON payload, and doing this every 10 seconds was the problem for me... hmmm
Maybe we should do #314 for sanity
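To illustrate why per-interval metrics can be so costly on a small router: dumping a table that holds tens of thousands of set elements as JSON every update interval is a lot of work by itself. A rough sketch of that pattern, assuming the nft CLI is shelled out to; the exact command, the table name and the 10-second interval are taken from this discussion and the config below, and the code is illustrative, not the bouncer's implementation.

package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		// "nft -j list table ip crowdsec" returns the whole table,
		// set elements included, as one large JSON document.
		out, err := exec.Command("nft", "-j", "list", "table", "ip", "crowdsec").Output()
		if err != nil {
			log.Printf("nft list failed: %v", err)
			continue
		}
		// With tens of thousands of banned IPs this payload can reach
		// several megabytes; generating and parsing it every 10 seconds
		// costs real CPU on hardware like a BPi-R3.
		log.Printf("metrics scrape: %d bytes of JSON", len(out))
	}
}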
Hi
yes, I wanted the ARMv8 build so I can test with my BananaPi R3. I currently have 0.0.27 running. If there are optimizations in 0.0.28, I'd be happy to test it. :)
And I have one IP list subscribed: Tor exit nodes.
Regarding the metrics, this is my config.yaml:
root@BPiR3:~# cat /var/etc/crowdsec/bouncers/crowdsec-firewall-bouncer.yaml
mode: nftables
pid_dir: /var/run/
update_frequency: 10s
daemonize: true
log_mode: file
log_dir: /var/log/
log_level: info
log_compression: true
log_max_size: 100
log_max_backups: 3
log_max_age: 30
api_url: http://cs-lapi:8014/
api_key: *****************************
insecure_skip_verify: true
disable_ipv6: false
deny_action: drop
deny_log: false
supported_decisions_type:
- ban
#to change log prefix
deny_log_prefix: "crowdsec: "
#to change the blacklists name
blacklists_ipv4: crowdsec-blacklists
blacklists_ipv6: crowdsec6-blacklists
#type of ipset to use
ipset_type: nethash
#if present, insert rule in those chains
iptables_chains:
- INPUT
# - FORWARD
# - DOCKER-USER
## nftables
nftables:
  ipv4:
    enabled: true
    set-only: true
    table: crowdsec
    chain: input
  ipv6:
    enabled: true
    set-only: true
    table: crowdsec6
    chain: input
# packet filter
pf:
  # an empty value disables the anchor
  anchor_name: ""
prometheus:
  enabled: false
  listen_addr: 127.0.0.1
  listen_port: 60601
Hi @ne20002,
as I said in the other ticket, the arm64 version should work for v8. I don't recall specific performance changes in the nftables handling between 0.0.27 and 0.0.28, but there was a refactoring and dependency updates, so I'd be interested to know how it works for you.
Never mind, I see now the note about Prometheus. The next -rc4 will disable the metrics by default as well.
Their configuration already had it disabled, so it wouldn't be a driving factor; however, I am also interested in the results from 0.0.27 -> 0.0.28.
I'll give it a try tomorrow.
Also, OpenWrt 23.05.0-rc3 has been announced with an nftables update.
For those interested in performance info: this is a BananaPi R3 with the crowdsec-firewall-bouncer 0.0.27 OpenWrt package on OpenWrt 23.05.0-rc2, showing the last two hours:
time="22-08-2023 06:09:14" level=info msg="1 decision added"
time="22-08-2023 06:12:19" level=info msg="1 decision deleted"
time="22-08-2023 06:12:47" level=info msg="1 decision deleted"
time="22-08-2023 06:13:17" level=info msg="1 decision deleted"
time="22-08-2023 06:34:33" level=info msg="26 decisions deleted"
time="22-08-2023 06:34:57" level=info msg="16105 decisions added"
time="22-08-2023 06:37:28" level=info msg="1 decision deleted"
time="22-08-2023 06:38:28" level=info msg="1 decision deleted"
time="22-08-2023 07:08:05" level=info msg="1 decision added"
time="22-08-2023 07:08:52" level=info msg="1 decision deleted"
time="22-08-2023 07:34:37" level=info msg="failed to flush chunk of 25 elements, will retry each one: conn.Receive: netlink receive: no such file or directory"
time="22-08-2023 07:34:47" level=info msg="25 decisions deleted"
time="22-08-2023 07:54:20" level=info msg="1 decision added"
time="22-08-2023 07:57:13" level=info msg="1 decision added"
time="22-08-2023 08:01:01" level=info msg="1 decision added"
Times are offset by 2h due to the timezone.
And this is now with 0.0.28-rc3:
time="22-08-2023 08:41:20" level=info msg="Starting crowdsec-firewall-bouncer v0.0.28-rc3-5afbe527ad361e578fb1bdf523a2c34c31258149"
time="22-08-2023 08:41:20" level=info msg="backend type : nftables"
time="22-08-2023 08:41:20" level=info msg="nftables initiated"
time="22-08-2023 08:41:20" level=info msg="Using API key auth"
time="22-08-2023 08:41:20" level=info msg="Processing new and deleted decisions . . ."
time="22-08-2023 08:41:35" level=info msg="1110 decisions deleted"
time="22-08-2023 08:41:42" level=info msg="38361 decisions added"
time="22-08-2023 08:50:57" level=info msg="1 decision added"
time="22-08-2023 08:57:08" level=info msg="1 decision deleted"
time="22-08-2023 08:58:23" level=info msg="1 decision added"
time="22-08-2023 09:07:45" level=info msg="1 decision added"
time="22-08-2023 09:18:01" level=info msg="1 decision deleted"
time="22-08-2023 09:29:28" level=info msg="1 decision deleted"
time="22-08-2023 09:34:35" level=info msg="failed to flush chunk of 15 elements, will retry each one: conn.Receive: netlink receive: no such file or directory"
time="22-08-2023 09:34:55" level=info msg="15 decisions deleted"
time="22-08-2023 09:40:11" level=info msg="1 decision deleted"
time="22-08-2023 09:51:04" level=info msg="1 decision deleted"
time="22-08-2023 09:57:34" level=info msg="1 decision deleted"
time="22-08-2023 10:00:14" level=info msg="1 decision added"
time="22-08-2023 10:17:28" level=info msg="1 decision added"
time="22-08-2023 10:23:07" level=info msg="1 decision added"
Hi @ne20002
thanks for the testing. It looks like there is a regression, but I can't explain it from the changes between the two versions. The biggest change was from rc1 to rc2, where we updated the google/nftables library. We should see if there's a difference in behavior between those.
We are preparing a test bench, but we don't have machines with 2 GB; the smallest is a Pi4 with 4 GB.
From what I see, adding a huge number of IPs to an (initially empty) set does not seem to consume more processing than just adding or deleting single IPs.
I also see the bigger spike at 10:57 (and 10:58). According to the log there are two insertions/deletions of a single IP close together. Is it possible that update operations on the set are done in parallel, causing extra processing overhead?
Maybe this one shows it more clearly:
time="22-08-2023 15:34:47" level=info msg="13 decisions deleted"
time="22-08-2023 16:17:07" level=info msg="1 decision added"
time="22-08-2023 16:34:33" level=info msg="20 decisions deleted"
time="22-08-2023 16:34:40" level=info msg="14994 decisions added"
time="22-08-2023 17:17:07" level=info msg="1 decision deleted"
Updating more decisions causes less processor usage. It's kind of interesting...
Hi @ne20002
I have a crowdsec binary that fixes the issue for me by raising the batch size for decision operations. I can send you the link to test. ARMv8, right?
Also, if I understand correctly you are the maintainer of the OpenWrt port; there are a couple of changes required for building v1.5.3 that we should discuss, but not in this repository.
As written on discourse.crowdsec.net, I don't run the local API on ARM but in a podman pod on amd64, with an image from hub.docker.com.
Shall I use a dev image to test this?
Hi!
No need to test anymore; the changes have been merged for 1.5.3, which is out next week. So if it improves performance, good; otherwise we'll see for 1.5.4.
Hi @mmetc
I'm now running 1.5.4 and 0.0.28-rc5, but the CPU spikes have only slightly reduced. Anyway, it is not really a problem (processing does consume resources, after all).
I also still have a few entries in the log like this:
time="22-09-2023 07:53:41" level=info msg="failed to flush chunk of 20 elements, will retry each one: conn.Receive: netlink receive: no such file or directory"
As I understand it, this happens when you try to delete entries that have already been purged by nftables, since all those entries are added with a TTL. I just had the thought of maybe switching from removing an IP entry to re-adding the entry with a really small TTL (like one second), as a way to get rid of the warnings (and of the need for a second pass that then processes them one by one).
It relies on the idea that if an entry is in the set with a higher TTL, the TTL will be reduced when you re-add it with a smaller one.
Maybe it's worth a try?
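A minimal sketch of the re-add-with-short-TTL idea, assuming the google/nftables Go API and the table/set names from the config above; the helper name expireSoon is made up for illustration, and the assumption that re-adding an existing element shortens its remaining TTL would need to be verified against the kernel and set configuration.

package main

import (
	"log"
	"net"
	"time"

	"github.com/google/nftables"
)

// expireSoon re-adds an already-banned IP with a 1-second timeout instead
// of deleting it, so nftables expires it on its own shortly afterwards.
func expireSoon(ip net.IP) error {
	conn := &nftables.Conn{}
	table := &nftables.Table{Family: nftables.TableFamilyIPv4, Name: "crowdsec"}
	set, err := conn.GetSetByName(table, "crowdsec-blacklists")
	if err != nil {
		return err
	}
	elem := []nftables.SetElement{{
		Key:     ip.To4(),
		Timeout: time.Second, // assumption: this replaces the element's remaining TTL
	}}
	if err := conn.SetAddElements(set, elem); err != nil {
		return err
	}
	return conn.Flush()
}

func main() {
	// 192.0.2.1 is a documentation address used purely as an example.
	if err := expireSoon(net.ParseIP("192.0.2.1")); err != nil {
		log.Fatal(err)
	}
}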
That's exactly what happens.
Really? You are actually not removing entries from the set but instead re-adding them with a small TTL? Why would that sometimes fail and require a one-by-one retry?