NLnetLabs/unbound

unbound kind of "loops" (100% CPU time) and is no longer responsive

mistersixt opened this issue · 7 comments

perf-top-output
top-with-threads

Describe the bug
Since version 1.20.0, unbound starts using 100% CPU time after a few hours, sometimes 200% or more, and DNS answers take a very long time. After some more time, DNS requests are no longer answered at all. A regular "kill" does not stop the process; it needs "kill -9". Attached are the output of "top" and of "perf top -p <pid-of-unbound>".

The first "unbound" entries in "perf top" show "lruhash_lookup" and "rbtree_find_less_equal".

The number of DNS requests is limited using iptables and hashlimit (150 per minute with a burst of 45).
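To give a rough feel for how little traffic such a limit lets through, here is a minimal token-bucket sketch in Python. It is only a simplified model and not the actual hashlimit implementation: the function name `simulate_token_bucket` is made up for illustration, and it assumes a single bucket that starts full with 45 tokens and refills at 150 tokens per minute.

```python
def simulate_token_bucket(arrival_times, rate_per_minute=150.0, burst=45):
    """Return one boolean per packet: True if the packet passes the limiter.

    Simplified model (an assumption, not the kernel's hashlimit code):
    one bucket holding at most `burst` tokens, refilled continuously at
    `rate_per_minute`; each accepted packet consumes one token.
    """
    refill_per_second = rate_per_minute / 60.0
    tokens = float(burst)  # the bucket starts full
    last = None
    decisions = []
    for now in arrival_times:
        if last is not None:
            # Refill for the elapsed time, never above the burst size.
            tokens = min(float(burst), tokens + (now - last) * refill_per_second)
        last = now
        if tokens >= 1.0:
            tokens -= 1.0
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions
```

In this model a sudden burst of 100 packets lets only 45 through, while a steady 2 packets per second (below the 2.5 per second refill rate) passes completely, so the query load reaching unbound should stay small.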

This behaviour also occurs with the current master source, 1.20.1 right now.

To reproduce
Steps to reproduce the behavior:

  1. see above.

Expected behavior
Unbound should be responsive all the time, and should not start looping after a few hours (currently about every 6 to 12 hours).

System:

  • Unbound version: 1.20.0 and 1.20.1 (current master)
  • OS: Debian 12 on ARM64
  • unbound -V output:

Version 1.20.1

Configure line: --with-libevent
Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.0.13 30 Jan 2024
Linked modules: dns64 respip validator iterator

Additional information
Add any other information that you may have gathered about the issue here.

It should not loop like that; I would like to know what unbound is looping over. The perf output suggests this is entirely within libevent and some anonymous functions, which I assume are inlined in libevent.

Is it possible to get an ordinary stack trace, for example with gstack <pid>, maybe several times to catch different parts of the loop? The lruhash lookup and rbtree find results are likely from the other threads, which are perhaps handling ordinary cache responses and lookups. It would be nice to be able to reproduce the issue, but I have no clue what the cause is.
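Taking several stack traces in a row could be scripted roughly as follows. This is only a sketch: it uses gdb in batch mode instead of gstack (assuming gdb is installed and you are allowed to attach to the process), and the helper names `gdb_backtrace_command` and `sample_stacks` are made up for illustration.

```python
import subprocess
import time


def gdb_backtrace_command(pid):
    """Build a gdb command line that prints a backtrace for every thread."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def sample_stacks(pid, count=5, interval=1.0, run=None):
    """Collect `count` backtraces of process `pid`, `interval` seconds apart.

    `run` takes a command list and returns its output as text; by default
    the command is executed with subprocess (attaching usually needs root
    or the same user as the target process).
    """
    if run is None:
        def run(cmd):
            done = subprocess.run(cmd, capture_output=True, text=True, check=False)
            return done.stdout

    samples = []
    for i in range(count):
        samples.append(run(gdb_backtrace_command(pid)))
        if i + 1 < count:
            time.sleep(interval)  # wait so samples hit different parts of the loop
    return samples
```

Calling `sample_stacks(pid_of_unbound, count=5, interval=2.0)` while the process is spinning and comparing the outputs should show which frames stay at the top of the busy thread.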

Hi,

after I had a very similar situation with Prosody (XMPP server) running on the very same server box, which used 100% CPU time after a while and printed "too many open files" into the error log, I increased the "nofile" entry in /etc/security/limits.conf, and unbound as well as Prosody have been running fine since (ulimit was showing 1024, I increased it to 50,000).

Cannot tell whether this is related, but there does seem to be a connection somehow.
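One way to check whether a running process is actually close to its open-files limit is to compare the number of entries in /proc/<pid>/fd with the "Max open files" line in /proc/<pid>/limits. The Python sketch below assumes Linux with /proc mounted and enough permissions to read the target process; the function names are made up for illustration.

```python
import os


def parse_nofile_limits(limits_text):
    """Extract (soft, hard) "Max open files" values from /proc/<pid>/limits text.

    Each value is an int, or None when the kernel reports "unlimited".
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise LookupError("no 'Max open files' line found")


def open_fd_count(pid):
    """Count the file descriptors currently open in process `pid` (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def nofile_usage(pid):
    """Return (open descriptors, soft limit, hard limit) for process `pid`."""
    with open(f"/proc/{pid}/limits") as fh:
        soft, hard = parse_nofile_limits(fh.read())
    return open_fd_count(pid), soft, hard
```

If the first number returned by `nofile_usage(pid_of_unbound)` stays far below the soft limit while the process is looping, running out of file descriptors is probably not the cause.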

Kind regards, mistersixt.

A similar issue, with flame graph attached
unbound

Increasing the ulimit did not help, by the way; I still see unbound looping from time to time. Is there anything you need me to do while the process is looping?