Severe performance regression after 2.2.0 (debian bullseye -> bookworm)
Opened this issue · 14 comments
Describe the bug
Seeing high CPU usage and load on icinga2 master after upgrade to bookworm:
- 100% vs 20% CPU usage on two (virtual) cores
- Load about 130 vs less than 3 before
- Lots of check_ssl_cert checks time out
Bookworm should have version 2.60.0 of check_ssl_cert, but same issue with 2.70.
Using the old 2.2.0 script (with a small patch for the new "-m|--match" option) fixes the issue.
To Reproduce
- icinga master on two cores of
AMD EPYC 7313P
- 4GB memory (2GB swap) - not an issue, swap is completely free, almost 3GB "avail Mem" in top.
- about 300 SSL checks: mostly Active Directory Domain Controllers: IPv4+IPv6 (if present), LDAP + LDAPS, normal port + global catalog - quite a few variations for each host.
Expected behavior
Less CPU usage, no timeouts, ...
System (please complete the following information):
- OS: Debian
- OS version: bookworm (12)
- check_ssl_cert version: 2.60.0 and 2.70.0
- OpenSSL version (
openssl version
):OpenSSL 3.0.9 30 May 2023 (Library: OpenSSL 3.0.9 30 May 2023)
Additional context/output
Invocations look like this:
/usr/lib/nagios/plugins/check_ssl_cert --no-ssl3 --no-tls1 --no-tls1_1 -H $IPADDRESS -P ldap -c 14 -m $SERVERNAME -p 3268 -r /etc/ssl/trusted-cas.crt -w 30
Manual calls usually work, but are slower than before.
It could be caused by the check if the port is actually open.
Can you please post the output with the --debug-time
command line option?
Wow, quick response - thanks :)
First the output of /usr/bin/time
:
- production system (bullseye, more cores), patched v.2.0 script:
0.18user 0.04system 0:00.24elapsed 97%CPU (0avgtext+0avgdata 10900maxresident)k
0inputs+56outputs (6major+31538minor)pagefaults 0swaps
- qs system (bookworm), patched v2.2.0 script:
0.78user 0.05system 0:00.96elapsed 87%CPU (0avgtext+0avgdata 11556maxresident)k
0inputs+0outputs (1major+41023minor)pagefaults 0swaps
- qs system, v2.70.0 script:
1.61user 0.12system 0:02.72elapsed 63%CPU (0avgtext+0avgdata 11548maxresident)k
0inputs+0outputs (0major+96181minor)pagefaults 0swaps
With --debug-time
it runs for 4-5 seconds.
(I doubt that a "is the port open" check would cause high CPU usage - and I think the timeouts are only happening due to high CPU usage. Also as said before, I couldn't observe timeouts when I ran the check manually on CLI, even during high load.)
Full --debug-time
output (also has a timeout in the icinga2 history this morning, but not a DC):
check_ssl_cert_debug_time.txt
I didn't see anything suspicious (timing wise), only [DBG 0000s] '/usr/bin/openssl s_client' supports '-name': using icinga2
looks wrong (icinga2
happens to be the hostname of the system running the check)
I just installed bookworm and something is very strange: a standard check on github.com takes more than 32 seconds. On all the other Linux systems (Fedora, Debian) it takes less than 4 seconds.
I'll investigate further
Something to do with /bin/timeout
:
Example
corti@debian check_ssl_cert> time /bin/curl --silent --user-agent 'check_ssl_cert/2.70.0' --location \"http://cacerts.digicert.com/DigiCertGlobalRootCA.crt\"
real 0m0.010s
user 0m0.009s
sys 0m0.001s
corti@debian check_ssl_cert> time /bin/timeout 105 /bin/sh -c "/bin/curl --silent --user-agent 'check_ssl_cert/2.70.0' --location \"http://cacerts.digicert.com/DigiCertGlobalRootCA.crt\""
real 0m5.039s
user 0m0.006s
sys 0m0.010s
Nope ist 'curl'
corti@debian check_ssl_cert> time /bin/curl --user-agent 'check_ssl_cert/2.70.0' --location http://cacerts.digicert.com/DigiCertGlobalRootCA.crt --output /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 947 100 947 0 0 187 0 0:00:05 0:00:05 --:--:-- 198
real 0m5.052s
user 0m0.000s
sys 0m0.019s
It really seems a problem with curl:
corti@debian check_ssl_cert> time curl github.com
real 0m5.078s
user 0m0.030s
sys 0m0.017s
This takes around 10 ms on all other machines and systems I tested with.
As the problem is not related to the script, I'll close the issue.
I'm quite certain it isn't timeout/curl (should also be visible in the --debug-time
log):
# time /bin/curl --user-agent 'check_ssl_cert/2.70.0' --location http://cacerts.digicert.com/DigiCertGlobalRootCA.crt --output /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 947 100 947 0 0 55669 0 --:--:-- --:--:-- --:--:-- 59187
real 0m0.023s
user 0m0.007s
sys 0m0.000s
# time /bin/timeout 105 /bin/sh -c "/bin/curl --silent --user-agent 'check_ssl_cert/2.70.0' --location \"http://cacerts.digicert.com/DigiCertGlobalRootCA.crt\""
real 0m0.024s
user 0m0.007s
sys 0m0.003s
Given what I was seeing in top
I suspect there are more (expensive?) openssl calls than before.
$ time curl github.com
real 0m0.088s
user 0m0.010s
sys 0m0.022s
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Ok then it's only on my machine ...
And the issue is gone (as of today)
$ time curl github.com
real 0m0.060s
user 0m0.007s
sys 0m0.004s
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Given what I was seeing in
top
I suspect there are more (expensive?) openssl calls than before.
Could be, but should not have such a big impact
I looked at your output with the timing (total around 5 seconds). The only calls to OpenSSL that I see are:
- The main one with
openssl s_client
to fetch the certificate (cannot be skipped) - Three calls to
openssl x509
to check the three elements of the chain (these are just using local data and should not be expensive). You can try to skip these checks with--first-element-only
Do you get longer times only with LDAP checks or generally even with HTTPS?
Found some time to take a look at this again.
I built a wrapper script to measure openssl times, comparing debian bookworm openssl (OpenSSL 3.0.13) with a locally built OpenSSL 1.1.1w (in ~/src/openssl
, built with ./config
+ make -j8
).
#!/bin/sh
exec 9>/dev/tty
echo >&9 "TIME: openssl $@"
# exec /usr/bin/time -o /dev/fd/9 /usr/bin/openssl "$@"
LD_LIBRARY_PATH=~/src/openssl exec /usr/bin/time -o /dev/fd/9 ~/src/openssl/apps/openssl "$@"
- patched v2.2.0 script, openssl 1.1.1w: 0.29user
- patched v2.2.0 script, openssl 3.0.13: 0.81user
- HEAD (bf0addc), openssl 1.1.1w: 0.59user
- HEAD (bf0addc), openssl 3.0.13: 1.87user
It seems openssl x509 ...
takes at least 20ms or more with openssl 3.0.13, and "nothing" in openssl 1.1.1w, and the more recent version of check_ssl_certs extracts way more data from the certificates (and/or from more certificates?).
Setting SSL_CERT_FILE=
when extracting data with openssl helps a lot (my locally built openssl 1.1.1w tries to load a non-existing file from /usr/local); no idea why it loads the trusted CAs by default.