matteocorti/check_ssl_cert

Severe performance regression after 2.2.0 (debian bullseye -> bookworm)

Opened this issue · 14 comments

Describe the bug

Seeing high CPU usage and load on icinga2 master after upgrade to bookworm:

  • 100% vs 20% CPU usage on two (virtual) cores
  • Load about 130 vs less than 3 before
  • Lots of check_ssl_cert checks time out

Bookworm should have version 2.60.0 of check_ssl_cert, but same issue with 2.70.

Using the old 2.2.0 script (with a small patch for the new "-m|--match" option) fixes the issue.

To Reproduce

  • icinga master on two cores of AMD EPYC 7313P
  • 4GB memory (2GB swap) - not an issue, swap is completely free, almost 3GB "avail Mem" in top.
  • about 300 SSL checks: mostly Active Directory Domain Controllers: IPv4+IPv6 (if present), LDAP + LDAPS, normal port + global catalog - quite a few variations for each host.

Expected behavior

Less CPU usage, no timeouts, ...

System (please complete the following information):

  • OS: Debian
  • OS version: bookworm (12)
  • check_ssl_cert version: 2.60.0 and 2.70.0
  • OpenSSL version (openssl version): OpenSSL 3.0.9 30 May 2023 (Library: OpenSSL 3.0.9 30 May 2023)

Additional context/output

Invocations look like this:

/usr/lib/nagios/plugins/check_ssl_cert --no-ssl3 --no-tls1 --no-tls1_1 -H $IPADDRESS -P ldap -c 14 -m $SERVERNAME -p 3268 -r /etc/ssl/trusted-cas.crt -w 30

Manual calls usually work, but are slower than before.

It could be caused by the check if the port is actually open.
Can you please post the output with the --debug-time command line option?

Wow, quick response - thanks :)

First the output of /usr/bin/time:

  • production system (bullseye, more cores), patched v.2.0 script:
    0.18user 0.04system 0:00.24elapsed 97%CPU (0avgtext+0avgdata 10900maxresident)k
    0inputs+56outputs (6major+31538minor)pagefaults 0swaps
  • qs system (bookworm), patched v2.2.0 script:
    0.78user 0.05system 0:00.96elapsed 87%CPU (0avgtext+0avgdata 11556maxresident)k
    0inputs+0outputs (1major+41023minor)pagefaults 0swaps
  • qs system, v2.70.0 script:
    1.61user 0.12system 0:02.72elapsed 63%CPU (0avgtext+0avgdata 11548maxresident)k
    0inputs+0outputs (0major+96181minor)pagefaults 0swaps

With --debug-time it runs for 4-5 seconds.
(I doubt that a "is the port open" check would cause high CPU usage - and I think the timeouts are only happening due to high CPU usage. Also as said before, I couldn't observe timeouts when I ran the check manually on CLI, even during high load.)

Full --debug-time output (also has a timeout in the icinga2 history this morning, but not a DC):
check_ssl_cert_debug_time.txt

I didn't see anything suspicious (timing wise), only [DBG 0000s] '/usr/bin/openssl s_client' supports '-name': using icinga2 looks wrong (icinga2 happens to be the hostname of the system running the check)

I just installed bookworm and something is very strange: a standard check on github.com takes more than 32 seconds. On all the other Linux systems (Fedora, Debian) it takes less than 4 seconds.

I'll investigate further

Something to do with /bin/timeout:

Example

corti@debian check_ssl_cert> time /bin/curl    --silent --user-agent 'check_ssl_cert/2.70.0' --location \"http://cacerts.digicert.com/DigiCertGlobalRootCA.crt\"

real	0m0.010s
user	0m0.009s
sys	0m0.001s
corti@debian check_ssl_cert> time /bin/timeout 105 /bin/sh -c "/bin/curl    --silent --user-agent 'check_ssl_cert/2.70.0' --location \"http://cacerts.digicert.com/DigiCertGlobalRootCA.crt\""

real	0m5.039s
user	0m0.006s
sys	0m0.010s

Nope ist 'curl'

corti@debian check_ssl_cert> time /bin/curl --user-agent 'check_ssl_cert/2.70.0' --location http://cacerts.digicert.com/DigiCertGlobalRootCA.crt --output /dev/null 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   947  100   947    0     0    187      0  0:00:05  0:00:05 --:--:--   198

real	0m5.052s
user	0m0.000s
sys	0m0.019s

It really seems a problem with curl:

corti@debian check_ssl_cert> time curl github.com

real	0m5.078s
user	0m0.030s
sys	0m0.017s

This takes around 10 ms on all other machines and systems I tested with.

As the problem is not related to the script, I'll close the issue.

I'm quite certain it isn't timeout/curl (should also be visible in the --debug-time log):

# time /bin/curl --user-agent 'check_ssl_cert/2.70.0' --location http://cacerts.digicert.com/DigiCertGlobalRootCA.crt --output /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   947  100   947    0     0  55669      0 --:--:-- --:--:-- --:--:-- 59187

real    0m0.023s
user    0m0.007s
sys     0m0.000s
# time /bin/timeout 105 /bin/sh -c "/bin/curl    --silent --user-agent 'check_ssl_cert/2.70.0' --location \"http://cacerts.digicert.com/DigiCertGlobalRootCA.crt\""

real    0m0.024s
user    0m0.007s
sys     0m0.003s

Given what I was seeing in top I suspect there are more (expensive?) openssl calls than before.

waja commented
$ time curl github.com

real	0m0.088s
user	0m0.010s
sys	0m0.022s
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Ok then it's only on my machine ...

And the issue is gone (as of today)

$ time curl github.com

real	0m0.060s
user	0m0.007s
sys	0m0.004s
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Given what I was seeing in top I suspect there are more (expensive?) openssl calls than before.

Could be, but should not have such a big impact

I looked at your output with the timing (total around 5 seconds). The only calls to OpenSSL that I see are:

  1. The main one with openssl s_client to fetch the certificate (cannot be skipped)
  2. Three calls to openssl x509 to check the three elements of the chain (these are just using local data and should not be expensive). You can try to skip these checks with --first-element-only

Do you get longer times only with LDAP checks or generally even with HTTPS?

Found some time to take a look at this again.

I built a wrapper script to measure openssl times, comparing debian bookworm openssl (OpenSSL 3.0.13) with a locally built OpenSSL 1.1.1w (in ~/src/openssl, built with ./config + make -j8).

#!/bin/sh

exec 9>/dev/tty
echo >&9 "TIME: openssl $@"
# exec /usr/bin/time -o /dev/fd/9 /usr/bin/openssl "$@"
LD_LIBRARY_PATH=~/src/openssl exec /usr/bin/time -o /dev/fd/9 ~/src/openssl/apps/openssl "$@"
  • patched v2.2.0 script, openssl 1.1.1w: 0.29user
  • patched v2.2.0 script, openssl 3.0.13: 0.81user
  • HEAD (bf0addc), openssl 1.1.1w: 0.59user
  • HEAD (bf0addc), openssl 3.0.13: 1.87user

It seems openssl x509 ... takes at least 20ms or more with openssl 3.0.13, and "nothing" in openssl 1.1.1w, and the more recent version of check_ssl_certs extracts way more data from the certificates (and/or from more certificates?).

Setting SSL_CERT_FILE= when extracting data with openssl helps a lot (my locally built openssl 1.1.1w tries to load a non-existing file from /usr/local); no idea why it loads the trusted CAs by default.