mej/nhc

nhc_hw_gather_data() too slow on large core count

jpecar opened this issue · 5 comments

Hi,
We recently deployed some 2x 64c EPYC servers with all 256 threads enabled. I was surprised to discover that nhc always times out on these machines.
After some poking around I discovered that the /proc/cpuinfo parsing in nhc_hw_gather_data() takes 31.5s to finish. I tested it on different machines: it takes 0.6s on a 16c/32t node, 9.5s on a 64c/128t node and, as I said, 31.5s on a 128c/256t node.
Funnily enough, the watchdog always kicks in at exactly 30s, no matter what I set TIMEOUT to. That's another thing I have to look into.
But for now, are there any pure bash options to speed up that loop? In the meantime I violated the pure bash approach of nhc and replaced that whole loop with just this:

HW_SOCKETS=$(lscpu -be | grep -v CPU | awk '{ print $3 }' | sort | uniq | wc -l)
HW_CORES=$(lscpu -be | grep -v CPU | awk '{ print $4 }' | sort | uniq | wc -l)
HW_THREADS=$(lscpu -be | grep -v CPU | awk '{ print $1 }' | sort | uniq | wc -l)
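
On the pure-bash question, one possible direction is a single pass over /proc/cpuinfo that records sockets and cores as associative-array keys. The sketch below is only an illustration under stated assumptions: `count_cpus` is a hypothetical helper and not NHC's code, it needs bash 4 or later, and it assumes an x86-style cpuinfo in which each `processor` stanza lists `physical id` before `core id`. It has not been timed against the existing loop.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NHC): one pass over a cpuinfo-style file.
# Sets HW_SOCKETS, HW_CORES, and HW_THREADS without spawning any subprocess.
count_cpus() {
    local file="${1:-/proc/cpuinfo}"
    local line key val phys=0
    local -A sockets=() cores=()
    HW_THREADS=0

    while IFS= read -r line; do
        [[ "$line" == *:* ]] || continue          # skip lines without a colon
        key="${line%%:*}"                         # text before the first colon
        val="${line#*:}"                          # text after the first colon
        key="${key%"${key##*[![:space:]]}"}"      # trim trailing whitespace
        val="${val#"${val%%[![:space:]]*}"}"      # trim leading whitespace
        case "$key" in
            processor)     HW_THREADS=$(( HW_THREADS + 1 )) ;;
            "physical id") phys="$val"; sockets["$val"]=1 ;;
            # core ids repeat on every socket, so key them by socket as well
            "core id")     cores["$phys:$val"]=1 ;;
        esac
    done < "$file"

    HW_SOCKETS=${#sockets[@]}
    HW_CORES=${#cores[@]}
}
```

Called with no argument, `count_cpus` reads /proc/cpuinfo. On architectures whose cpuinfo has no `physical id` or `core id` fields, the socket and core counts would come out as 0, so a fallback would be needed there.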

I also see the timeout staying at 30s even when my nhc.conf clearly has TIMEOUT=60

Mar 29 07:40:05 rtx-04 slurmd[1157807]: slurmd: debug: attempting to run health_check [/usr/local/bin/node_monitor]
Mar 29 07:40:05 rtx-04 systemd[1]: Starting system activity accounting tool...
Mar 29 07:40:05 rtx-04 systemd[1]: sysstat-collect.service: Succeeded.
Mar 29 07:40:05 rtx-04 systemd[1]: Started system activity accounting tool.
Mar 29 07:40:16 rtx-04 xinetd[3080]: START: nrpe pid=41501 from=::ffff:172.21.21.45
Mar 29 07:40:16 rtx-04 sudo[41505]: nrpe : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/ipmitool -I open sdr elist
Mar 29 07:40:18 rtx-04 xinetd[3080]: EXIT: nrpe status=0 pid=41501 duration=2(sec)
Mar 29 07:40:35 rtx-04 nhc[41547]: Health check failed: Script timed out while executing "check_ps_service -u root -S sshd".

This is nhc 1.4.2

mej commented

Hi Paul!

> I also see the timeout staying at 30s even when my nhc.conf clearly has TIMEOUT=60

@griznog is correct. I tried to explain it in the documentation, but it's easy to miss. :)

There are certain variables, of which TIMEOUT is one, whose values get used by NHC prior to the execution of the instructions in nhc.conf. In order to alter the values of such variables, the assignment must occur in one of 3 places:

  • Settings placed in the system-global config file /etc/sysconfig/nhc are loaded very early in the execution process, so you can set TIMEOUT here. One word of caution, though: this file affects all contexts of NHC, not just the default one. (If you don't use separate NHC contexts, you can ignore this part.)
  • Arbitrary variable settings can be specified on the NHC command line, so appending TIMEOUT=60 to the end of the nhc invocation (e.g., nhc -a TIMEOUT=60) will work too.
  • In the case of TIMEOUT in particular, there is a corresponding command line argument for setting this value, so you also have the option of appending -t 60 to your launch command.

Any of these 3 choices will allow you to set your desired 60-second timeout.
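
The ordering issue can be illustrated with a toy model. This is a sketch, not NHC's actual code: `effective_timeout` is a hypothetical function, the default of 30 is taken from the watchdog behaviour reported in this thread, and the only assumption is that the watchdog reads TIMEOUT once, before nhc.conf is evaluated.

```shell
#!/usr/bin/env bash
# Toy model only; names are hypothetical and do not come from NHC.
# $1: value set early (/etc/sysconfig/nhc, a TIMEOUT=60 argument, or -t); may be empty
# $2: value assigned inside nhc.conf; may be empty
effective_timeout() {
    local early="${1:-}" in_conf="${2:-}"
    local TIMEOUT=30                      # default, as observed in this thread
    local watchdog_limit

    if [[ -n "$early" ]]; then            # early settings are applied first
        TIMEOUT="$early"
    fi
    watchdog_limit="$TIMEOUT"             # the watchdog reads the value here
    if [[ -n "$in_conf" ]]; then          # nhc.conf runs afterwards
        TIMEOUT="$in_conf"                # changes TIMEOUT, not the watchdog
    fi
    echo "$watchdog_limit"
}

effective_timeout "" 60    # prints 30: the assignment in nhc.conf comes too late
effective_timeout 60 ""    # prints 60: the early assignment is honoured
```

Each of the three options listed above corresponds to the first argument in this model: the value is in place before the watchdog looks at it.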

Thanks. I will switch to passing -t 60 in my nhc calls.

mej commented

Based on testing and feedback, #121 has addressed this issue sufficiently to warrant its closure; however, if your own testing or deployment experience(s) differ, please do reopen this one, or a new one, at your discretion! 😃