nhc_hw_gather_data() too slow on large core count
jpecar opened this issue · 5 comments
Hi,
we recently deployed some 2x 64c EPYC servers with all 256 threads enabled. I was surprised to discover that nhc always times out on these machines.
With some poking around I discovered that the /proc/cpuinfo parsing in nhc_hw_gather_data() takes 31.5s to finish. I timed it on different machines: it takes 0.6s on a 16c/32t node, 9.5s on a 64c/128t node and, as I said, 31.5s on a 128c/256t node.
Funnily enough, the watchdog always kicks in at exactly 30s, no matter what I set TIMEOUT to. That's another thing I have to look into.
But for now, are there any pure bash options to speed up that loop? In the meantime I violated the pure bash approach of nhc and replaced that whole loop with a simple lscpu pipeline:
HW_SOCKETS=$(lscpu -be | grep -v CPU | awk '{ print $3 }' | sort | uniq | wc -l)
HW_CORES=$(lscpu -be | grep -v CPU | awk '{ print $4 }' | sort | uniq | wc -l)
HW_THREADS=$(lscpu -be | grep -v CPU | awk '{ print $1 }' | sort | uniq | wc -l)
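On the pure bash question, one possible direction is a single pass over /proc/cpuinfo that records socket and core IDs in associative arrays. The following is only a sketch under stated assumptions: the function name count_cpu_topology is hypothetical and not part of NHC, it needs bash 4 or newer, and it assumes x86-style "physical id" and "core id" fields. It has not been benchmarked against the existing loop, so whether it is actually faster on a 256-thread node would have to be measured.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of NHC): count sockets, cores and threads
# in one pass over a cpuinfo-style file, using only bash builtins.
# Assumes bash >= 4 and x86-style "physical id" / "core id" fields.
count_cpu_topology() {
    local file="${1:-/proc/cpuinfo}" line key val phys=""
    local -A sockets=() cores=()
    HW_SOCKETS=0 HW_CORES=0 HW_THREADS=0

    while IFS= read -r line; do
        # Skip blank lines and anything that is not "key : value".
        [[ $line == *:* ]] || continue
        key="${line%%:*}"
        val="${line#*:}"
        # Trim trailing whitespace from the key, leading from the value.
        key="${key%"${key##*[![:space:]]}"}"
        val="${val#"${val%%[![:space:]]*}"}"

        case "$key" in
            processor)     ((++HW_THREADS)) ;;
            "physical id") phys="$val"; sockets["$val"]=1 ;;
            # Core IDs repeat across sockets, so key on socket:core.
            "core id")     cores["$phys:$val"]=1 ;;
        esac
    done < "$file"

    HW_SOCKETS=${#sockets[@]}
    HW_CORES=${#cores[@]}
}
```

Calling count_cpu_topology with no argument reads /proc/cpuinfo and sets HW_SOCKETS, HW_CORES and HW_THREADS as globals, mirroring the three variables assigned above; passing a file path makes it easy to test against a saved copy of cpuinfo from a large node.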
I also see the timeout staying at 30s even when my nhc.conf clearly has TIMEOUT=60
Mar 29 07:40:05 rtx-04 slurmd[1157807]: slurmd: debug: attempting to run health_check [/usr/local/bin/node_monitor]
Mar 29 07:40:05 rtx-04 systemd[1]: Starting system activity accounting tool...
Mar 29 07:40:05 rtx-04 systemd[1]: sysstat-collect.service: Succeeded.
Mar 29 07:40:05 rtx-04 systemd[1]: Started system activity accounting tool.
Mar 29 07:40:16 rtx-04 xinetd[3080]: START: nrpe pid=41501 from=::ffff:172.21.21.45
Mar 29 07:40:16 rtx-04 sudo[41505]: nrpe : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/ipmitool -I open sdr elist
Mar 29 07:40:18 rtx-04 xinetd[3080]: EXIT: nrpe status=0 pid=41501 duration=2(sec)
Mar 29 07:40:35 rtx-04 nhc[41547]: Health check failed: Script timed out while executing "check_ps_service -u root -S sshd".
This is nhc 1.4.2
Hi Paul!
I also see the timeout staying at 30s even when my nhc.conf clearly has TIMEOUT=60
@griznog is correct. I tried to explain it in the documentation, but it's easy to miss. :)
There are certain variables, of which TIMEOUT is one, whose values get used by NHC prior to the execution of the instructions in nhc.conf. In order to alter the values of such variables, the assignment must occur in one of 3 places:
- Settings placed in the system-global config file /etc/sysconfig/nhc are loaded very early in the execution process, so you can set TIMEOUT here. One word of caution, though: this file affects all contexts of NHC, not just the default one. (If you don't use separate NHC contexts, you can ignore this part.)
- Arbitrary variable settings can be specified on the NHC command line, so appending TIMEOUT=60 to the end of the nhc invocation (e.g., nhc -a TIMEOUT=60) will work too.
- In the case of TIMEOUT in particular, there is a corresponding command line argument for setting this value, so you also have the option of appending -t 60 to your launch command.
Any of these 3 choices will allow you to set your desired 60-second timeout.
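Side by side, the three options look roughly like this. It is a config and command-line fragment assembled from the description above, not a script to run as-is; the nhc path and any other arguments depend on how NHC is launched on your nodes.

```bash
# Option 1: in /etc/sysconfig/nhc (read early; applies to every NHC context)
TIMEOUT=60

# Option 2: variable assignment appended to the nhc command line
nhc -a TIMEOUT=60

# Option 3: the dedicated timeout argument
nhc -t 60
```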
Thanks. I will switch to passing -t 60 in my nhc calls.