mej/nhc

NHC must understand the Slurm node state "resv" (Reserved)

OleHolmNielsen opened this issue · 4 comments

We're installing some new nodes in our Slurm cluster and their fabric cables are not yet in place, so the Node Health Check (NHC) gives an error as expected:

[root@b001 ~]# nhc
ERROR: nhc: Health check failed: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).

However, because we have temporarily set the Slurm state of these nodes to "resv" (Reserved), some warning messages are printed in /var/log/nhc.log:

ERROR: nhc: Health check failed: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
20190409 13:20:33 /usr/libexec/nhc/node-mark-offline b001 check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
/usr/libexec/nhc/node-mark-offline: Not sure how to handle node state "resv" on b001

I would like to request the addition of Slurm state "resv" to the /usr/libexec/nhc/node-mark-offline script as in this diff:

--- /usr/libexec/nhc/node-mark-offline.orig 2015-11-11 22:46:52.000000000 +0100
+++ /usr/libexec/nhc/node-mark-offline 2019-04-09 13:29:48.587902690 +0200
@@ -63,7 +63,7 @@
 OLD_NOTE_LEADER="${LINE[1]}"
 OLD_NOTE="${LINE[*]:2}"
 case "$STATUS" in
-    alloc*|comp*|drain*|drng*|fail*|idle*|maint*|mix*|resume*|undrain*)
+    resv*|alloc*|comp*|drain*|drng*|fail*|idle*|maint*|mix*|resume*|undrain*)
           case "$STATUS" in
               drain*|drng*|fail*|maint*)
                   # If the node is already offline, and there is no old note, and

With this change I do get the expected behavior of NHC, and the nhc.log shows:

ERROR: nhc: Health check failed: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
20190409 13:29:51 /usr/libexec/nhc/node-mark-offline b001 check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
/usr/libexec/nhc/node-mark-offline: Marking resv b001 offline: NHC: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).

See also this Slurm bug report: https://bugs.schedmd.com/show_bug.cgi?id=6816

Thanks,
Ole

I agree with this. This is similar to what we are doing in #81 - expanding the Slurm states understood by NHC.

I think you'll also want to add resv to node-mark-online; otherwise NHC won't know how to put the node back online. I haven't tested this yet, but I had to do the same thing in #81.

One final thought: maybe resv should be added at the end of the list of states checked, rather than at the beginning, to better reflect that it was added later in the history of the code (and because it's a rarer state).
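To make the ordering suggestion concrete, here is a minimal sketch of the same pattern list with resv* appended at the end. The function name is_known_state is hypothetical and is not part of NHC; the pattern list is copied from the diff above, and the surrounding logic is only an illustration.

```shell
#!/bin/bash
# Hypothetical helper (not NHC code): returns 0 if the compact Slurm state
# matches one of the patterns from the patched case statement.
is_known_state() {
    local status="$1"
    case "$status" in
        # resv* placed last, as suggested above
        alloc*|comp*|drain*|drng*|fail*|idle*|maint*|mix*|resume*|undrain*|resv*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Small demonstration with a few state strings.
for s in resv idle 'drain*' pow_up; do
    if is_known_state "$s"; then
        echo "$s: handled"
    else
        echo "$s: not handled"
    fi
done
```

Within a single case branch the alternatives separated by | all select the same branch, so moving resv* to the end changes readability only, not behavior.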

src/common/slurm_protocol_defs.c -> node_state_string_compact() shows all the possible node states that NHC could see. There are still a few possible states not handled by NHC, including POW_UP, POW_DN, POWRNG_DN (new in 19.05), CANC_R (very rare), and NPC. Also, state modifiers like $, @, #, %, ~, and * should be taken into account where necessary. See #36 for an example of this.
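One possible way to deal with the modifier characters is to strip them from the end of the state string before matching. The sketch below is an assumption about how that could be done, not how NHC does it; the function name strip_state_modifiers is hypothetical, and the modifier list is taken from the comment above.

```shell
#!/bin/bash
# Hypothetical helper (not NHC code): removes trailing Slurm state modifier
# characters ($, @, #, %, ~, *) so that the base state can be matched.
strip_state_modifiers() {
    local state="$1"
    local mods='$@#%~*'   # modifier characters listed in the comment above
    # Drop the last character for as long as it is one of the modifiers.
    while [[ -n "$state" && "$mods" == *"${state: -1}"* ]]; do
        state="${state%?}"
    done
    printf '%s\n' "$state"
}

# Small demonstration with a few state strings.
for s in 'idle*' 'mix@' 'resv' 'idle~'; do
    echo "$s -> $(strip_state_modifiers "$s")"
done
```

The existing patterns such as idle* already tolerate trailing characters, so a step like this would matter mainly where the exact base state is compared or logged.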

Hi Michael, thanks a lot for your thorough work! I hope this will make it into the next release of NHC.

mej commented

This issue was fixed by #17 (commit e839b60) and merged via 8aa5575. The fix will be included in NHC 1.4.3 when released.