SUSE/doc-hpc

Is the explanation of down and down* correct?

@mslacken @e4t Could you check whether this FAQ entry is correct?

Current version

doc-hpc/xml/slurm.xml

Lines 1046 to 1058 in 6b585df

<para> What is the difference between the state <literal>down</literal>
and <literal>down*</literal>? </para>
</question>
<answer>
<para> A <literal>*</literal> shown after a status code means that the
node is not responding. </para>
<para> Thus, when a node is marked as <literal>down*</literal>, this means
that the node is not reachable due to network issues, or its
<literal>slurmd</literal> is not running. </para>
<para> In the <literal>down</literal> state, the node is reachable, but
the node was rebooted unexpectedly, the hardware does not match the
description in <filename>slurm.conf</filename>, or a healthcheck was
configured with the <literal>HealthCheckProgram</literal>. </para>

Original version which explained "down" twice

doc-hpc/xml/slurm.xml

Lines 1327 to 1333 in 38b3183

<para>What is the difference between the state <literal>down</literal> and <literal>down*</literal>?</para>
</question>
<answer>
<para>
When a node is marked as <literal>down</literal> this means that the node is not reachable due to network issues or the <literal>slurmd</literal> is not running. In the <literal>down</literal> state the node is reachable, but the node was rebooted unexpectedly, the hardware does not match the description in <filename>slurm.conf</filename> or a healthcheck configured with the <literal>HealthCheckProgram</literal>.
</para>
</answer>

The FAQ entry is correct.

Thank you! Closing this then.

e4t commented

There are still some issues in there:

  • 'this means that the node is not reachable due to network issues': this -> it
  • 'or its slurmd is not running' -> 'or slurmd on this node is not running'
  • 'In the down state, the node is reachable, but the node was rebooted unexpectedly, the hardware does not match the description in slurm.conf, or a healthcheck was configured with the HealthCheckProgram.':
    'but the node was rebooted' -> 'but either the node was rebooted ...'; also: healthcheck -> health check

Fixed in 8e4eefc
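
For readers who want the distinction in executable form, here is a minimal sketch of the rule as the FAQ entry states it. The function name `describe_node_state` is hypothetical and not part of Slurm or of doc-hpc. It only interprets the trailing `*` discussed in this issue and takes the state string as plain input rather than calling any Slurm tool.

```python
def describe_node_state(state: str) -> str:
    """Explain a node state string such as 'down' or 'down*'.

    Hypothetical helper: it encodes only the rule from the FAQ entry,
    namely that a trailing '*' means the node is not responding.
    """
    responding = not state.endswith("*")
    base = state.rstrip("*").lower()

    if base != "down":
        # Other states: report only whether the node responds.
        return f"{base}: " + ("responding" if responding else "not responding")

    if not responding:
        # down*: the node cannot be contacted at all.
        return ("down*: node is not reachable due to network issues, "
                "or slurmd on this node is not running")

    # down: the node can be contacted but was taken out of service.
    return ("down: node is reachable, but either it was rebooted unexpectedly, "
            "the hardware does not match slurm.conf, "
            "or a health check configured with HealthCheckProgram is involved")


if __name__ == "__main__":
    for s in ("down", "down*", "idle*"):
        print(describe_node_state(s))
```

The wording of the two `down` messages follows the FAQ text quoted above; anything beyond that, such as how other state suffixes should be treated, is outside what this sketch assumes.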