cea-hpc/clustershell

Python API: unreachable nodes behaviour

pfrayer opened this issue · 4 comments

Hello 👋

With the ClusterShell Python API, how can I programmatically know that a node is unreachable?


Here are more details about what I tried:

Let's say I have 2 nodes, server1 and server2. I access these nodes through SSH. server1 is running and reachable, server2 is stopped and unreachable.
With the clush CLI, I can do this:

$ clush -w server[1-2] -- uname -r
server1: 5.8.10-arch1-1
clush: server2 exited with exit code 255

Now if I try to do the same inside a Python app:

#!/usr/bin/env python3
from ClusterShell.NodeSet import NodeSet
from ClusterShell.Task import task_self


task = task_self()
task.set_info('ssh_options', '-o StrictHostKeyChecking=no')

task.shell('uname -r', nodes=NodeSet('server1,server2'))
task.run()

print("Success :")
for output, nodelist in task.iter_buffers():
    print('{} -> {}'.format(nodelist, output.message().decode()))
print("Error :")
for output, nodelist in task.iter_errors():
    print('{} -> {}'.format(nodelist, output.message().decode()))
print("Timeout :")
for node in task.iter_keys_timeout():
    # iter_keys_timeout() yields node names only, not (buffer, nodelist) pairs
    print(node)

When executed, the output is the following:

Success :
['server2'] -> ssh: connect to host server2 port 22: No route to host
['server1'] -> 5.8.10-arch1-1
Error :
Timeout :

Here, if I want to programmatically know whether server2 is unreachable, I have to parse the output or play with task.node_retcode(node).


Why isn't server2 listed in iter_errors() or iter_keys_timeout()? Are these 2 iterators dedicated to command errors/timeouts, and not to connection errors/timeouts?
As the documentation says "Iterate over error buffers" and "Iterate over timed out keys", I assumed connection errors/timeouts would show up in these iterators.

Is there a better way for me to programmatically know if a node is unreachable? Am I doing it wrong?

Thanks for your help :)

iter_errors() returns the standard error output. Replace "Success" with "Stdout", "Error" with "Stderr" in your example.

Stderr and stdout are merged into stdout by default. See the 'stderr' flag if you want to separate them.

iter_retcodes() is what you are looking for: https://clustershell.readthedocs.io/en/latest/api/Task.html#ClusterShell.Task.Task.iter_retcodes
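A minimal sketch of that suggestion, assuming the `stderr=True` keyword of `task.shell()` and the `(rc, nodes)` tuples yielded by `iter_retcodes()`. The helper names `group_by_retcode` and `run_uname` are made up for illustration, and `run_uname` is only defined here, never called, since it needs ClusterShell installed and SSH access to the nodes:

```python
def group_by_retcode(task):
    """Map each return code to the sorted list of nodes that produced it.

    `task` is any object exposing iter_retcodes() and iter_keys_timeout()
    the way ClusterShell's Task does. Timed-out nodes have no return code,
    so they are filed under the key "timeout".
    """
    groups = {}
    for rc, nodes in task.iter_retcodes():
        groups.setdefault(rc, []).extend(str(n) for n in nodes)
    timed_out = [str(n) for n in task.iter_keys_timeout()]
    if timed_out:
        groups["timeout"] = timed_out
    return {key: sorted(nodes) for key, nodes in groups.items()}


def run_uname(nodes="server[1-2]"):
    """Hypothetical driver: needs ClusterShell installed and SSH access."""
    from ClusterShell.Task import task_self

    task = task_self()
    # stderr=True keeps ssh error messages out of iter_buffers()
    task.shell("uname -r", nodes=nodes, stderr=True)
    task.run()
    return group_by_retcode(task)
```

With the two nodes from the question, `run_uname()` would presumably group server1 under 0 and server2 under 255, matching what clush reports on the command line.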

Thanks @degremont, it's clearer for me now.

One last question about iter_retcodes(): what if the command I run through clush has a legit 255 return code? OK, that would be a strange "legit" return code, but imagine.

Then, if I check only iter_retcodes(), I'm not able to distinguish this legit 255 return code from an unreachable SSH node, which will also return 255.

ClusterShell returns whatever the ssh command returns. Under the hood, clush will be running:

ssh server2 uname -r

It is difficult to tell the difference between that and

ssh server1 bash -c 'exit 255'

If I were you, I would only check for this error code and for timeouts to consider nodes down.
You could parse the ssh output to look for ssh-specific error messages, but that is maybe starting to get overcomplex?
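For what it's worth, the output-parsing idea could look roughly like the sketch below. The list of substrings is an assumption based on messages the OpenSSH client commonly prints (the first one appears in the output quoted earlier in this thread); it is not exhaustive, and the wording may vary between versions and locales. The function name `looks_like_ssh_failure` is hypothetical, and the text passed in would be a node's decoded output or error buffer.

```python
# Substrings commonly printed by the OpenSSH client on connection failure.
# This list is an assumption and is not exhaustive.
SSH_FAILURE_HINTS = (
    "ssh: connect to host",
    "ssh: Could not resolve hostname",
    "Connection timed out",
    "Connection refused",
)


def looks_like_ssh_failure(retcode, output_text):
    """Guess whether a 255 return code came from ssh itself.

    Returns True only when the return code is 255 AND the node's output
    contains one of the known ssh error hints.
    """
    if retcode != 255:
        return False
    return any(hint in output_text for hint in SSH_FAILURE_HINTS)
```

A command that legitimately exits with 255 and prints nothing resembling these messages would not be flagged, but this remains a heuristic rather than a guarantee.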

As you don't know whether 255 is an SSH error or an app error, you could also, when 255 is returned, try to reconnect to those hosts and run a simple "true" command. If that still returns 255, the node is bad.

# run command
# for each node which returns 255
    run "true"
    if retcode is again 255, node is really down.

But that depends on your app; it could be a good or a bad idea.
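The pseudocode above could be sketched in Python as follows. This is only an illustration under stated assumptions: `confirm_down` and `clustershell_run` are hypothetical names, the probing logic is kept independent of ClusterShell so it can be exercised without any cluster, and `clustershell_run` assumes that each `task.run()` starts from fresh results.

```python
def confirm_down(run, command, nodes, ssh_error_rc=255):
    """Return (first_pass_retcodes, sorted list of nodes confirmed down).

    `run(command, nodes)` must return a dict mapping node name -> return
    code. Nodes missing from that dict (for example timed-out ones) are
    not re-probed here.
    """
    first = run(command, nodes)
    suspects = [n for n in nodes if first.get(n) == ssh_error_rc]
    if not suspects:
        return first, []
    # Second pass: "true" should exit 0 on any node that is reachable.
    probe = run("true", suspects)
    down = sorted(n for n in suspects if probe.get(n) == ssh_error_rc)
    return first, down


def clustershell_run(command, nodes):
    """Hypothetical `run` backend; needs ClusterShell and SSH access."""
    from ClusterShell.NodeSet import NodeSet
    from ClusterShell.Task import task_self

    task = task_self()
    task.shell(command, nodes=NodeSet.fromlist(nodes))
    task.run()
    return {node: rc for rc, keys in task.iter_retcodes() for node in keys}
```

Usage would then be something like `first, down = confirm_down(clustershell_run, "uname -r", ["server1", "server2"])`. The extra round trip only happens for nodes that returned 255, so the cost stays small when most nodes are healthy.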

Thanks for this answer.

I was trying to avoid using several iterators (iter_buffers because I need the commands' stdout, iter_retcodes for unreachable nodes or errors, etc.), but I can deal with it by using a function like task.node_retcode(node), for instance, and then use only iter_buffers:

for output, nodelist in task.iter_buffers():
    message = output.message().decode()
    for node in nodelist:
        if task.node_retcode(node) != 255:
            pass  # do stuff with node / message
        else:
            pass  # manage error

I'm closing this issue, as things are much clearer for me now 👍