Python API: unreachable nodes behaviour
pfrayer opened this issue · 4 comments
Hello,
With the ClusterShell Python API, how can I programmatically know that a node is unreachable?
Here are more details about what I tried:
Let's say I have 2 nodes, server1 and server2. I access these nodes through SSH. server1 is running and reachable, server2 is stopped and unreachable.
With the clush CLI, I can do this:
$ clush -w server[1-2] -- uname -r
server1: 5.8.10-arch1-1
clush: server2 exited with exit code 255
Now if I try to do the same inside a Python app:
#!/usr/bin/env python3
from ClusterShell.Task import NodeSet, task_self

task = task_self()
task.set_info('ssh_options', '-o StrictHostKeyChecking=no')
task.shell('uname -r', nodes=NodeSet('server1,server2'))
task.run()

print("Success :")
for output, nodelist in task.iter_buffers():
    print('{} -> {}'.format(nodelist, output.message().decode()))

print("Error :")
for output, nodelist in task.iter_errors():
    print('{} -> {}'.format(nodelist, output.message().decode()))

print("Timeout :")
# iter_keys_timeout() yields node names, not (buffer, nodelist) pairs
for node in task.iter_keys_timeout():
    print(node)
When executed, the output is the following:
Success :
['server2'] -> ssh: connect to host server2 port 22: No route to host
['server1'] -> 5.8.10-arch1-1
Error :
Timeout :
Here, if I want to programmatically know whether server2 is unreachable, I have to parse the output or play with task.node_retcode(node).
Why isn't server2 listed in iter_errors() or iter_keys_timeout()? Are these 2 iterators dedicated to command errors/timeouts, and not to connect errors/timeouts?
As the documentation states "Iterate over error buffers" and "Iterate over timed out keys", I assumed that connect errors/timeouts would show up in these iterators.
Is there a better way for me to programmatically know if a node is unreachable? Am I doing it wrong?
Thanks for your help :)
iter_errors() returns the standard error output. Replace "Success" with "Stdout" and "Error" with "Stderr" in your example.
Stderr and stdout are merged into stdout by default. See the 'stderr' flag to separate them, if this is what you want.
iter_retcodes() is what you are looking for: https://clustershell.readthedocs.io/en/latest/api/Task.html#ClusterShell.Task.Task.iter_retcodes
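A minimal sketch of that advice, for illustration only: the helper below groups nodes from the (rc, nodes) pairs that iter_retcodes() yields. The names split_by_retcode and run_and_split are hypothetical, and treating 255 as "possibly unreachable" is an assumption based on the clush output above, not something ClusterShell reports by itself.

```python
SSH_ERROR_RC = 255  # assumption: exit code ssh itself uses on connection failure


def split_by_retcode(retcode_groups, timed_out=()):
    """Split nodes into ok / failed / suspect (rc 255) / timeout buckets.

    retcode_groups: iterable of (rc, nodes) pairs, shaped like task.iter_retcodes()
    timed_out: iterable of node names, shaped like task.iter_keys_timeout()
    """
    result = {"ok": [], "failed": {}, "suspect": [], "timeout": sorted(timed_out)}
    for rc, nodes in retcode_groups:
        for node in nodes:
            if rc == 0:
                result["ok"].append(node)
            elif rc == SSH_ERROR_RC:
                result["suspect"].append(node)
            else:
                result["failed"][node] = rc
    return result


def run_and_split(command, nodes):
    """Hypothetical wrapper: run a command with ClusterShell and classify nodes."""
    # Imported here so split_by_retcode() stays usable without ClusterShell.
    from ClusterShell.Task import task_self

    task = task_self()
    # stderr=True keeps stderr out of iter_buffers(); read it via iter_errors().
    task.shell(command, nodes=nodes, stderr=True)
    task.run()
    return split_by_retcode(task.iter_retcodes(), task.iter_keys_timeout())
```

With the two nodes from the question, run_and_split('uname -r', 'server[1-2]') would be expected to put server1 under "ok" and server2 under "suspect".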
Thanks @degremont, it's clearer for me now.
One last question about iter_retcodes(): what if the command I clush has a legit 255 return code? OK, that would be a strange "legit" return code, but imagine.
Then, if I check only iter_retcodes(), I'm not able to distinguish this legit 255 return code from an unreachable SSH node, which will also return 255.
ClusterShell returns whatever the ssh command returns. Under the hood, clush will be running:
ssh server2 uname -r
It is difficult to tell that apart from:
ssh server1 bash -c 'exit 255'
If I were you, I would only check for this error code and for timeouts to consider nodes down.
You could parse the ssh output to look for ssh-specific error messages, but that is probably starting to get overcomplicated.
As you don't know whether 255 is an SSH error or an app error, you could also, when 255 is returned, try to reconnect to those hosts and run a simple "true" command. If that still returns 255, the node is bad.
# run command
# for each node which returns 255:
#     run "true"
#     if retcode is again 255, node is really down.
But that depends on your app; it could be a good or a bad idea.
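The retry idea above could be sketched roughly as follows. The names confirm_down and clustershell_probe are hypothetical, and the decision to treat a node missing from the probe results as down is an assumption of this sketch. The probe is passed in as a callable so the decision logic can be exercised without any SSH access.

```python
def confirm_down(retcodes, probe, ssh_error_rc=255):
    """Return the sorted nodes that returned ssh_error_rc twice in a row.

    retcodes: dict mapping node name -> return code of the real command
    probe: callable taking a list of nodes, running a trivial command on
           them and returning a dict node name -> return code
    """
    suspects = sorted(n for n, rc in retcodes.items() if rc == ssh_error_rc)
    if not suspects:
        return []
    probe_rcs = probe(suspects)
    # Assumption: a node absent from the probe results is treated as down.
    return [n for n in suspects if probe_rcs.get(n, ssh_error_rc) == ssh_error_rc]


def clustershell_probe(nodes):
    """Hypothetical probe: run `true` on the given nodes through ClusterShell."""
    from ClusterShell.NodeSet import NodeSet
    from ClusterShell.Task import task_self

    task = task_self()
    task.shell("true", nodes=NodeSet.fromlist(nodes))
    task.run()
    return {node: rc for rc, keys in task.iter_retcodes() for node in keys}
```

Since the probe runs on the same task object, it seems safest to copy the return codes of the real command into a dict before calling confirm_down(first_rcs, clustershell_probe).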
Thanks for this answer.
I was trying to avoid using several iterators (iter_buffers because I need the commands' stdout, iter_retcodes for unreachable nodes or errors, etc.), but I can deal with it using a function like task.node_retcode(node), for instance, and then use only iter_buffers:
for output, nodelist in task.iter_buffers():
    message = output.message().decode()
    for node in nodelist:
        if task.node_retcode(node) != 255:
            <do stuff with node / message>
        else:
            <manage error>
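The same loop could be packaged as a small helper, sketched here under a few assumptions: per_node_results is a hypothetical name, the buffers are passed in as (bytes, nodes) pairs, and the return code lookup is any callable such as task.node_retcode.

```python
def per_node_results(buffers, retcode_of, ssh_error_rc=255):
    """Build {node: (rc, text)} plus a list of suspect nodes.

    buffers: iterable of (message_bytes, nodes) pairs; with ClusterShell this
             could be ((out.message(), nodes) for out, nodes in task.iter_buffers())
    retcode_of: callable node -> return code, e.g. task.node_retcode
    """
    results, suspects = {}, []
    for message, nodes in buffers:
        text = message.decode(errors="replace")
        for node in nodes:
            rc = retcode_of(node)
            if rc == ssh_error_rc:
                # Possibly unreachable: keep it apart from usable output.
                suspects.append(node)
            else:
                results[node] = (rc, text)
    return results, suspects
```

One caveat, as far as I can tell: a node that printed nothing at all may not show up in iter_buffers(), so a loop over iter_retcodes() may still be needed to catch silent failures.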
I'm closing this issue, as things are much clearer for me now.