cea-hpc/clustershell

problems on using a topology

BerndKrischok opened this issue · 3 comments

When trying to contact nodes via a topology configuration, I run into problems, especially with a more deeply nested topology, while a flat topology seems to work.
Example topology file 'topo':
[routes]
n061901: n061902,n062001
n061902: n062002,n143501
n062001: n183601,n192201
n062002: n193201,n072602
n143501: n072701,n072702
n183601: n072801,n072802
n192201: n072901,n072902

N=n061901,n061902,n062001,n061902,n062002,n143501,n062001,n183601,n192201,n062002,n193201,n072602,n143501,n072701,n072702,n183601,n072801,n072802,n192201,n072901,n072902
clush -w $N --topology=topo hostname
^Cn061902: n061902
n061901: n061901
n062001: n062001
n062002: n062002
n143501: n143501
n192201: n192201
n183601: n183601
Keyboard interrupt.

(The interrupt is needed because the command hangs forever.)
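When reproducing a hang like this, it can help to bound the run time from a script instead of pressing Ctrl-C. The following is only a sketch: `run_bounded` is a hypothetical helper name, and a short Python child process stands in for the real `clush` invocation so the example runs anywhere.

```python
import subprocess
import sys


def run_bounded(cmd, timeout):
    """Run cmd; return (returncode, stdout), or None if it exceeds timeout seconds."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising, so nothing is left hanging.
        return None
    return done.returncode, done.stdout


# Stand-ins for a command that finishes and one that hangs.
FAST = run_bounded([sys.executable, "-c", "print('ok')"], timeout=10)
SLOW = run_bounded([sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5)

print(FAST)  # result tuple of the command that finished
print(SLOW)  # None, because the timeout expired
```

For the case in this report, `cmd` would be the command line shown above, for example `["clush", "-w", nodes, "--topology=topo", "hostname"]` with `nodes` holding the node list.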

The tree in the debug output looks like this:

n061901
|- n061902
|  |- n062002
|  |  `- n[072602,193201]
|  `- n143501
|     `- n[072701-072702]
`- n062001
   |- n183601
   |  `- n[072801-072802]
   `- n192201
      `- n[072901-072902]

Maybe I am doing something wrong. Does anyone have an idea what is going wrong?
(Every node listed above has been checked for SSH connectivity to every other node.)
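To see how deep a routes file nests, a short standalone script can help. This is a minimal sketch that does not use ClusterShell's own parser: it assumes plain comma-separated node names as in the file above (no bracketed nodeset ranges), and `parse_routes` and `depths` are hypothetical helper names.

```python
import configparser

# The nested topology from this report.
NESTED = """\
[routes]
n061901: n061902,n062001
n061902: n062002,n143501
n062001: n183601,n192201
n062002: n193201,n072602
n143501: n072701,n072702
n183601: n072801,n072802
n192201: n072901,n072902
"""


def parse_routes(text):
    """Return {source: [destinations]} from the [routes] section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {src: [name.strip() for name in dests.split(",")]
            for src, dests in parser["routes"].items()}


def depths(routes, root):
    """Return {node: number of hops from root} for every reachable node."""
    result = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in routes.get(node, []):
            if child not in result:
                result[child] = result[node] + 1
                stack.append(child)
    return result


DEPTHS = depths(parse_routes(NESTED), "n061901")
for name, depth in sorted(DEPTHS.items(), key=lambda item: (item[1], item[0])):
    print("  " * depth + name)
```

For this file the leaf nodes sit three hops below n061901, so reaching them involves two levels of gateways.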

With a flatter topology I see no problems:
[routes]
n061901: n061902,n062001,n062002
n061902: n143501,n183601,n192201,n193201
n062001: n072801,n072802,n072901,n072902
n062002: n072602,n072701,n072702

clush -w $N --topology=topo hostname
n061902: n061902
n061901: n061901
n062002: n062002
n062001: n062001
n193201: n193201
n192201: n192201
n072602: n072602
n072701: n072701
n072702: n072702
n183601: n183601
n143501: n143501
n072902: n072902
n072801: n072801
n072901: n072901
n072802: n072802

Thank you
Bernd

I can reproduce this.

I just changed the node names to something easier for me to follow:

$ clush -d -w d[1-4] hostname
DEBUG:root:clush: STARTING DEBUG
Changing max open files soft limit from 1024 to 8192
User interaction: True
Create STDIN worker: False
clush: enabling tree topology (6 gateways)
clush: nodeset=d[1-4] fanout=64 [timeout conn=15.0 cmd=0.0] command="hostname"
---------------
rootnode
|- a1
|  |- b1
|  |  `- d[1-2]
|  `- b2
|     `- d[3-4]
`- a2
   |- b3
   |  `- d[5-6]
   `- b4
      `- d[7-8]
---------------
DEBUG:ClusterShell.Worker.Tree:stderr=True
DEBUG:ClusterShell.Worker.Tree:TreeWorker._launch on d[1-4] (fanout=64)
DEBUG:ClusterShell.Worker.Tree:next_hops=[('a1', 'd[1-4]')]
DEBUG:ClusterShell.Worker.Tree:trying gateway a1 to reach d[1-4]
DEBUG:ClusterShell.Worker.Tree:_execute_remote gateway=a1 cmd=hostname targets=d[1-4]
DEBUG:ClusterShell.Task:pchannel: creating new channel <ClusterShell.Propagation.PropagationChannel object at 0x7fe888841780>
SSHCLIENT: ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes a1 python3 -m ClusterShell.Gateway -Bu
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fe888989bd0> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fe888989bd0> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fe888989bd0> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fe888989bd0> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fe888989bd0> not registered
DEBUG:ClusterShell.Propagation:shell nodes=d[1-4] timeout=-1 worker=140636699497168 remote=True
DEBUG:ClusterShell.Propagation:send_queued: 0
DEBUG:ClusterShell.Worker.Tree:TreeWorker: _check_ini (0, 0)
a1: b'<?xml version="1.0" encoding="utf-8"?>'
a1: b'<channel version="1.8.4"><message type="ACK" msgid="2" ack="0"></message>'
DEBUG:ClusterShell.Propagation:recv: Message CHA (type: CHA, msgid: 2)
DEBUG:ClusterShell.Propagation:channel started (version 1.8.4 on remote gateway)
DEBUG:ClusterShell.Propagation:recv: Message ACK (type: ACK, msgid: 2, ack: 0)
DEBUG:ClusterShell.Propagation:recv_cfg
DEBUG:ClusterShell.Propagation:CTL - connection with gateway fully established
DEBUG:ClusterShell.Propagation:dequeuing sendq: Message CTL (type: CTL, msgid: 1, srcid: 140636699497168, action: shell, target: d[1-4])
a1: b'<message type="ACK" msgid="8" ack="1"></message>'
DEBUG:ClusterShell.Propagation:recv: Message ACK (type: ACK, msgid: 8, ack: 1)
DEBUG:ClusterShell.Propagation:got ack (ACK)
DEBUG:ClusterShell.Propagation:ev_close gateway=a1 <ClusterShell.Propagation.PropagationChannel object at 0x7fe888841780>
DEBUG:ClusterShell.Propagation:ev_close rc=0
$ cat topology.conf
[routes]
rootnode: a1,a2
a1: b1,b2
a2: b3,b4
b1: d1,d2
b2: d3,d4
b3: d5,d6
b4: d7,d8

With this I have no problem reaching nodes two levels deep (b[1-4]), but I can't seem to reach any of the d nodes three levels deep. The debug output shows the a1 gateway closing too early; presumably it thinks it is done after the b-level ACK when it shouldn't...

Running with CLUSTERSHELL_GW_LOG_LEVEL=debug, here are the logs of the first-level gateway (a1):

2022-02-19 11:35:55,545 ClusterShell.Gateway DEBUG Starting task
2022-02-19 11:35:55,545 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,545 ClusterShell.Gateway DEBUG ready to accept channel communication
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG handling incoming message: Message CHA (type: CHA, msgid: 0)
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG got start message Message CHA (type: CHA, msgid: 0)
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG channel started (version 1.8.3 on remote end)
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG handling incoming message: Message CFG (type: CFG, msgid: 0, gateway: a1)
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG got channel configuration
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG using gateway node name a1
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG gw name a1 does not match system hostname myhostname
2022-02-19 11:35:55,547 ClusterShell.Gateway DEBUG decoded propagation tree
2022-02-19 11:35:55,547 ClusterShell.Gateway DEBUG 
myhostname
|- a1
|  |- b1
|  |  `- d[1-2]
|  `- b2
|     `- d[3-4]
`- a2
   |- b3
   |  `- d[5-6]
   `- b4
      `- d[7-8]

2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG handling incoming message: Message CTL (type: CTL, msgid: 1, srcid: 139997685073808, action: shell, target: d[1,4])
2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG GatewayChannel._state_ctl
2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG decoded gw invoke (PYTHONPATH=/home/shared/clustershell/lib CLUSTERSHELL_GW_LOG_LEVEL=debug python3 -m ClusterShell.Gateway -Bu)
2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG assigning task infos ({'debug': True, 'fanout': 64, 'grooming_delay': 0.25, 'connect_timeout': 15.0, 'command_timeout': 0.0})
2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG inherited fanout value=64
2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG launching execution/enter gathering state
2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG TreeWorkerResponder initialized grooming=0.250000
2022-02-19 11:35:55,550 ClusterShell.Worker.Tree DEBUG stderr=True
2022-02-19 11:35:55,550 ClusterShell.Worker.Tree DEBUG TreeWorker._launch on d[1,4] (fanout=64)
2022-02-19 11:35:55,551 ClusterShell.Worker.Tree DEBUG next_hops=[('b1', 'd1'), ('b2', 'd4')]
2022-02-19 11:35:55,551 ClusterShell.Worker.Tree DEBUG trying gateway b1 to reach d1
2022-02-19 11:35:55,551 ClusterShell.Worker.Tree DEBUG _execute_remote gateway=b1 cmd=hostname targets=d1
2022-02-19 11:35:55,551 ClusterShell.Task DEBUG pchannel: creating new channel <ClusterShell.Propagation.PropagationChannel object at 0x7f2065ad1d50>
2022-02-19 11:35:55,552 ClusterShell.Gateway DEBUG SSHCLIENT: ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes b1 PYTHONPATH=/home/shared/clustershell/lib CLUSTERSHELL_GW_LOG_LEVEL=debug python3 -m ClusterShell.Gateway -Bu
2022-02-19 11:35:55,553 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,554 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,554 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,554 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,555 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,555 ClusterShell.Propagation DEBUG shell nodes=d1 timeout=-1 worker=139777121523344 remote=True
2022-02-19 11:35:55,555 ClusterShell.Propagation DEBUG send_queued: 0
2022-02-19 11:35:55,555 ClusterShell.Worker.Tree DEBUG trying gateway b2 to reach d4
2022-02-19 11:35:55,555 ClusterShell.Worker.Tree DEBUG _execute_remote gateway=b2 cmd=hostname targets=d4
2022-02-19 11:35:55,556 ClusterShell.Task DEBUG pchannel: creating new channel <ClusterShell.Propagation.PropagationChannel object at 0x7f2065ad2440>
2022-02-19 11:35:55,556 ClusterShell.Gateway DEBUG SSHCLIENT: ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes b2 PYTHONPATH=/home/shared/clustershell/lib CLUSTERSHELL_GW_LOG_LEVEL=debug python3 -m ClusterShell.Gateway -Bu
2022-02-19 11:35:55,557 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,558 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,558 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,558 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,559 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,559 ClusterShell.Propagation DEBUG shell nodes=d4 timeout=-1 worker=139777121523344 remote=True
2022-02-19 11:35:55,560 ClusterShell.Propagation DEBUG send_queued: 0
2022-02-19 11:35:55,560 ClusterShell.Worker.Tree DEBUG TreeWorker: _check_ini (0, 0)
2022-02-19 11:35:55,560 ClusterShell.Gateway DEBUG TreeWorkerResponder: ev_start
2022-02-19 11:35:55,560 ClusterShell.Gateway DEBUG TreeWorker scheduled
2022-02-19 11:35:55,891 ClusterShell.Gateway DEBUG b1: b'<?xml version="1.0" encoding="utf-8"?>'
2022-02-19 11:35:55,893 ClusterShell.Gateway DEBUG b1: b'<channel version="1.8.3"><message type="ACK" msgid="2" ack="4"></message>'
2022-02-19 11:35:55,893 ClusterShell.Propagation DEBUG recv: Message CHA (type: CHA, msgid: 9)
2022-02-19 11:35:55,894 ClusterShell.Propagation DEBUG channel started (version 1.8.3 on remote gateway)
2022-02-19 11:35:55,894 ClusterShell.Propagation DEBUG recv: Message ACK (type: ACK, msgid: 2, ack: 4)
2022-02-19 11:35:55,894 ClusterShell.Propagation DEBUG recv_cfg
2022-02-19 11:35:55,894 ClusterShell.Propagation DEBUG CTL - connection with gateway fully established
2022-02-19 11:35:55,894 ClusterShell.Propagation DEBUG dequeuing sendq: Message CTL (type: CTL, msgid: 5, srcid: 139777121523344, action: shell, target: d1)
2022-02-19 11:35:55,900 ClusterShell.Gateway DEBUG b1: b'<message type="ACK" msgid="4" ack="5"></message>'
2022-02-19 11:35:55,901 ClusterShell.Propagation DEBUG recv: Message ACK (type: ACK, msgid: 4, ack: 5)
2022-02-19 11:35:55,901 ClusterShell.Propagation DEBUG got ack (ACK)
2022-02-19 11:35:55,933 ClusterShell.Gateway DEBUG b2: b'<?xml version="1.0" encoding="utf-8"?>'
2022-02-19 11:35:55,935 ClusterShell.Gateway DEBUG b2: b'<channel version="1.8.3"><message type="ACK" msgid="2" ack="6"></message>'
2022-02-19 11:35:55,935 ClusterShell.Propagation DEBUG recv: Message CHA (type: CHA, msgid: 12)
2022-02-19 11:35:55,935 ClusterShell.Propagation DEBUG channel started (version 1.8.3 on remote gateway)
2022-02-19 11:35:55,935 ClusterShell.Propagation DEBUG recv: Message ACK (type: ACK, msgid: 2, ack: 6)
2022-02-19 11:35:55,936 ClusterShell.Propagation DEBUG recv_cfg
2022-02-19 11:35:55,936 ClusterShell.Propagation DEBUG CTL - connection with gateway fully established
2022-02-19 11:35:55,936 ClusterShell.Propagation DEBUG dequeuing sendq: Message CTL (type: CTL, msgid: 7, srcid: 139777121523344, action: shell, target: d4)
2022-02-19 11:35:55,942 ClusterShell.Gateway DEBUG b2: b'<message type="ACK" msgid="4" ack="7"></message>'
2022-02-19 11:35:55,943 ClusterShell.Propagation DEBUG recv: Message ACK (type: ACK, msgid: 4, ack: 7)
2022-02-19 11:35:55,943 ClusterShell.Propagation DEBUG got ack (ACK)
2022-02-19 11:35:56,169 ClusterShell.Gateway DEBUG b2: b'<message type="OUT" msgid="5" srcid="139777121523344" nodes="d4">gASVGAAAAAAAAABDFGZlbnJpci5jb2Rld3JlY2sub3JnlC4=</message>'
2022-02-19 11:35:56,169 ClusterShell.Propagation DEBUG recv: Message OUT (type: OUT, msgid: 5, srcid: 139777121523344, nodes: d4)
2022-02-19 11:35:56,170 ClusterShell.Gateway DEBUG b2: b'<message type="RET" msgid="6" srcid="139777121523344" retcode="0" nodes="d4"></message>'
2022-02-19 11:35:56,170 ClusterShell.Propagation DEBUG recv: Message RET (type: RET, msgid: 6, srcid: 139777121523344, retcode: 0, nodes: d4)
2022-02-19 11:35:56,170 ClusterShell.Worker.Tree DEBUG _on_remote_node_close d4 0 via gw b2
2022-02-19 11:35:56,171 ClusterShell.Worker.Tree DEBUG check_fini 1 2
2022-02-19 11:35:56,171 ClusterShell.Worker.Tree DEBUG TreeWorker._check_fini <ClusterShell.Worker.Tree.TreeWorker object at 0x7f2065ad1690> call pchannel_release for gw b2
2022-02-19 11:35:56,171 ClusterShell.Task DEBUG pchannel_release b2 <ClusterShell.Worker.Tree.TreeWorker object at 0x7f2065ad1690>
2022-02-19 11:35:56,171 ClusterShell.Task DEBUG pchannel_release: destroying channel <ClusterShell.Propagation.PropagationChannel object at 0x7f2065ad2440>
2022-02-19 11:35:56,172 ClusterShell.Propagation DEBUG ev_close gateway=b2 <ClusterShell.Propagation.PropagationChannel object at 0x7f2065ad2440>
2022-02-19 11:35:56,172 ClusterShell.Propagation DEBUG ev_close rc=None
2022-02-19 11:35:56,172 ClusterShell.Propagation DEBUG error on gateway b2 (setup=True)
2022-02-19 11:35:56,173 ClusterShell.Gateway DEBUG GatewayChannel: ev_close
2022-02-19 11:35:56,176 ClusterShell.Engine.Engine DEBUG Traceback (most recent call last):
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 723, in run
    self.runloop(timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/EPoll.py", line 157, in runloop
    client._handle_read(sname)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 192, in _handle_read
    node_msgline(key, msg, sname)  # handle full msg line
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 166, in _on_nodeset_msgline
    self.worker._on_node_msgline(nodes, msg, sname)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 279, in _on_node_msgline
    self.eh.ev_read(self, node, sname, msg)
  File "/home/shared/clustershell/lib/ClusterShell/Communication.py", line 258, in ev_read
    self.recv(msg)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 270, in recv
    self.recv_ctl(msg)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 376, in recv_ctl
    metaworker._on_remote_node_close(node, rc, self.gateway)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Tree.py", line 446, in _on_remote_node_close
    self._check_fini(gateway)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Tree.py", line 499, in _check_fini
    self.task._pchannel_release(gateway, self)
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 1367, in _pchannel_release
    chanworker.abort()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 360, in abort
    client.abort()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/EngineClient.py", line 438, in abort
    engine.remove(self, abort=True)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 495, in remove
    self._remove(client, abort, did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 483, in _remove
    client._close(abort=abort, timeout=did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 142, in _close
    self.worker._check_fini()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 383, in _check_fini
    _eh_sigspec_invoke_compat(self.eh.ev_close, 2, self,
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 52, in _eh_sigspec_invoke_compat
    return method(*args)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 411, in ev_close
    self.task.router.mark_unreachable(gateway)
AttributeError: 'NoneType' object has no attribute 'mark_unreachable'

2022-02-19 11:35:56,177 ClusterShell.Propagation DEBUG ev_close gateway=b1 <ClusterShell.Propagation.PropagationChannel object at 0x7f2065ad1d50>
2022-02-19 11:35:56,177 ClusterShell.Propagation DEBUG ev_close rc=None
2022-02-19 11:35:56,177 ClusterShell.Propagation DEBUG error on gateway b1 (setup=True)
2022-02-19 11:35:56,177 ClusterShell.Gateway ERROR Gateway failure: 'NoneType' object has no attribute 'mark_unreachable'
Traceback (most recent call last):
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 723, in run
    self.runloop(timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/EPoll.py", line 157, in runloop
    client._handle_read(sname)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 192, in _handle_read
    node_msgline(key, msg, sname)  # handle full msg line
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 166, in _on_nodeset_msgline
    self.worker._on_node_msgline(nodes, msg, sname)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 279, in _on_node_msgline
    self.eh.ev_read(self, node, sname, msg)
  File "/home/shared/clustershell/lib/ClusterShell/Communication.py", line 258, in ev_read
    self.recv(msg)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 270, in recv
    self.recv_ctl(msg)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 376, in recv_ctl
    metaworker._on_remote_node_close(node, rc, self.gateway)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Tree.py", line 446, in _on_remote_node_close
    self._check_fini(gateway)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Tree.py", line 499, in _check_fini
    self.task._pchannel_release(gateway, self)
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 1367, in _pchannel_release
    chanworker.abort()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 360, in abort
    client.abort()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/EngineClient.py", line 438, in abort
    engine.remove(self, abort=True)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 495, in remove
    self._remove(client, abort, did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 483, in _remove
    client._close(abort=abort, timeout=did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 142, in _close
    self.worker._check_fini()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 383, in _check_fini
    _eh_sigspec_invoke_compat(self.eh.ev_close, 2, self,
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 52, in _eh_sigspec_invoke_compat
    return method(*args)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 411, in ev_close
    self.task.router.mark_unreachable(gateway)
AttributeError: 'NoneType' object has no attribute 'mark_unreachable'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 772, in _resume
    self._run(self.timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 400, in _run
    self._engine.run(timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 732, in run
    self.clear()
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 534, in clear
    self._remove(client, True, did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 483, in _remove
    client._close(abort=abort, timeout=did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 452, in _close
    _eh_sigspec_invoke_compat(self.worker.eh.ev_close, 2, self, timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 52, in _eh_sigspec_invoke_compat
    return method(*args)
  File "/home/shared/clustershell/lib/ClusterShell/Gateway.py", line 311, in ev_close
    self.worker.task.abort()
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 920, in abort
    self._abort(kill)
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 206, in taskfunc
    return f(task, *fargs, **kwargs)
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 909, in _abort
    self._engine.abort(kill)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 775, in abort
    raise EngineAbortException(kill)
ClusterShell.Engine.Engine.EngineAbortException

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/shared/clustershell/lib/ClusterShell/Gateway.py", line 368, in gateway_main
    task.resume()
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 809, in resume
    self._resume()
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 776, in _resume
    self._terminate(exc.kill)
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 945, in _terminate
    self._engine.clear(clear_ports=kill)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 534, in clear
    self._remove(client, True, did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 483, in _remove
    client._close(abort=abort, timeout=did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 142, in _close
    self.worker._check_fini()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 383, in _check_fini
    _eh_sigspec_invoke_compat(self.eh.ev_close, 2, self,
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 52, in _eh_sigspec_invoke_compat
    return method(*args)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 411, in ev_close
    self.task.router.mark_unreachable(gateway)
AttributeError: 'NoneType' object has no attribute 'mark_unreachable'
2022-02-19 11:35:56,179 ClusterShell.Gateway DEBUG -------- The End --------

And here are the logs of one of the deeper gateways (b2):

2022-02-19 11:35:55,932 ClusterShell.Gateway DEBUG Starting task
2022-02-19 11:35:55,932 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fc625aa38e0> not registered
2022-02-19 11:35:55,933 ClusterShell.Gateway DEBUG ready to accept channel communication
2022-02-19 11:35:55,933 ClusterShell.Gateway DEBUG handling incoming message: Message CHA (type: CHA, msgid: 0)
2022-02-19 11:35:55,933 ClusterShell.Gateway DEBUG got start message Message CHA (type: CHA, msgid: 0)
2022-02-19 11:35:55,934 ClusterShell.Gateway DEBUG channel started (version 1.8.3 on remote end)
2022-02-19 11:35:55,934 ClusterShell.Gateway DEBUG handling incoming message: Message CFG (type: CFG, msgid: 6, gateway: b2)
2022-02-19 11:35:55,934 ClusterShell.Gateway DEBUG got channel configuration
2022-02-19 11:35:55,934 ClusterShell.Gateway DEBUG using gateway node name b2
2022-02-19 11:35:55,934 ClusterShell.Gateway DEBUG gw name b2 does not match system hostname myhostname
2022-02-19 11:35:55,934 ClusterShell.Gateway DEBUG decoded propagation tree
2022-02-19 11:35:55,935 ClusterShell.Gateway DEBUG 
myhostname
|- a1
|  |- b1
|  |  `- d[1-2]
|  `- b2
|     `- d[3-4]
`- a2
   |- b3
   |  `- d[5-6]
   `- b4
      `- d[7-8]

2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG handling incoming message: Message CTL (type: CTL, msgid: 7, srcid: 139777121523344, action: shell, target: d4)
2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG GatewayChannel._state_ctl
2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG decoded gw invoke (PYTHONPATH=/home/shared/clustershell/lib CLUSTERSHELL_GW_LOG_LEVEL=debug python3 -m ClusterShell.Gateway -Bu)
2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG assigning task infos ({'debug': True, 'fanout': 64, 'grooming_delay': 0.25, 'connect_timeout': 15.0, 'command_timeout': 0.0})
2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG inherited fanout value=64
2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG launching execution/enter gathering state
2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG TreeWorkerResponder initialized grooming=0.250000
2022-02-19 11:35:55,938 ClusterShell.Worker.Tree DEBUG stderr=True
2022-02-19 11:35:55,939 ClusterShell.Worker.Tree DEBUG TreeWorker._launch on d4 (fanout=64)
2022-02-19 11:35:55,939 ClusterShell.Worker.Tree DEBUG next_hops=[('d4', 'd4')]
2022-02-19 11:35:55,939 ClusterShell.Worker.Tree DEBUG task.shell cmd=hostname source=None nodes=d4 timeout=-1 remote=True
2022-02-19 11:35:55,939 ClusterShell.Gateway DEBUG SSHCLIENT: ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes d4 hostname
2022-02-19 11:35:55,941 ClusterShell.Worker.Tree DEBUG MetaWorkerEventHandler: ev_start
2022-02-19 11:35:55,941 ClusterShell.Worker.Tree DEBUG TreeWorker: _check_ini (1, 1)
2022-02-19 11:35:55,941 ClusterShell.Gateway DEBUG TreeWorkerResponder: ev_start
2022-02-19 11:35:55,942 ClusterShell.Worker.Tree DEBUG added child worker <ClusterShell.Worker.Ssh.WorkerSsh object at 0x7fc625786110> count=1
2022-02-19 11:35:55,942 ClusterShell.Worker.Tree DEBUG TreeWorker: _check_ini (1, 1)
2022-02-19 11:35:55,942 ClusterShell.Gateway DEBUG TreeWorkerResponder: ev_start
2022-02-19 11:35:55,942 ClusterShell.Gateway DEBUG TreeWorker scheduled
2022-02-19 11:35:56,166 ClusterShell.Gateway DEBUG d4: b'myhostname'
2022-02-19 11:35:56,167 ClusterShell.Worker.Tree DEBUG _on_node_close d4 0 (0)
2022-02-19 11:35:56,168 ClusterShell.Worker.Tree DEBUG MetaWorkerEventHandler: ev_close, timedout=False
2022-02-19 11:35:56,168 ClusterShell.Worker.Tree DEBUG check_fini 1 1
2022-02-19 11:35:56,168 ClusterShell.Gateway DEBUG TreeWorkerResponder: ev_close timedout=False
2022-02-19 11:35:56,168 ClusterShell.Gateway DEBUG iter(stdout): d4: 20 bytes
2022-02-19 11:35:56,169 ClusterShell.Gateway DEBUG iter(rc): d4: rc=0
2022-02-19 11:35:56,172 ClusterShell.Gateway DEBUG GatewayChannel: ev_close
2022-02-19 11:35:56,172 ClusterShell.Gateway DEBUG Task performed
2022-02-19 11:35:56,173 ClusterShell.Gateway DEBUG -------- The End --------

So from the second log we can see that the command actually ran successfully; its result just couldn't be passed back up because of the gateway failure above.
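The same conclusion can be read mechanically from the `recv: Message ...` lines that a1 logged. The sketch below parses three such lines copied from the a1 log and collects the return code reported for each node; the regular expression and the `parse_recv` helper are assumptions based only on the line format visible in this thread, not on any documented log format.

```python
import re

# Lines copied from the a1 gateway log above.
LOG = """\
2022-02-19 11:35:55,901 ClusterShell.Propagation DEBUG recv: Message ACK (type: ACK, msgid: 4, ack: 5)
2022-02-19 11:35:56,169 ClusterShell.Propagation DEBUG recv: Message OUT (type: OUT, msgid: 5, srcid: 139777121523344, nodes: d4)
2022-02-19 11:35:56,170 ClusterShell.Propagation DEBUG recv: Message RET (type: RET, msgid: 6, srcid: 139777121523344, retcode: 0, nodes: d4)
"""

RECV = re.compile(r"recv: Message (\w+) \((.*)\)$")


def parse_recv(line):
    """Return (message kind, {field: value}) for a recv line, else None."""
    match = RECV.search(line)
    if match is None:
        return None
    # Fields are separated by ", " and each one looks like "key: value".
    fields = dict(item.split(": ", 1) for item in match.group(2).split(", "))
    return match.group(1), fields


PARSED = [parsed for parsed in map(parse_recv, LOG.splitlines()) if parsed]
RETCODES = {fields["nodes"]: int(fields["retcode"])
            for kind, fields in PARSED if kind == "RET"}
print(RETCODES)
```

A return code of 0 for d4 in the a1 log means the result did reach a1 before that gateway failed.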

I've fixed that error in
https://review.gerrithub.io/c/cea-hpc/clustershell/+/533465
and it now works normally. There might be some missing fallbacks if a lower-level gateway is unreachable, however; that would require some testing...
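The traceback in the a1 log ends in `self.task.router.mark_unreachable(gateway)` with `router` being `None`. The sketch below reproduces only that failure mode with hypothetical stand-in classes (`Router`, `Task`, `close_unguarded`, `close_guarded` are not ClusterShell code), and the guard it shows is an assumption for illustration. This thread does not say how the actual change in the review linked above addresses the error.

```python
class Router:
    """Hypothetical stand-in that records unreachable gateways."""

    def __init__(self):
        self.unreachable = set()

    def mark_unreachable(self, gateway):
        self.unreachable.add(gateway)


class Task:
    """Hypothetical stand-in for a task whose router may be unset."""

    def __init__(self, router=None):
        self.router = router


def close_unguarded(task, gateway):
    # Mirrors the call at the bottom of the traceback; fails if router is None.
    task.router.mark_unreachable(gateway)


def close_guarded(task, gateway):
    # Illustrative guard (an assumption, not the actual fix).
    if task.router is None:
        return False
    task.router.mark_unreachable(gateway)
    return True


gateway_task = Task(router=None)
try:
    close_unguarded(gateway_task, "b2")
    ERROR = None
except AttributeError as exc:
    ERROR = str(exc)
print(ERROR)

SKIPPED = close_guarded(gateway_task, "b2")

root_task = Task(router=Router())
MARKED = close_guarded(root_task, "b2")
print(SKIPPED, MARKED, root_task.router.unreachable)
```

A guard like this would only silence the exception; whether a cleanly released channel should be treated as a gateway error at all is a separate question that the sketch does not answer.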

Hi Dominique,

Many thanks for this fix. It works, great.
Bernd