cea-hpc/clustershell

Leading Zeros in Nodesets Are Considered Non-Significant by ClusterShell


On our customer's platform, compute node hostnames range from node0001 to node5000.

We accidentally discovered that ClusterShell does not distinguish between the following two nodesets:

  • node[0541-0570] (valid nodeset: these nodes exist)
  • node[541-570] (no leading zeros: these nodes do not exist, so we consider this nodeset invalid)

For instance:

# clush -bw "node[541-570]" "cat /etc/redhat-release"
---------------
node[0541-0570] (31)
---------------
Red Hat Enterprise Linux release 8.6 (Ootpa)

Is this expected behavior? It looks rather confusing to us; we would expect an error message when dealing with the invalid nodeset.
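
For reference, here is a quick probe we could run outside of clush to see how the installed tools treat the two spellings on their own (plain nodeset commands; the counts in the comments are only what we would expect if leading zeros were significant):

# Fold and count each spelling separately
nodeset -f 'node[0541-0570]'
nodeset -f 'node[541-570]'
nodeset -c 'node[0541-0570]'
nodeset -c 'node[541-570]'

# Union of both spellings: 60 distinct names if the padding is significant,
# 30 if the two spellings collapse into the same nodes
nodeset -c 'node[0541-0570]' 'node[541-570]'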

What is invalid about this?

Ah, sorry, I misread the end of what you wrote. nodeset should expand the variant written without leading zeroes into names without leading zeroes, but perhaps there's a bug with the topology settings or something else.

  • What does nodeset -f node[541-570] return?
  • Please share a little bit more about your config (a couple of commands that could help are sketched below).
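
For example, assuming a default system-wide install under /etc/clustershell (adjust the paths if your site relocates the configuration):

# Exact versions in use on the management node
clush --version
python3 -c "import ClusterShell; print(ClusterShell.__version__)"

# System-wide configuration files
ls -l /etc/clustershell/
grep -v '^#' /etc/clustershell/clush.conf
cat /etc/clustershell/topology.conf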

Hi Nicolas, this is NOT expected behavior.

Could you tell us which ClusterShell version you are using? The zero-padding code changed in 1.9 (https://clustershell.readthedocs.io/en/latest/release.html#node-sets-and-node-groups), but that alone does not explain what you see.

I suspect a local configuration on your system is causing that behavior.

Hi Dominique / Aurélien, thank you both for your replies.

Could you tell us which ClusterShell version you are using?

We have been using version 1.8.3 of ClusterShell.

What does nodeset -f node[541-570] return?

# nodeset -f node[541-570]
node[541-570]

I suspect a local configuration on your system is causing that behavior.

Any idea where I should look in the configuration? Which part of it could cause such behavior?
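
In the meantime, we will double-check the usual configuration locations. As far as we understand, ClusterShell may also read per-user files in addition to the system-wide ones, so we plan to look at something like:

# System-wide configuration
ls -l /etc/clustershell/

# Possible per-user overrides (locations we believe are searched)
ls -l "$HOME/.config/clustershell/" 2>/dev/null
ls -l "$HOME/.local/etc/clustershell/" 2>/dev/null

# Relocated configuration directory, if this variable is set
echo "${CLUSTERSHELL_CFGDIR:-not set}"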

Thanks.

Is that behavior happening for any nodeset? Any user?

Could you run clush -d -v -bw "node[541-570]" "cat /etc/redhat-release"

Based on your explanation, it seems node541 does not exist, as the nodes range from node0001 to node5000, right?

What does ssh node541 hostname say?

Is that behavior happening for any nodeset? Any user?

Yes, indeed.

Based on your explanation, it seems node541 does not exist, as the nodes range from node0001 to node5000, right?

Correct: node541 does not exist, but node0541 does, and ClusterShell effectively maps node541 to node0541.
By the way, we just found out that the opposite situation (too many leading zeros) leads to the same outcome. For instance:

# clush -w node00541 "cat /etc/redhat-release"
---------------
node0541
---------------
Red Hat Enterprise Linux release 8.6 (Ootpa)

(node00541 does not exist either, but it is mapped to the existing node0541)
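
To narrow down whether this remapping happens in the nodeset parsing itself or only somewhere in the clush execution path, a simple check with the standard nodeset command would be:

# If padding is significant to the parser, the union below keeps both
# spellings and counts 2 distinct names; if the spellings collapse, it counts 1
nodeset -c node00541 node0541
nodeset -f node00541 node0541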

What does ssh node541 hostname say?

Unsurprisingly:

# ssh node541
ssh: Could not resolve hostname node541: Name or service not known

Could you run clush -d -v -bw "node[541-570]" "cat /etc/redhat-release"

Here is the output of the command (slightly edited to reduce its size) - please note that distributed execution goes through gateways:

DEBUG:root:clush: STARTING DEBUG
clush: enabling tree topology (13 gateways)
DEBUG:ClusterShell.Worker.Tree:stderr=True
DEBUG:ClusterShell.Worker.Tree:TreeWorker._launch on node[541-542] (fanout=64)
DEBUG:ClusterShell.Worker.Tree:next_hops=[('gateway.region2.svc.kube.local', 'node[0541-0542]')]
DEBUG:ClusterShell.Worker.Tree:trying gateway gateway.region2.svc.kube.local to reach node[0541-0542]
DEBUG:ClusterShell.Worker.Tree:_execute_remote gateway=gateway.region2.svc.kube.local cmd=cat /etc/redhat-release targets=node[0541-0542]
DEBUG:ClusterShell.Task:pchannel: creating new channel <ClusterShell.Propagation.PropagationChannel object at 0x7fab2045fac8>
DEBUG:ClusterShell.Propagation:shell nodes=node[0541-0542] timeout=-1 worker=140372962332568 remote=True
DEBUG:ClusterShell.Propagation:send_queued: 0
DEBUG:ClusterShell.Worker.Tree:TreeWorker: _check_ini (0, 0)
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fab209b7e80> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fab209b7e80> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fab209b7e80> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fab209b7e80> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fab209b7e80> not registered
DEBUG:ClusterShell.Propagation:recv: Message CHA (type: CHA, msgid: 2)
DEBUG:ClusterShell.Propagation:channel started (version 1.8.3 on remote gateway)
DEBUG:ClusterShell.Propagation:recv: Message ACK (type: ACK, msgid: 2, ack: 1)
DEBUG:ClusterShell.Propagation:recv_cfg
DEBUG:ClusterShell.Propagation:CTL - connection with gateway fully established
DEBUG:ClusterShell.Propagation:dequeuing sendq: Message CTL (type: CTL, msgid: 0, srcid: 140372962332568, action: shell, target: node[0541-0542])
DEBUG:ClusterShell.Propagation:recv: Message ACK (type: ACK, msgid: 4, ack: 0)
DEBUG:ClusterShell.Propagation:got ack (ACK)
DEBUG:ClusterShell.Propagation:recv: Message OUT (type: OUT, msgid: 5, srcid: 140372962332568, nodes: node[0541-0542])
DEBUG:ClusterShell.Propagation:recv: Message RET (type: RET, msgid: 6, srcid: 140372962332568, retcode: 0, nodes: node[0541-0542])
DEBUG:ClusterShell.Worker.Tree:_on_remote_node_close node0541 0 via gw gateway.region2.svc.kube.local
DEBUG:ClusterShell.Worker.Tree:check_fini 1 2
DEBUG:ClusterShell.Worker.Tree:_on_remote_node_close node0542 1 via gw gateway.region2.svc.kube.local
DEBUG:ClusterShell.Worker.Tree:check_fini 2 2
DEBUG:ClusterShell.Worker.Tree:TreeWorker._check_fini <ClusterShell.Worker.Tree.TreeWorker object at 0x7fab208e4f98> call pchannel_release for gw gateway.region2.svc.kube.local
DEBUG:ClusterShell.Task:pchannel_release gateway.region2.svc.kube.local <ClusterShell.Worker.Tree.TreeWorker object at 0x7fab208e4f98>
DEBUG:ClusterShell.Task:pchannel_release: destroying channel <ClusterShell.Propagation.PropagationChannel object at 0x7fab2045fac8>
DEBUG:ClusterShell.Propagation:ev_close gateway=gateway.region2.svc.kube.local <ClusterShell.Propagation.PropagationChannel object at 0x7fab2045fac8>
DEBUG:ClusterShell.Propagation:ev_close rc=None
Changing max open files soft limit from 1024 to 8192
User interaction: False
Create STDIN worker: False
clush: nodeset=node[541-542] fanout=64 [timeout conn=15.0 cmd=0.0] command="cat /etc/redhat-release"
admin[0-2]
|- gateway.region1.svc.kube.local
|  `- node[0001-0360]
|- gateway.region2.svc.kube.local
|  `- node[0361-0720]
|- gateway.region3.svc.kube.local
|  `- node[0721-1080]
[...]
---------------
node[0541-0542] (2)
---------------
Red Hat Enterprise Linux release 8.6 (Ootpa)
SSHCLIENT: ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes gateway.region2.svc.kube.local python3.6 -m ClusterShell.Gateway -Bu
gateway.region2.svc.kube.local: b'<?xml version="1.0" encoding="utf-8"?>'
gateway.region2.svc.kube.local: b'<channel version="1.8.3"><message type="ACK" msgid="2" ack="1"></message>'
gateway.region2.svc.kube.local: b'<message type="ACK" msgid="4" ack="0"></message>'
gateway.region2.svc.kube.local: b'<message type="OUT" msgid="5" srcid="140372962332568" nodes="node[0541-0542]">gANDLFJlZCBIYXQgRW50ZXJwcmlzZSBMaW51eCByZWxlYXNlIDguNiAoT290cGEpcQAu</message>'
gateway.region2.svc.kube.local: b'<message type="RET" msgid="6" srcid="140372962332568" retcode="0" nodes="node[0541-0542]"></message>'
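
The next_hops line near the top of the trace seems to be where the rewrite happens: the unpadded target node[541-542] is routed to gateway.region2.svc.kube.local as node[0541-0542]. As a rough illustration only (this is not the actual propagation code), the routing presumably comes down to intersecting the target with each gateway's node group from topology.conf, which can be probed with the standard nodeset set operations:

# Intersection of the unpadded target with the padded group served by
# gateway.region2 (node[0361-0720] in the topology above)
nodeset -f 'node[541-542]' -i 'node[0361-0720]'

# Same check with matching padding, for comparison
nodeset -f 'node[0541-0542]' -i 'node[0361-0720]'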

You're using the tree mode with some gateways. Could you dump your topology.conf?

I feel like the code change in 1.9 will fix that. If you can do that easily, try updating your main node and gateways to 1.9.1.

I've a strong feeling that the behavior change will fix it:
https://clustershell.readthedocs.io/en/latest/release.html#node-sets-and-node-groups
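
If you want to double-check after upgrading, you could re-run the earlier probes on the upgraded management node and a gateway, for example:

# Confirm the library version actually in use
python3 -c "import ClusterShell; print(ClusterShell.__version__)"

# Padding probes from earlier in this thread
nodeset -c node00541 node0541
nodeset -f 'node[541-542]' -i 'node[0361-0720]'

# The original command; with 1.9 the unpadded spelling should no longer be
# silently remapped to the padded node names
clush -bw "node[541-570]" "cat /etc/redhat-release"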

After upgrading to version 1.9 on one management node and a couple of gateways, we can confirm that the issue disappears.

It also seems to fix a bug we had not yet reported that affected the clush copy feature.

Thank you very much for your support.