Azure/az-hop

Nodes (Standard_NC24ads_A100_v4) not configuring anymore for v1.0.40

jlphillipsphd opened this issue · 2 comments

Version

v1.0.40
slurm: 22.05.3
cyclecloud: 2.7.2

In what area(s)?

/area ansible
/area autoscaling
/area configuration
/area cyclecloud

Expected Behavior

Nodes (Standard_NC24ads_A100_v4) should autoscale and be configured properly for workloads.

Actual Behavior

Nodes spawn and report that they are being configured for workloads, but ultimately terminate before the job starts. I can log into a node via SSH and see that it is active, but something isn't connecting correctly. I will post back with the specific error message (it was a Python error from jetpack) when I get a chance to try again. These nodes were working up to a week or so ago.

Steps to Reproduce the Problem

Deploy a GPU cluster (Standard_NC24ads_A100_v4) using the versions above.

Wow, phishing in the first few seconds, not suspect at all...

Can you please check the Slurm logs on the node when this occurs?
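
If it helps, here is a rough sketch for pulling the suspicious lines out of the node logs before the node terminates. The two paths are assumptions about where jetpack and slurmd write their logs on a CycleCloud Slurm node and may differ on your image. `CANDIDATE_LOGS`, `find_error_lines`, and `scan_logs` are illustrative names, not part of az-hop or CycleCloud.

```python
import re
from pathlib import Path

# Assumed log locations; adjust these to match the node image.
CANDIDATE_LOGS = [
    "/opt/cycle/jetpack/logs/jetpack.log",  # assumption: jetpack log
    "/var/log/slurmd/slurmd.log",           # assumption: slurmd log
]

# Keywords that usually mark a failure, including Python tracebacks.
ERROR_PATTERN = re.compile(r"error|traceback|exception|fatal", re.IGNORECASE)


def find_error_lines(lines):
    """Return the lines that match ERROR_PATTERN, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if ERROR_PATTERN.search(line)]


def scan_logs(paths=CANDIDATE_LOGS, tail=500):
    """Scan the last `tail` lines of each existing log file for error lines."""
    results = {}
    for name in paths:
        path = Path(name)
        if not path.is_file():
            continue  # skip logs that do not exist on this node
        with path.open(errors="replace") as handle:
            recent = handle.readlines()[-tail:]
        results[name] = find_error_lines(recent)
    return results


if __name__ == "__main__":
    for name, hits in scan_logs().items():
        print(f"== {name}: {len(hits)} suspicious line(s)")
        for hit in hits:
            print(hit)
```

Run it as root on the node while it is still up, since the logs may not be readable by a regular user and disappear once the node terminates.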