Nodes (Standard_NC24ads_A100_v4) not configuring anymore for v1.0.40
jlphillipsphd opened this issue · 2 comments
Version
v1.0.40
slurm: 22.05.3
cyclecloud: 2.7.2
In what area(s)?
/area ansible
/area autoscaling
/area configuration
/area cyclecloud
Expected Behavior
Nodes (Standard_NC24ads_A100_v4) should autoscale and be configured properly for workloads.
Actual Behavior
Nodes spawn and report that they are being configured for workloads, but they terminate before the job starts. I can log into a node via SSH and see that it is active, but the configuration step does not complete. I will post the specific error message (a Python error from jetpack) when I get a chance to try again. These nodes were working until about a week ago.
Steps to Reproduce the Problem
Deploy a GPU cluster (Standard_NC24ads_A100_v4) using the versions above.
Can you please check the Slurm logs on the node when this occurs?
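If it helps, here is a minimal sketch for pulling the error lines out of the node's logs in one pass while it is still up. The two paths in `CANDIDATE_LOGS` are assumptions about where slurmd and jetpack write on a CycleCloud-provisioned node, not something confirmed in this issue, so adjust them to your deployment. The function name `collect_errors` and the search pattern are only illustrative.

```python
import re
from collections import deque
from pathlib import Path

# Assumed log locations (not confirmed in this issue); adjust as needed.
CANDIDATE_LOGS = [
    "/var/log/slurmd/slurmd.log",
    "/opt/cycle/jetpack/logs/jetpack.log",
]

# Loose, case-insensitive pattern for lines that usually matter.
ERROR_PATTERN = re.compile(r"error|traceback|exception|fatal", re.IGNORECASE)


def collect_errors(paths, tail=500, pattern=ERROR_PATTERN):
    """Return {path: [matching lines]} from the last `tail` lines of each file.

    Paths that do not exist are skipped silently.
    """
    found = {}
    for p in paths:
        path = Path(p)
        if not path.is_file():
            continue
        with path.open(errors="replace") as fh:
            # deque with maxlen keeps only the final `tail` lines in memory.
            last_lines = deque(fh, maxlen=tail)
        found[str(path)] = [
            line.rstrip("\n") for line in last_lines if pattern.search(line)
        ]
    return found


if __name__ == "__main__":
    for log_path, hits in collect_errors(CANDIDATE_LOGS).items():
        print(f"== {log_path} ({len(hits)} matching lines)")
        for line in hits:
            print(line)
```

Run it with enough privileges to read the logs (typically as root) and paste the output here. Since the node terminates on its own, it is worth running as soon as SSH becomes available.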