Azure/azurehpc

NVIDIA #20210102.1 Pipeline Failure

xpillons opened this issue · 4 comments

Manually reran the pipeline. Gen2 passed.
Gen1 failed with the following error:
Resource : gpumaster - OSProvisioningTimedOut
Message : OS Provisioning for VM 'gpumaster' did not finish in the
allotted time. The VM may still finish provisioning
successfully. Please check provisioning state later. For
details on how to check current provisioning state of
Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and
Linux VMs, refer to https://aka.ms/LinuxVMLifecycle.
None
Allocating NV12s_v3 is taking too long

@xpillons, I got a similar failure today running the NVIDIA pipeline.
https://azurecat.visualstudio.com/hpccat/_build/results?buildId=10563&view=logs&j=40a7dfaa-edcf-57d7-da50-33204f1e0241&t=eef1fa0f-de1b-545a-8af2-256fc8a5c4c1&l=280
The time difference between "build install scripts" and the rsync error was only 2 seconds. The error is a connection refused. I believe we already check that sshd is running before trying to connect, but this does not fix the problem. If there is no quick fix for this (e.g. some additional flag), then maybe it would be worth the time to re-architect this (e.g. replace rsync with something else?). This type of error is occurring too often.

@edwardsp can you have a look to check why the prsync is failing? I can see in the code that ssh is tested upfront, but I'm not 100% sure about the sequence. Otherwise, maybe we should add a retry in the rsync Python wrapper function.
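
For reference, a retry around the rsync call could look roughly like the sketch below. This is not the azurehpc code: `run_with_retry` and its parameters are hypothetical names, and it assumes the wrapper shells out through `subprocess.run` and that any non-zero exit code is worth retrying.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=5, delay=2.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying while it exits non-zero.

    Hypothetical helper, not part of azurehpc. `runner` and `sleep`
    are injectable so the retry logic can be exercised without
    actually calling rsync or waiting.
    Returns the result of the last attempt.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            # Linear backoff: wait a little longer after each failure.
            sleep(delay * attempt)
    return result
```

Usage would be something like `run_with_retry(["rsync", "-a", "scripts/", "user@host:scripts/"])`, where the paths and host are placeholders. A retry only hides the race; it does not explain why the connection is refused in the first place.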

ssh isn't tested before the initial rsync, so I have just opened a PR to add a test for ssh.
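
As an illustration of what such a test could do (a sketch only, not necessarily what the PR implements), the function below polls the SSH port until the server sends its identification banner. `wait_for_ssh` and its defaults are hypothetical; the assumption is that a "connection refused" means sshd is not yet listening, so rsync should only start once a banner beginning with `SSH-` is received.

```python
import socket
import time


def wait_for_ssh(host, port=22, timeout=120.0, interval=2.0):
    """Return True once host:port answers with an SSH banner.

    Hypothetical helper, not part of azurehpc. Returns False if no
    banner is seen before `timeout` seconds have elapsed. At least
    one connection attempt is always made.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=5) as sock:
                sock.settimeout(5)
                banner = sock.recv(64)
                # An SSH server identifies itself with a line starting "SSH-".
                if banner.startswith(b"SSH-"):
                    return True
        except OSError:
            # Covers connection refused, unreachable host, and timeouts.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_ssh("gpumaster")` before the first rsync would block until the port answers or the timeout expires. Checking for the banner rather than just an open port guards against the case where something accepts the connection before sshd is ready.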