NVIDIA #20210102.1 Pipeline Failure
xpillons opened this issue · 4 comments
- rsync connection refused for ND40rs_v2 and Gen2 image: https://azurecat.visualstudio.com/hpccat/_build/results?buildId=10393&view=logs&j=55fc7d3f-746a-55f6-bb80-32903d2b68f6&t=c44d176c-8c9a-532a-3ce7-9ebfde9ffc60&l=388
- ssh connection timeout for Standard_NV12s_v3 and Gen1 image. The timeout occurs after the kernel update; LIS is not installed. AZHPC should have failed at step 4, but it failed at step 6: https://azurecat.visualstudio.com/hpccat/_build/results?buildId=10393&view=logs&j=40a7dfaa-edcf-57d7-da50-33204f1e0241&t=eef1fa0f-de1b-545a-8af2-256fc8a5c4c1&l=2655
Manually reran the pipeline. Gen2 passed.
Gen1 failed with this error:
Resource : gpumaster - OSProvisioningTimedOut
Message : OS Provisioning for VM 'gpumaster' did not finish in the
allotted time. The VM may still finish provisioning
successfully. Please check provisioning state later. For
details on how to check current provisioning state of
Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and
Linux VMs, refer to https://aka.ms/LinuxVMLifecycle.
None
Allocating NV12s_v3 is taking too long
@xpillons, I got a similar failure today running the NVIDIA pipeline.
https://azurecat.visualstudio.com/hpccat/_build/results?buildId=10563&view=logs&j=40a7dfaa-edcf-57d7-da50-33204f1e0241&t=eef1fa0f-de1b-545a-8af2-256fc8a5c4c1&l=280
The time difference between "build install scripts" and the rsync error was only 2 seconds. The error is a connection refused. I believe we already check that sshd is running before trying to connect, but that does not fix the problem. If there is no quick fix for this (e.g. some additional flag), then it may be worth the time to re-architect this (e.g. replace rsync with something else). This type of error is occurring too often.
@edwardsp can you take a look and check why the prsync is failing? I can see in the code that ssh is tested up front, but I'm not 100% sure about the sequence. Otherwise, maybe we should add a retry in the rsync Python wrapper function.
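For illustration, a retry around the rsync call could look roughly like the sketch below. This is not the AZHPC wrapper: `run_with_retry`, its parameters, and the injectable `runner`/`sleep` hooks are hypothetical names chosen for this example, and it assumes that a non-zero exit status (such as the one rsync returns on "connection refused") is safe to retry.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=5, delay=2.0, backoff=2.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying with exponential backoff on a non-zero exit status.

    Hypothetical helper: cmd is an argument list such as
    ["rsync", "-a", "src/", "user@host:dst/"].
    """
    last = None
    for attempt in range(1, attempts + 1):
        last = runner(cmd, capture_output=True, text=True)
        if last.returncode == 0:
            return last
        if attempt < attempts:
            # Give sshd on the freshly booted VM time to start listening.
            sleep(delay)
            delay *= backoff
    raise RuntimeError(
        f"command failed after {attempts} attempts: {last.stderr.strip()}"
    )
```

A call such as `run_with_retry(["rsync", "-a", "scripts/", "hpcuser@gpumaster:scripts/"])` would then tolerate a few early refusals; the user name and paths here are placeholders. Retrying on every non-zero status is a simplification, since a real wrapper might want to retry only on connection errors.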
ssh isn't tested before the initial rsync, so I have just opened a PR to add a test for ssh.
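As a rough idea of what such a pre-rsync check can look like (the PR's actual implementation is not shown in this thread), the sketch below polls the SSH port until it accepts a TCP connection. `wait_for_port` and its defaults are hypothetical. An open port only shows that something is listening; it does not prove that sshd will accept a login.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=300.0, interval=5.0):
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper: checks reachability only, not authentication.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError on refusal or timeout.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("gpumaster")` before the first rsync would cover the window where the VM is provisioned but sshd is not yet listening.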