[BUG] Intermittent `OSProvisioningTimedOut` errors when creating VMs for AKS cluster

Question

Closed this issue 4 months ago · 8 comments

Describe the bug

Our scenario:

We create VMs for customers and join them to their AKS clusters to scale up capacity as needed.
We use the AKS community images for the nodes (example gallery for westeurope: /CommunityGalleries/aksubuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/Images/2204gen2containerd). We pick the latest version published there.
The issue seems to happen in multiple regions (eastus2, westeurope, centralus) so we don't have a correlation there.

The problem is that sometimes we get the error below and we are not sure how to detect or react to this properly.
As it takes ~20 minutes for the error to appear, it can cause scaling to be very slow and lead to downtime for customers.
We get this error around 3-10 times per day (and we add nodes roughly every 1-2 seconds) so it is very low volume. But we still want to understand how to fix or work around it.

https://management.azure.com/subscriptions/xxxx/providers/Microsoft.Compute/locations/centralus/operations/xxxx
--------------------------------------------------------------------------------
RESPONSE 200: 200 OK
ERROR CODE: OSProvisioningTimedOut
--------------------------------------------------------------------------------
{
  "startTime": "2024-09-18T23:49:48.4348637+00:00",
  "endTime": "2024-09-19T00:10:01.2801821+00:00",
  "status": "Failed",
  "error": {
    "code": "OSProvisioningTimedOut",
    "message": "OS Provisioning for VM 'XXXXXX' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\r\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \r\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \r\n * If you are deploying more than 20 Virtual Machines concurrently, consider moving your custom image to shared image gallery. Please refer to https://aka.ms/movetosig for the same.",
    "target": "0"
  },
  "name": "xxxx"
}
--------------------------------------------------------------------------------
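One way to detect this case programmatically is to inspect the body of the operation result shown above. The following is a minimal sketch, assuming the body has the same shape as in this log (`status`, `error.code`); the function name `classify_operation` and the choice to treat only `OSProvisioningTimedOut` as retryable are illustrative assumptions, not part of any Azure SDK.

```python
import json

# Assumption: only this error code is treated as worth retrying.
RETRYABLE_CODES = {"OSProvisioningTimedOut"}


def classify_operation(body: str) -> str:
    """Map an operation-result JSON body to 'ok', 'retry', 'fail', or 'pending'."""
    op = json.loads(body)
    status = op.get("status")
    if status == "Succeeded":
        return "ok"
    if status == "Failed":
        # "error" may be missing or null, so fall back to an empty dict.
        code = (op.get("error") or {}).get("code")
        return "retry" if code in RETRYABLE_CODES else "fail"
    # Any other status (for example "InProgress") is treated as still running.
    return "pending"
```

A caller could feed each polled response through this function and stop waiting as soon as it returns anything other than `pending`.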

Our questions are:

Can we improve something on our side to reduce the likelihood of this error?
Can we specify a timeout to avoid waiting 20 minutes before getting this error?
Is this a known issue on the Azure side in general?

To Reproduce
Unfortunately, we cannot reproduce the error reliably.

Expected behavior
No error when provisioning, or clear instructions on how to avoid waiting for the long timeout.

Screenshots
N/A

Environment (please complete the following information):
N/A

CLI Version - N/A
Kubernetes version - multiple
CLI Extension version - N/A
Browser - N/A

Additional context
N/A

Answer 1 · 2024-09-19T17:34:10.000Z

Are you not using cluster auto scaler or node auto provision to add the nodes?

Answer 2 · 2024-09-19T18:21:04.000Z

No, we create the nodes and join them to the cluster ourselves. The approach is similar to Karpenter for Azure (and it uses the same VM images), but not identical. VMs are created as single-node VMSS.

Answer 3 · 2024-09-20T12:28:12.000Z

It sounds like what you are doing is not a supported way to add nodes to an AKS cluster. Have you opened a support ticket? It might be a VM compute issue rather than an AKS issue.

Answer 4 · 2024-09-20T13:16:42.000Z

Good call, we opened a support ticket.

It's possible this is not the right repo to log the issue. We are not sure whether the problem comes from the VHD image build (which comes from https://github.com/Azure/AgentBaker perhaps?), from Azure Compute provisioning the VM, or from the fact that the node is trying to join an AKS cluster.

Answer 5 · 2024-10-20T22:30:31.000Z

Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure

Answer 6 · 2024-11-05T04:53:23.000Z

Issue needing attention of @Azure/aks-leads

Answer 7 · 2024-11-20T08:21:49.000Z

Issue needing attention of @Azure/aks-leads

Answer 8 · 2024-12-04T12:53:02.000Z

Closing this. We added proper timeouts and retries to catch failed OS provisioning. According to the Azure support ticket, this looks like an intermittent issue that can sometimes happen when provisioning a VM.
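For readers looking for a starting point, here is a minimal sketch of what a client-side timeout with retries could look like. It is illustrative only and is not the code referred to above: `create`, `poll`, and `delete` are hypothetical callbacks that the caller supplies, the state strings `ok` and `pending` are an assumed convention, and the 300-second deadline and 3 attempts are placeholder values rather than recommendations.

```python
import time
from typing import Callable


class ProvisioningFailed(Exception):
    """Raised when a VM fails or is not ready before our own deadline."""


def wait_until_ready(poll: Callable[[], str], deadline_s: float,
                     interval_s: float = 10.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll until 'ok'; raise on any other terminal state or on deadline."""
    start = clock()
    while True:
        state = poll()
        if state == "ok":
            return
        if state != "pending":
            raise ProvisioningFailed(f"provisioning ended in state {state!r}")
        if clock() - start >= deadline_s:
            raise ProvisioningFailed(f"not ready after {deadline_s}s")
        sleep(interval_s)


def create_with_retries(create: Callable[[], str],
                        poll: Callable[[str], str],
                        delete: Callable[[str], None],
                        attempts: int = 3,
                        deadline_s: float = 300.0,
                        interval_s: float = 10.0,
                        clock: Callable[[], float] = time.monotonic,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Create a VM, give up on it after deadline_s, clean up, and try again."""
    last_error = None
    for _ in range(attempts):
        vm = create()
        try:
            wait_until_ready(lambda: poll(vm), deadline_s, interval_s, clock, sleep)
            return vm
        except ProvisioningFailed as exc:
            last_error = exc
            delete(vm)  # remove the stuck VM before creating a replacement
    raise ProvisioningFailed(f"gave up after {attempts} attempts") from last_error
```

The deadline is enforced by the caller rather than by Azure, so a stuck VM is abandoned and replaced well before the roughly 20-minute platform timeout would report the failure.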