Azure/AKS

[BUG] Intermittent `OSProvisioningTimedOut` errors when creating VMs for AKS cluster

Closed this issue · 8 comments

Describe the bug

Our scenario:

  • We create VMs for customers and join them to their AKS clusters to scale up capacity as needed.
  • We use the AKS community images for the nodes (example gallery for westeurope: /CommunityGalleries/aksubuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/Images/2204gen2containerd). We pick the latest version published there.
  • The issue happens in multiple regions (eastus2, westeurope, centralus), so it does not appear to correlate with a specific region.

The problem is that we sometimes get the error below, and we are not sure how to detect it or react to it properly.
Because it takes ~20 minutes for the error to appear, it can make scaling very slow and lead to downtime for customers.
We get this error around 3-10 times per day (and we add nodes roughly every 1-2 seconds), so the volume is very low. But we still want to understand how to fix it or work around it.

https://management.azure.com/subscriptions/xxxx/providers/Microsoft.Compute/locations/centralus/operations/xxxx
--------------------------------------------------------------------------------
RESPONSE 200: 200 OK
ERROR CODE: OSProvisioningTimedOut
--------------------------------------------------------------------------------
{
  "startTime": "2024-09-18T23:49:48.4348637+00:00",
  "endTime": "2024-09-19T00:10:01.2801821+00:00",
  "status": "Failed",
  "error": {
    "code": "OSProvisioningTimedOut",
    "message": "OS Provisioning for VM 'XXXXXX' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\r\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \r\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \r\n * If you are deploying more than 20 Virtual Machines concurrently, consider moving your custom image to shared image gallery. Please refer to https://aka.ms/movetosig for the same.",
    "target": "0"
  },
  "name": "xxxx"
}
--------------------------------------------------------------------------------
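For anyone wanting to detect this case programmatically, here is a minimal sketch of pulling the error code out of an operation status body shaped like the one above. The function name `provisioning_error_code` and the trimmed-down `sample` payload are illustrative assumptions, not part of any Azure SDK.

```python
import json
from typing import Optional


def provisioning_error_code(operation_body: str) -> Optional[str]:
    """Return the error code of a failed operation, or None if it did not fail.

    Assumes the body has the same shape as the operation status shown in
    this issue: a top-level "status" and an optional "error" object.
    """
    doc = json.loads(operation_body)
    if doc.get("status") != "Failed":
        return None
    return (doc.get("error") or {}).get("code")


# Hypothetical sample, reduced from the response in this issue.
sample = json.dumps(
    {
        "status": "Failed",
        "error": {"code": "OSProvisioningTimedOut", "target": "0"},
        "name": "xxxx",
    }
)

if provisioning_error_code(sample) == "OSProvisioningTimedOut":
    print("OS provisioning timed out; candidate for retry")
```

Note that the HTTP status of the polling call is 200 even though the operation failed, so the check has to look at the body rather than the status line.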

Our questions are:

  • Can we improve something on our side to reduce the likelihood of this error?
  • Can we specify a timeout to avoid waiting 20 minutes before getting this error?
  • Is this a known issue on the Azure side in general?

To Reproduce
Unfortunately, we cannot reproduce the error reliably.

Expected behavior
No error when provisioning, or clear instructions on how to avoid the long wait before the timeout.

Screenshots
N/A

Environment (please complete the following information):
N/A

  • CLI Version - N/A
  • Kubernetes version - multiple
  • CLI Extension version - N/A
  • Browser - N/A

Additional context
N/A

Are you not using the cluster autoscaler or node auto-provisioning to add the nodes?

It sounds like what you are doing is not a supported way to add nodes to an AKS cluster. Have you opened a support ticket? It might be a VM compute issue rather than an AKS issue.

Good call, we opened a support ticket.

It's possible this is not the right repo to log the issue. We are not sure whether the issue comes from the VHD image build (which comes from https://github.com/Azure/AgentBaker perhaps?), from Azure Compute, which provisions the VM, or from the fact that the node is trying to join an AKS cluster.

Issue needing attention of @Azure/aks-leads

Closing this. We added proper timeouts to catch failed OS provisioning, plus retries. According to the Azure support ticket, this looks like an intermittent issue that can sometimes happen when provisioning a VM.
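For readers who land here with the same problem, the timeout-and-retry approach might look roughly like the sketch below. It is not our actual implementation. The callables `start_vm`, `get_state`, and `delete_vm` are hypothetical placeholders for whatever client you use to talk to Azure Compute, the state names `Succeeded` and `Failed` are assumed to match the VM provisioning state you poll, and the 5-minute default timeout is an arbitrary assumption.

```python
import time
from typing import Callable


class ProvisioningTimeout(Exception):
    """Raised when a VM does not reach a terminal state before our own deadline."""


def wait_for_provisioning(
    get_state: Callable[[], str],
    timeout_s: float,
    poll_interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the state is terminal or our own (shorter) deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in ("Succeeded", "Failed"):  # assumed terminal state names
            return state
        if time.monotonic() >= deadline:
            raise ProvisioningTimeout(f"no terminal state after {timeout_s}s")
        sleep(poll_interval_s)


def provision_with_retries(
    start_vm: Callable[[int], str],      # hypothetical: starts a VM, returns its name
    get_state: Callable[[str], str],     # hypothetical: returns the provisioning state
    delete_vm: Callable[[str], None],    # hypothetical: cleans up a failed VM
    max_attempts: int = 3,
    timeout_s: float = 300.0,            # assumption: give up well before ~20 minutes
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Create a VM, replacing it with a fresh one if provisioning fails or stalls."""
    for attempt in range(1, max_attempts + 1):
        name = start_vm(attempt)
        try:
            state = wait_for_provisioning(
                lambda: get_state(name), timeout_s, poll_interval_s, sleep
            )
        except ProvisioningTimeout:
            state = "TimedOut"
        if state == "Succeeded":
            return name
        # Failed or stalled: remove the VM so it does not linger, then retry.
        delete_vm(name)
    raise RuntimeError(f"provisioning failed after {max_attempts} attempts")
```

One design choice to be aware of: the error message says the VM may still finish provisioning successfully, so deleting a stalled VM trades a possibly usable node for a faster, more predictable scale-up.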