Azure/go-autorest

context timeout 500ms for IMDS healthcheck is aggressive

aramase opened this issue · 8 comments

tempCtx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)

The 500ms request timeout is aggressive. Based on input from @rkammara12 (on the IMDS team), this timeout should be longer, as the current value can generate false-negative errors. Could we make it configurable and/or change the default to a more appropriate value based on the IMDS SLA?
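To illustrate the kind of configurability being asked for, here is a minimal sketch assuming the probe is refactored around a package-level, overridable timeout. The names MSIAvailabilityTimeout and probeIMDS are hypothetical and not part of the library's current API.

// Hypothetical refactor: replace the hard-coded 500ms with an overridable value.
package adal

import (
	"context"
	"net/http"
	"time"
)

// MSIAvailabilityTimeout bounds the IMDS availability probe. Callers could
// raise it toward the IMDS SLA instead of relying on the hard-coded 500ms.
var MSIAvailabilityTimeout = 500 * time.Millisecond

func probeIMDS(ctx context.Context, sender *http.Client, endpoint string) error {
	tempCtx, cancel := context.WithTimeout(ctx, MSIAvailabilityTimeout)
	defer cancel()

	req, err := http.NewRequestWithContext(tempCtx, http.MethodGet, endpoint, nil)
	if err != nil {
		return err
	}
	// IMDS requires the Metadata header on every request.
	req.Header.Set("Metadata", "true")

	resp, err := sender.Do(req)
	if err != nil {
		// A transport error (e.g. context deadline exceeded) is treated as
		// "IMDS not available", which is the false negative described above.
		return err
	}
	resp.Body.Close()
	return nil
}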

@aramase is there a prescribed default that we should be using?

CC @chlowell

@jhendrixMSFT The current SLA for any request to IMDS is 10 seconds; I would suggest a timeout of 2-5 seconds for all health check requests.

I also suggest not performing a health check for every request, which is inefficient. Instead, let client applications decide when they want to perform a health check, and have the library provide an API to do so. This way there is less burden on the IMDS process.
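A sketch of what such a caller-driven health-check API might look like follows; CheckIMDSAvailable and the probe details are assumptions for illustration, not the library's current surface.

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// imdsTokenEndpoint is the well-known IMDS token endpoint on Azure VMs.
const imdsTokenEndpoint = "http://169.254.169.254/metadata/identity/oauth2/token"

// CheckIMDSAvailable probes IMDS once with a caller-supplied timeout.
// Receiving any HTTP response is taken as "IMDS is reachable"; only a
// transport error (timeout, connection refused) is reported as failure.
func CheckIMDSAvailable(ctx context.Context, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, imdsTokenEndpoint, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Metadata", "true")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return fmt.Errorf("IMDS health check failed: %w", err)
	}
	resp.Body.Close()
	return nil
}

func main() {
	// Probe once at startup (or on the application's own schedule) with a
	// timeout inside the 10s SLA, e.g. the 2-5s range suggested above.
	if err := CheckIMDSAvailable(context.Background(), 3*time.Second); err != nil {
		fmt.Println(err)
	}
}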

We're using the Azure Storage SDK (Azure.Storage.Queues specifically) with DefaultAzureCredential for MSI access in a .NET Core app running in AKS, and we're seeing occasional 403s in our nmi container (plus failed storage calls) because of this issue:

server.go:392] failed to get service principal token for pod: <pod name>, error: failed to acquire a token using the MSI VM extension, error: MSI not available
server.go:199] status (403) took 501317096 ns...

@jnazaren Pod Identity can take up to a couple of minutes to become ready for token requests from new pods. The team recently added a feature flag you can use to prevent timeouts due to that delay: https://azure.github.io/aad-pod-identity/docs/configure/feature_flags/#set-retry-after-header-in-nmi-response

@chlowell This is a different issue. Random token requests fail because the IMDS health check fails; the 500ms context timeout generates a false-negative error because IMDS takes longer than that to respond.

We have been hit by IMDS being unavailable and by throttling before as well, so I'd like to understand the impact of sending a health check with every request.
Are these health check requests counted towards the throttling quota? If yes, please make the health check configurable, or leave it to the application to perform it.

Is there any ETA? We had a major downtime today.
I1213 14:57:28.257169 1 server.go:199] status (403) took 586168248 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.45.126
E1213 14:57:28.257303 1 server.go:392] failed to get service principal token for pod: default/enricher-6, error: failed to acquire a token using the MSI VM extension, error: MSI not available
I1213 14:57:28.257334 1 server.go:199] status (403) took 194273295 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.45.126
E1213 14:57:28.257343 1 server.go:392] failed to get service principal token for pod: default/enricher-6, error: failed to acquire a token using the MSI VM extension, error: MSI not available
I1213 14:57:28.257370 1 server.go:199] status (403) took 397841367 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.45.126
E1213 14:57:28.257485 1 server.go:392] failed to get service principal token for pod: default/enricher-6, error: failed to acquire a token using the MSI VM extension, error: MSI not available
I1213 14:57:28.257523 1 server.go:199] status (403) took 586606757 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.45.126
E1213 14:57:28.257598 1 server.go:392] failed to get service principal token for pod: default/enricher-6, error: failed to acquire a token using the MSI VM extension, error: MSI not available
I1213 14:57:28.257749 1 server.go:199] status (403) took 586845961 ns for req.method=GET reg.path=/metadata/identity/oauth2/token

thanks

Fixed in autorest/adal/v0.9.18 and autorest/azure/auth/v0.5.10.
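For consumers, picking up the fix means requiring at least those module versions, for example in go.mod (or the equivalent go get commands):

require (
	github.com/Azure/go-autorest/autorest/adal v0.9.18
	github.com/Azure/go-autorest/autorest/azure/auth v0.5.10
)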