mozilla-releng/balrog

500 Response Code using new /releases api (rarely succeeds)

Closed this issue · 2 comments

So while testing some logic that uses the new releases API, I was consistently getting ISE 500s from Balrog.

https://gist.github.com/Callek/ddcef74533c84accfbe568143b4681a0

Specifically my command was: curl -i -X PUT https://admin-stage.balrog.nonprod.cloudops.mozgcp.net/api/v2/releases/Firefox-75.0b10-build1-No-WNP -H "Content-Type: application/json" --data-binary "@new.Firefox-75.0b10-build1-No-Wnp.json" -H "Authorization: Bearer $balrog_bearer_token"

The interesting thing, of course, is that Ben had some successful attempts at this (though they were not common), and that the timing was bimodal: when wrapping that command with time, most runs took 0.4 to 0.6 seconds, and the longer one took 8.9 seconds (the longer case was rarer for me).

Sentry didn't show any traceback lines either :/


So the immediate issue here is that the event loop that gcloud-aio is trying to use to fetch credentials is closed (because Balrog code is closing it). After further digging today, the overarching issue is that we're ending up in a state where there are multiple event loops going around. In some form or another, Balrog needs to manage one (because we need to wait for the coroutines to finish as part of set_release). Because of that, we have to make sure that the correct event loop is passed along to any aiohttp or gcloud-aio classes, functions, etc. to keep things in sync.
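To make the failure mode concrete, here is a minimal sketch using only the standard library's asyncio. It is not Balrog's code and does not use gcloud-aio: CachedToken, broken_flow and shared_loop_flow are hypothetical names invented for illustration, and CachedToken is only a stand-in for a credential helper that keeps using the event loop it was created with.

```python
import asyncio


class CachedToken:
    """Hypothetical credential helper bound to the loop it was created with."""

    def __init__(self, loop):
        self.loop = loop
        self.refreshes = 0

    async def _refresh(self):
        # Stands in for the HTTP request that would fetch a new access token.
        await asyncio.sleep(0)
        self.refreshes += 1
        return f"token-{self.refreshes}"

    def get(self):
        coro = self._refresh()
        try:
            # Always runs on the remembered loop, even if someone closed it.
            return self.loop.run_until_complete(coro)
        except RuntimeError:
            coro.close()  # avoid a "coroutine was never awaited" warning
            raise


def broken_flow():
    """The helper outlives the loop it was bound to."""
    loop = asyncio.new_event_loop()
    token = CachedToken(loop)
    first = token.get()
    loop.close()  # the application closes the loop after the first request
    try:
        token.get()  # a later refresh still targets the closed loop
    except RuntimeError as exc:
        return first, str(exc)
    return first, None


def shared_loop_flow():
    """One loop is kept open and shared for as long as the helper is in use."""
    loop = asyncio.new_event_loop()
    try:
        token = CachedToken(loop)
        return [token.get(), token.get()]
    finally:
        loop.close()
```

In broken_flow the first fetch succeeds and the second one raises RuntimeError because its loop has been closed, which matches the symptom of rare successes followed by 500s. In shared_loop_flow both fetches succeed because the loop is closed only after the last use.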

I also discovered that the reason I wasn't hitting this locally is that it only happens after the initial gcloud access token expires. Locally, this takes an hour. In deployed environments it appears to happen significantly sooner.

I've got a horribly hacky fix locally, which I'm working to improve before posting.

I believe this has been fixed by #1263.