mozilla-releng/balrog

/users/<username> sometimes hits auth0 rate limit errors

Opened this issue · 8 comments

Presumably this is happening because of the call we make to auth0's /userinfo endpoint, which is rate limited to 5 requests per minute with bursts of up to 10 requests per user id (from https://auth0.com/docs/policies/rate-limits#authentication-api).
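As a rough illustration of what "5 requests per minute with bursts of up to 10" means for a page that fires many requests at once, here is a minimal token-bucket sketch. The token-bucket model itself is an assumption about how such a limit behaves, not a description of auth0's actual implementation, and `TokenBucket` and `simulate_burst` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Toy model of a rate limit: `capacity` is the burst size,
    `refill_per_sec` is the sustained rate. Assumed model, not auth0's code."""

    capacity: float = 10.0            # bursts of up to 10 requests
    refill_per_sec: float = 5.0 / 60  # 5 requests per minute
    tokens: float = 10.0              # start with a full bucket
    last: float = 0.0                 # timestamp of the previous request

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, never above capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would see a 429 here


def simulate_burst(n_requests: int, now: float = 0.0) -> int:
    """Return how many of n simultaneous requests would be rejected."""
    bucket = TokenBucket()
    return sum(0 if bucket.allow(now) else 1 for _ in range(n_requests))
```

Under this model, anything beyond the first 10 simultaneous /userinfo calls for the same user id is rejected until the bucket refills.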

We do cache the results of these calls, but we make one request per username at roughly the same time when /users is loaded, and we have multiple admin webheads, so it could take a while for all the webheads to have cached results for all of the users.

Off the top of my head, the only way I can think to fix this is to cache the results of the /userinfo queries somewhere persistent that can be shared between webheads. Right now, the only thing we have that persists is the mysql database, but we've talked about adding memcache at some point.
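A minimal sketch of the shared-cache idea, assuming a store that every webhead can reach. `SharedTTLCache` is a hypothetical in-memory stand-in for memcache or a MySQL table, and `cached_userinfo`, `fetch`, and the 600 second TTL are illustrative choices, not existing Balrog code.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class SharedTTLCache:
    """Stand-in for a store shared by all webheads (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str, now: float) -> Optional[dict]:
        entry = self._data.get(key)
        if entry is None or entry[0] <= now:
            return None  # missing or expired
        return entry[1]

    def set(self, key: str, value: dict, ttl: float, now: float) -> None:
        self._data[key] = (now + ttl, value)


def cached_userinfo(
    cache: SharedTTLCache,
    access_token: str,
    fetch: Callable[[str], dict],
    ttl: float = 600.0,
    now: Optional[float] = None,
) -> dict:
    """Return userinfo from the shared cache, calling `fetch` only on a miss."""
    now = time.time() if now is None else now
    # Hash the token so the raw credential is never used as a cache key.
    key = hashlib.sha256(access_token.encode("utf-8")).hexdigest()
    info = cache.get(key, now)
    if info is None:
        info = fetch(access_token)  # the only place /userinfo would be called
        cache.set(key, info, ttl, now)
    return info
```

With a store like this, a second webhead handling the same token reuses the first webhead's result instead of making its own /userinfo call.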

There may also be a more clever fix that I haven't considered.

I can confirm this is happening more often than not. We should prioritise it at some point in H2, though it's certainly not urgent. It might also be a great good first bug for someone ramping up on Balrog.

I don't think this is a good first bug until we have a clear path forward. The original comment here has one idea, but honestly, it's pretty terrible - there's most likely a better fix.

Is there some sort of polling happening to check OAuth token expiration rather than checking it in the app? This guess might be completely off, since I don't know the code that well. :) I think Treeherder requests the token from the front end and passes it to the backend, which validates it and uses it for session management (so whenever someone navigates to the app, the user API (in Treeherder) will check the session/auth status).

I think Treeherder just extracts that info in the backend after validating the token and stores it in a user table (basically email, name and other stuff). Not sure if this helps at all :) I can help look into this later if you want, once I finish focus taskgraph work.

My memory is really shoddy on this, but as a quick walkthrough, we're talking about requests like this to the backend: https://aus4-admin.mozilla.org/api/users/bhearsum@mozilla.com

These end up running code like https://github.com/mozilla-releng/balrog/blob/main/src/auslib/util/auth.py#L56, which eventually calls auth0 at https://github.com/mozilla-releng/balrog/blob/main/src/auslib/util/auth.py#L53 (if the cache is cold or expired).

Another thing that doesn't help is that we have multiple replicas on the backend -- so even though we're caching in memory on each one, there are multiple instances doing it, each with its own cache to warm.
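To make the replica effect concrete, here is a small simulation, assuming requests are spread round-robin over replicas that each keep a private in-memory cache. `Replica` and `count_upstream_calls` are hypothetical names, and the model ignores concurrency, which may make things worse in practice if simultaneous requests on one cold replica all miss.

```python
from itertools import cycle
from typing import Callable, Dict, List


class Replica:
    """One backend process with its own in-memory cache (hypothetical model)."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, dict] = {}

    def userinfo(self, access_token: str) -> dict:
        # Only this replica benefits from the entry stored here.
        if access_token not in self._cache:
            self._cache[access_token] = self._fetch(access_token)
        return self._cache[access_token]


def count_upstream_calls(n_replicas: int, n_requests: int, access_token: str = "token") -> int:
    """Spread n_requests round-robin over cold replicas; count upstream fetches."""
    calls: List[str] = []

    def fetch(token: str) -> dict:
        calls.append(token)  # stands in for one /userinfo request
        return {"email": "user@example.com"}

    replicas = [Replica(fetch) for _ in range(n_replicas)]
    for _, replica in zip(range(n_requests), cycle(replicas)):
        replica.userinfo(access_token)
    return len(calls)
```

In this model a single replica makes one upstream call for a given token, while every additional cold replica that receives a request adds one more.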

Here's a full traceback of the actual error (from https://sentry.prod.mozaws.net/operations/prod-admin/issues/10855572/?query=is%3Aunresolved):

RateLimitError: 429: Too Many Requests
  File "flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "connexion/decorators/decorator.py", line 48, in wrapper
    response = function(request)
  File "connexion/decorators/uri_parsing.py", line 144, in wrapper
    response = function(request)
  File "connexion/decorators/validation.py", line 384, in wrapper
    return function(request)
  File "connexion/decorators/parameter.py", line 121, in wrapper
    return function(**kwargs)
  File "auslib/web/admin/views/mapper.py", line 99, in specific_user_get
    return SpecificUserView().get(username)
  File "auslib/web/admin/views/base.py", line 41, in decorated
    return f(*args, **kwargs)
  File "auslib/web/admin/views/base.py", line 17, in decorated
    username = verified_userinfo(request, app.config["AUTH_DOMAIN"], app.config["AUTH_AUDIENCE"])["email"]
  File "auslib/util/auth.py", line 80, in verified_userinfo
    payload.update(get_additional_userinfo(auth_domain, access_token))
  File "/usr/local/lib/python3.8/site-packages/repoze/lru/__init__.py", line 348, in cached_wrapper
    val = func(*args, **kwargs)
  File "auslib/util/auth.py", line 53, in get_additional_userinfo
    return auth0_Users(auth_domain).userinfo(access_token)
  File "auth0/v3/authentication/users.py", line 25, in userinfo
    return self.get(
  File "auth0/v3/authentication/base.py", line 55, in get
    return self._process_response(response)
  File "auth0/v3/authentication/base.py", line 58, in _process_response
    return self._parse(response).content()
  File "auth0/v3/authentication/base.py", line 79, in content
    raise RateLimitError(error_code=self._error_code(),

Hrm, I'm guessing here that if an ID token (grabbed from the header) were included and decoded like this:

    user_info = jwt.decode(
        id_token,
        rsa_key,
        algorithms=['RS256'],
        audience=AUTH0_CLIENTID,
        access_token=access_token,
        issuer="https://" + AUTH0_DOMAIN + "/",
    )

then the user name and email would be in the payload, and there wouldn't be a need to call that API for user info every time, unless you need more than just that info? https://github.com/mozilla/treeherder/blob/c60a4bcc0d689362d34e3c67e83ed94f90589e96/treeherder/auth/backends.py#L152
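To show why a verified ID token could replace the /userinfo call, here is a stdlib-only sketch that checks a token's signature and standard claims and then reads `email` and `name` from the payload. It deliberately uses HS256 with a shared secret so it runs without third-party packages; real auth0 tokens like the one above are RS256 and would need a JWT library and the tenant's public key. `make_demo_token` and `userinfo_from_id_token` are hypothetical names, and the `aud` check assumes a single string audience.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def make_demo_token(claims: dict, secret: bytes) -> str:
    """Build an HS256 token for this demo only; auth0 would issue the real one."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def userinfo_from_id_token(
    token: str, secret: bytes, audience: str, issuer: str, now: Optional[float] = None
) -> dict:
    """Verify the token locally, then return the identity claims it carries."""
    now = time.time() if now is None else now
    header_b64, payload_b64, sig_b64 = token.split(".")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    # Simplification: assumes `aud` is a single string, not a list.
    if claims.get("aud") != audience or claims.get("iss") != issuer:
        raise ValueError("wrong audience or issuer")
    if claims.get("exp", 0) <= now:
        raise ValueError("token expired")
    # No network call: the identity comes straight from the verified payload.
    return {"email": claims.get("email"), "name": claims.get("name")}
```

The point of the sketch is only that verification and claim extraction can both happen locally; whether the claims Balrog needs are actually present in the ID token would have to be checked against the auth0 configuration.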

Without diving deeply into things again, that seems pretty plausible. The Balrog code needs refactoring to make it possible, and we need to make sure we don't disable verification of the tokens (which requires talking to auth0 AFAIK) in places where it does matter.