mobiusml/aana_sdk

Add rate limiting/backoff based on SDK usage to the sync endpoints.

Closed this issue · 5 comments

Enhancement Description

  • Overview of the enhancement
    When the SDK reaches its usage limit and can no longer serve requests synchronously, it should return a backoff message to the user.

Advantages

  • Benefits of implementing this enhancement
  1. When the SDK is shipped for users to test, they receive feedback on usage and know to wait before sending more requests.
  2. The UI can give feedback to the user on the usage instead of the experience breaking.

How is this solved normally by other projects? Add links

Review of four possibilities I found for access limiting with Ray Serve:

Config only:
1. Throttling: using declared deployment/machine resources. Doesn't work: this only affects the number of possible deployments, not how they handle load.
2. Throttling: setting target_num_ongoing_concurrent_requests. Doesn't work: this limits concurrent executions, but excess requests are queued instead of getting a 429 response.
Code solutions:
3. Rate limiting: add a decorator to the deployment inference function that implements rate limiting. Doesn't work: the rate-limited calls still wait for the earlier, non-rate-limited ones to complete before erroring.
4. Rate limiting: a custom RequestHandler that implements, e.g., a leaky bucket algorithm. Will work, but is probably the most work.
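
For reference, a minimal sketch of the leaky bucket idea from option 4, in plain Python. The names here (`LeakyBucket`, `capacity`, `leak_rate`, `handle`) are hypothetical and not part of Ray Serve or aana_sdk, and the wiring into a custom request handler is left out. The point is only that the decision is made before any inference work starts, so a refused request gets its 429 immediately instead of waiting behind earlier calls.

```python
import math
import time
from typing import Callable


class LeakyBucket:
    """Leaky bucket used as a meter: each request adds one unit, units drain at a fixed rate."""

    def __init__(self, capacity: float, leak_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity    # maximum units the bucket can hold (burst size)
        self.leak_rate = leak_rate  # units drained per second (sustained rate)
        self._clock = clock         # injectable so the bucket can be tested without sleeping
        self._level = 0.0
        self._last = clock()

    def _leak(self) -> None:
        # Drain the bucket according to the time elapsed since the last update.
        now = self._clock()
        self._level = max(0.0, self._level - (now - self._last) * self.leak_rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Admit the request if adding one unit does not overflow the bucket."""
        self._leak()
        if self._level + 1.0 > self.capacity:
            return False
        self._level += 1.0
        return True

    def retry_after(self) -> float:
        """Seconds until one more request would fit."""
        self._leak()
        overflow = self._level + 1.0 - self.capacity
        return max(0.0, overflow / self.leak_rate)


def handle(bucket: LeakyBucket) -> tuple[int, dict[str, str]]:
    """Hypothetical handler: answer right away with (status, headers), never queue."""
    if bucket.try_acquire():
        return 200, {}
    # Refuse immediately and tell the client how long to back off.
    return 429, {"Retry-After": str(math.ceil(bucket.retry_after()))}
```

With `capacity=2` and `leak_rate=1.0`, two back-to-back requests are admitted and a third is refused with a `Retry-After` of one second; a UI can use that header to tell the user how long to wait.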

How do we decide when we have to refuse requests?

The ideal scenario would be to decide based on runtime characteristics (for example, given X models, Y GPUs, and Z expected execution time, limit to Y/X requests per Z time), or even to adjust rate limits while running. For now we'll just use manually configured values.
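
To make the Y/X-per-Z heuristic concrete, here is a small sketch that computes it next to a manually configured value. `RateLimit`, `derived_limit`, and the numbers in `MANUAL_LIMIT` are made up for illustration and are not the SDK's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateLimit:
    max_requests: float    # requests allowed per window
    window_seconds: float  # length of the window in seconds

    @property
    def per_second(self) -> float:
        """Sustained rate implied by the limit."""
        return self.max_requests / self.window_seconds


def derived_limit(num_models: int, num_gpus: int, expected_exec_seconds: float) -> RateLimit:
    """Y/X requests per Z seconds: X models, Y GPUs, Z expected execution time."""
    if num_models <= 0 or num_gpus <= 0 or expected_exec_seconds <= 0:
        raise ValueError("all inputs must be positive")
    return RateLimit(max_requests=num_gpus / num_models,
                     window_seconds=expected_exec_seconds)


# What "manually configured values" could look like (placeholder numbers).
MANUAL_LIMIT = RateLimit(max_requests=10, window_seconds=60.0)

# Example: 2 models sharing 4 GPUs, about 5 s per request.
example = derived_limit(num_models=2, num_gpus=4, expected_exec_seconds=5.0)
```

In this example the derived limit is 2 requests per 5 seconds. Either kind of limit could feed a limiter's burst size and sustained rate.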

@evanderiel This is done, right? Can we close the issue?

More could always be done, but yes