vutran1710/PyrateLimiter

Feature Request: weighted rate limits

Closed this issue · 8 comments

Cool library. I was thinking it would be useful if, instead of the 1 call = 1 request calculation for the rate limit, there could be an optional "weight". This would probably only be applicable to the try_acquire method. The use case is calling OpenAI: they have both a QPS rate limit and a "token" limit, where tokens are basically the number of words in the text. So their rate limit is something like 3k queries per minute and 250k tokens (words) per minute, and I want to use this library to enforce both. If I could make each item in the bucket have a "weight" (the number of tokens in the text, in this example), then I think the rest of the library should work as is. Right now I think I can hack around it, but it would be a cool feature.
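To make the idea concrete, here is a toy, library-agnostic sketch of a sliding-window limiter whose try_acquire takes a weight; the class, names, and numbers are illustrative only and are not PyrateLimiter's actual API:

import time


class WeightedRateLimiter:
    """Toy sliding-window limiter where one acquire can consume several units."""

    def __init__(self, capacity: int, window_seconds: float) -> None:
        self.capacity = capacity
        self.window = window_seconds
        self.events: list[tuple[float, int]] = []  # (timestamp, weight) pairs

    def try_acquire(self, weight: int = 1) -> bool:
        now = time.monotonic()
        # Forget events that have aged out of the window.
        self.events = [(t, w) for t, w in self.events if now - t < self.window]
        if sum(w for _, w in self.events) + weight > self.capacity:
            return False
        self.events.append((now, weight))
        return True


# The dual OpenAI-style limits from the example above.
query_limit = WeightedRateLimiter(3_000, 60)
token_limit = WeightedRateLimiter(250_000, 60)

text = "some prompt text"
# (Simplified: if the second check fails, the first acquire is not rolled back.)
if query_limit.try_acquire() and token_limit.try_acquire(weight=len(text.split())):
    pass  # both limits have room, safe to send the request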

Sounds like a good idea. I'll give it a go. Thanks

This will be reserved for the next major version.

dekked commented

Note that in the OpenAI API case, you will only truly know how many tokens were consumed by a request after the request has succeeded.

Therefore, you can only guess how much capacity you will need beforehand, but there should be a way of notifying the bucket of how much capacity was actually consumed after each request.

Good point @dekked. Typically you can get a decent bound ahead of time, since you can calculate the context side of the tokens, and you set a maximum for the completion/generation side when you send the request.

dekked commented

Actually, I think what would be better for that case is using the decent-bound guess for the first request only, and then using the exact number of tokens OpenAI last reported as consumed. If you do use max_tokens, even better, as you can pretty much count everything beforehand.
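One way to read that suggestion as a small self-contained sketch (the class and method names are made up for illustration; the recorded value would be the total_tokens figure from the response's usage field):

from typing import Optional


class TokenEstimator:
    """First request: use a conservative bound. Afterwards: reuse whatever
    token count the previous OpenAI response actually reported."""

    def __init__(self) -> None:
        self.last_actual: Optional[int] = None

    def estimate(self, prompt_tokens: int, max_tokens: int) -> int:
        if self.last_actual is None:
            return prompt_tokens + max_tokens  # worst case for the first call
        return self.last_actual

    def record(self, actual_total_tokens: int) -> None:
        # Call this with the usage reported by each completed request.
        self.last_actual = actual_total_tokens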

jli commented

For what it's worth, OpenAI has an example of a script for bulk analysis that handles errors and rate limits: https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py

My understanding of their rate limit implementation is that the num_tokens_consumed_from_request function uses the API call's max_tokens value as the consumption amount for a request, and the token capacity counter is decremented by that amount.

That should be an upper bound and so avoid rate limit errors, but an approach like @dekked suggests (using the exact # tokens consumed by the request) would be better at maximizing the available throughput.
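The shape of that bound, as a rough sketch (the cookbook's function relies on a real tokenizer; this naive stand-in just splits on whitespace):

def upper_bound_tokens(prompt: str, max_tokens: int) -> int:
    # The completion can never exceed max_tokens, so prompt length plus
    # max_tokens is an upper bound on what the request can consume.
    prompt_tokens = len(prompt.split())
    return prompt_tokens + max_tokens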

jli commented

In terms of API design for pyratelimiter, being able to apply "negative weights" to the buckets would support the case where the estimate is too high. But to be fully general, it might be nice to be able to forcibly add usage to the buckets (like try_acquire, except it always succeeds). That way, if your estimate was too low, you can let the limiter know without blocking.

pseudocode:

import time

def make_request(req):
    num_tokens_estimate = calc_num_tokens(req)
    # Wait until the limiter accepts the estimated weight.
    while True:
        try:
            limiter.try_acquire(num_tokens_estimate)
            break
        except BucketFullException:
            time.sleep(1)
    result = api_request.Create(req)
    # positive if initial estimate was too low, negative if estimate was too high
    usage_diff = result.tokens_used - num_tokens_estimate
    limiter.force_add_usage(usage_diff)

    return result

Resolved in the new major release (v3.0.0)
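For anyone landing here later, a minimal usage sketch assuming the v3 API shape (Rate, Duration, Limiter, and a weight argument on try_acquire); check the v3 docs for the exact signatures and defaults:

from pyrate_limiter import Duration, Limiter, Rate

# 250k token-units per minute; each acquire may consume many units at once.
limiter = Limiter(Rate(250_000, Duration.MINUTE))

prompt = "some prompt text"
estimated_tokens = len(prompt.split())  # stand-in for a real tokenizer

# Fails (raising by default) if the weighted acquire does not fit the window.
limiter.try_acquire("openai-tokens", weight=estimated_tokens)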