[Bug]: Cannot get past 50 RPS

Question

Opened this issue 2 months ago · 5 comments

What happened?

I have OpenAI tier 5 usage, which should give me 30,000 RPM = 500 RPS with "gpt-4o-mini". However, I struggle to get past 50 RPS.

The minimal replication:

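import asyncio

from litellm import acompletion


async def main():
    # Build 2000 identical requests and await them concurrently.
    tasks = [acompletion(
        model="gpt-4o-mini",
        messages=[
          {"role": "system", "content": "You're an agent who answers yes or no"},
          {"role": "user", "content": "Is the sky blue?"},
        ],
    ) for _ in range(2000)]
    await asyncio.gather(*tasks)

asyncio.run(main())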

I only get 50 items/second as opposed to ~500 items/second when sending raw HTTP requests.
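For anyone who wants to reproduce the items/second figure without extra dependencies, here is a minimal sketch of a throughput harness using only the standard library. `fake_request` and `measure_throughput` are hypothetical names introduced for illustration, and the stand-in coroutine just sleeps instead of calling an API, so the number it prints says nothing about LiteLLM or OpenAI.

```python
import asyncio
import time


async def fake_request(latency_s: float = 0.01) -> str:
    """Hypothetical stand-in for one API call; swap in a real request."""
    await asyncio.sleep(latency_s)
    return "yes"


async def measure_throughput(make_request, n_requests: int) -> tuple[int, float]:
    """Run n_requests concurrently; return (completed, requests_per_second)."""
    start = time.perf_counter()
    # Schedule every request up front so they all run concurrently.
    tasks = [asyncio.ensure_future(make_request()) for _ in range(n_requests)]
    completed = 0
    for finished in asyncio.as_completed(tasks):
        await finished
        completed += 1
    elapsed = time.perf_counter() - start
    return completed, completed / elapsed


if __name__ == "__main__":
    done, rps = asyncio.run(measure_throughput(fake_request, 2000))
    print(f"{done} requests, {rps:.1f} requests/second")
```

To measure the real call, something like `lambda: acompletion(model="gpt-4o-mini", messages=messages)` could be passed as `make_request` in place of `fake_request`.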

Relevant log output

 16%|█████████████████████▌                                                                                                                 | 320/2000 [00:09<00:40, 41.49it/s]

Answer 1 · 2024-11-14T16:28:06.000Z

Hi @vutrung96, looking into this. How do you get the % complete log output?

Answer 2 · 2024-11-16T18:39:42.000Z

Hi @ishaan-jaff I was just using tqdm

Answer 3 · 2024-11-19T03:24:34.000Z

Hi @ishaan-jaff, any updates on this? I'm also facing this issue!

Answer 4 · 2024-11-21T16:31:41.000Z

Hi @vutrung96 @CharlieJCJ, do you see the issue on litellm.router too? https://docs.litellm.ai/docs/routing

It would help me if you could test with the litellm router too.
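A rough sketch of what the router variant of the replication might look like, assuming a single "gpt-4o-mini" deployment and an `OPENAI_API_KEY` set in the environment. `build_model_list` and `run_with_router` are hypothetical helper names, not part of LiteLLM; `Router(model_list=...)` and `router.acompletion(...)` follow the routing docs linked above.

```python
import asyncio


def build_model_list(model: str) -> list[dict]:
    """One deployment entry in the shape the routing docs describe."""
    return [{"model_name": model, "litellm_params": {"model": model}}]


async def run_with_router(model: str, n_requests: int) -> int:
    """Send n_requests through a litellm Router; return how many completed."""
    # Imported here so the helper above can be used without litellm installed.
    from litellm import Router

    router = Router(model_list=build_model_list(model))
    messages = [
        {"role": "system", "content": "You're an agent who answers yes or no"},
        {"role": "user", "content": "Is the sky blue?"},
    ]
    tasks = [
        router.acompletion(model=model, messages=messages)
        for _ in range(n_requests)
    ]
    responses = await asyncio.gather(*tasks)
    return len(responses)
```

Calling `asyncio.run(run_with_router("gpt-4o-mini", 2000))` sends real requests, so it needs a valid API key and will consume quota.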

Answer 5 · 2024-12-04T14:48:06.000Z

Hi @ishaan-jaff
We tracked down the root cause of the issue.

LiteLLM uses the official OpenAI Python client:

client: Optional[Union[OpenAI, AsyncOpenAI]] = None,

The official OpenAI client performs poorly under high numbers of concurrent requests because of issues in httpx:

openai/openai-python#1596

The issues in httpx are due to a number of factors related to anyio vs asyncio

encode/httpx#3215

Which are addressed in the open PRs below

We saw this when implementing litellm as the backend for our synthetic data engine

bespokelabsai/curator#141

When using our own OpenAI client (with aiohttp instead of httpx), we saturate the highest rate limit (30,000 requests per minute on gpt-4o-mini tier 5). When using litellm, the performance issues cap us well under that limit, at about 200 queries per second (12,000 requests per minute).
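Since the thread mixes per-minute and per-second figures, here is a small sketch of the unit conversion behind them. The helper names are hypothetical, and the inputs are the numbers quoted in this thread.

```python
def rpm_to_rps(requests_per_minute: float) -> float:
    """Convert requests per minute to requests per second."""
    return requests_per_minute / 60.0


def rps_to_rpm(requests_per_second: float) -> float:
    """Convert requests per second to requests per minute."""
    return requests_per_second * 60.0


# Tier 5 limit quoted in the thread: 30,000 RPM.
tier5_limit_rps = rpm_to_rps(30_000)

# Cap observed through litellm: about 200 requests per second.
litellm_cap_rpm = rps_to_rpm(200)

# Share of the tier 5 limit that the observed cap reaches.
fraction_of_limit = litellm_cap_rpm / 30_000

print(tier5_limit_rps, litellm_cap_rpm, fraction_of_limit)
```

On these figures the cap observed through litellm is 40% of the tier 5 limit.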