karpetrosyan/hishel

Caching streams does not stream response in real-time.

I can use httpx to stream responses from an LLM API as follows:

import time
import httpx

# ...
async with httpx.AsyncClient() as client:
    start_time = time.time()
    async with client.stream('POST',
                             endpoint,
                             headers=headers,
                             json=json_payload,
                             timeout=None,
                             ) as response:
        partial_response = ''
        prev_time = time.time()
        print("time to first chunk: {:.5f}".format(prev_time - start_time))
        async for chunk in response.aiter_lines():
            cur_time = time.time()
            print("{:.5f}".format(cur_time - prev_time))
            prev_time = cur_time
            # ...

aiter_lines() yields new chunks at intervals like these:

time to first chunk: 0.69247
0.00051
0.00015
0.00044
0.00015
0.00013
0.00053
0.00017
0.00014
0.00016
0.00007
0.00014
0.00013
...

When I use hishel to cache the streams, chunks are no longer yielded in real time. Instead, the response now appears to be fetched in its entirety first, after which each chunk is yielded almost instantaneously:

import time
import hishel

# ...
client = hishel.AsyncCacheClient()
start_time = time.time()
async with client.stream('POST',
                         endpoint,
                         headers=headers,
                         json=json_payload,
                         timeout=None,
                         ) as response:
    partial_response = ''
    prev_time = time.time()
    print("time to first chunk: {:.5f}".format(prev_time - start_time))
    async for chunk in response.aiter_lines():
        cur_time = time.time()
        print("{:.5f}".format(cur_time - prev_time))
        prev_time = cur_time
        # ...

time to first chunk: 1.71885
0.00035
0.00013
0.00001
0.00004
0.00001
0.00003
0.00001
0.00003
0.00001
...

Desired behavior: Streaming with hishel should yield chunks in real-time, just like httpx.

Hey!

At this point, we don't support real-time streaming of responses. Hishel loads the entire response into memory and then simulates the streaming process. This is because we support various storage systems and serializers, and there's no consistent way to handle streaming across all of them.
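
To make the observed timings concrete, here is a minimal illustrative sketch (not hishel's actual code) of what "simulates the streaming process" means: once the whole body is in memory, streaming is just re-chunking a bytes object, which is why every chunk after the first arrives almost instantly.

# Illustrative only, not hishel's implementation: the body has already
# been fetched in full, so yielding chunks involves no network waits.
async def simulated_stream(body: bytes, chunk_size: int = 4096):
    for start in range(0, len(body), chunk_size):
        yield body[start:start + chunk_size]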

However, I am considering adding a special storage/serializer that will support streaming responses.
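
Until then, one possible workaround, sketched under the assumption that the streamed LLM responses don't need to be cached at all: keep a plain httpx.AsyncClient alongside the hishel client and route streaming calls around the cache. The client names and the stream_lines helper below are hypothetical, not part of hishel's API.

import hishel
import httpx

cache_client = hishel.AsyncCacheClient()  # for cacheable, non-streaming requests
stream_client = httpx.AsyncClient()       # for real-time streaming requests

# Hypothetical helper: plain httpx yields chunks as the server sends
# them, as the timings at the top of this issue show.
async def stream_lines(endpoint, headers, json_payload):
    async with stream_client.stream('POST', endpoint,
                                    headers=headers,
                                    json=json_payload,
                                    timeout=None) as response:
        async for line in response.aiter_lines():
            yield line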

@karpetrosyan that'd be great!