livekit/python-sdks

AudioSource.capture_frame is slow

saghul opened this issue · 8 comments

👋 Hey there!

I've been prototyping with the Python SDK to build an audio gateway between another realtime system and LiveKit. I am basically creating a new AudioSource and feeding frames to it, similar to what the publish_wave example does, but rather than having a generated sine wave, I am feeding it audio frames.

The problem I'm running into is that the function takes too long to run. I added some logs with precise differences, and I'm measuring it takes somewhere between 6 and 30ms to run. This is a problem because audio will end up very delayed very quickly, as the delay compounds.

Initially I thought I might have been copying buffers not too efficiently, but the publish_wave example shows the same symptoms.

Here is how I measured:

        t0 = perf_counter_ns()
        await source.capture_frame(audio_frame)
        t1 = perf_counter_ns()
        print(f'XX capture frame {(t1 - t0) / 1e6:.3f}')

And here is some output (this was on macOS, FWIW):

XX capture frame 13.939
XX capture frame 15.069
XX capture frame 19.562
XX capture frame 11.712
XX capture frame 8.328
XX capture frame 10.812
XX capture frame 11.314
XX capture frame 7.734
XX capture frame 12.327
XX capture frame 10.045
XX capture frame 5.673
XX capture frame 12.810
XX capture frame 24.589

I also tried using 20ms audio chunks, but that doesn't seem to change things much, so I'm out of ideas. Is there a way to make this faster? Would using Go work better for this scenario?

Thanks in advance! 🙏

Hey @saghul, the time required to capture an audio frame is as expected.

We maintain an internal buffer of 50ms. This means that if you push more than that (i.e. faster than real time), the async function will make your code wait. This is not a performance issue, but rather an API decision.

Example:

  • push 50ms - wait 0
  • push 10ms - wait 10ms
  • push 10ms - wait 10ms
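
The bullets above can be sketched as a toy model. This is hypothetical illustration code, not the SDK's actual implementation (which lives in the Rust layer); it ignores real-time draining, matching the faster-than-real-time case:

```python
# Toy model (NOT the SDK's implementation): the wait imposed by a 50 ms
# internal buffer when audio is pushed faster than real time.
BUFFER_MS = 50

def capture_wait(buffered_ms: int, push_ms: int) -> tuple[int, int]:
    """Return (wait_ms, new_buffered_ms) after pushing push_ms of audio."""
    free = BUFFER_MS - buffered_ms
    wait = max(0, push_ms - free)  # audio that overflows the buffer must wait
    buffered = min(BUFFER_MS, buffered_ms + push_ms)
    return wait, buffered

buffered = 0
for push in (50, 10, 10):
    wait, buffered = capture_wait(buffered, push)
    print(f"push {push}ms -> wait {wait}ms")
```

This reproduces the schedule in the bullets: the first 50ms fills the buffer with no wait, and each subsequent 10ms push waits its own duration.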

This is particularly useful when publishing a file, for example:

while not file.eos():
    data = file.read(1024)
    audio_frame = ...  # build an AudioFrame from the chunk
    await source.capture_frame(audio_frame)

Note that once the buffer receives data, it is immediately forwarded to the libwebrtc internals. We do not wait for the 50ms buffer to fill.

The primary reason for having this buffer is that the asyncio event loop in Python can sometimes be slow, while libwebrtc expects chunks of 10ms. It's not reliable to wait 10ms between chunks on the Python side, so the Rust layer handles this for us with this buffer, since it has access to more accurate timers.

Hmm, interesting. My audio source is also realtime and comes over a WebSocket; what I'm trying to do is feed it to LK. It seems like some pacing would be necessary here then.

I'll try to put the 20ms chunks I get from the WS on a queue and have a task that picks chunks and feeds them to the source.

Do you reckon that's the right approach here?

Yes I think this is a good approach.
We may add an option to disable the queue and feed the frames directly to libwebrtc, but that runs the risk of the asyncio event loop being too slow on Python. (It could be done on a separate thread, though.)

Related to livekit/rust-sdks#353

I gave this a try and it worked, somewhat.

Here is the simple frame processor task:

async def process_frame_queue(src: rtc.AudioSource, q: asyncio.Queue):
    while True:
        frame = await q.get()
        if frame is None:
            break
        await src.capture_frame(frame)

If that queue is unbounded, I still get into the compounding delay territory. If I limit the queue size to 25 frames (my frames are 20ms) then I can do something like this:

for frame in frames:
    f = rtc.AudioFrame(bytes(frame.planes[0]), frame.sample_rate, 1, frame.samples)
    try:
        frame_queue.put_nowait(f)
    except asyncio.QueueFull:
        pass  # drop the frame
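
For reference, a self-contained version of this bounded-queue pattern looks like the sketch below. `fake_capture_frame` is a made-up stand-in for `rtc.AudioSource.capture_frame` so the example runs without a connected room:

```python
import asyncio

# fake_capture_frame is a hypothetical stand-in for the real SDK call,
# which paces against its internal buffer.
async def fake_capture_frame(frame) -> None:
    await asyncio.sleep(0)

async def consumer(q: asyncio.Queue) -> int:
    """Drain the queue until a None sentinel; return how many frames ran."""
    consumed = 0
    while True:
        frame = await q.get()
        if frame is None:
            return consumed
        await fake_capture_frame(frame)
        consumed += 1

async def main() -> tuple[int, int]:
    q: asyncio.Queue = asyncio.Queue(maxsize=25)
    task = asyncio.create_task(consumer(q))
    dropped = 0
    for i in range(100):
        try:
            q.put_nowait(i)  # a real producer would enqueue AudioFrames
        except asyncio.QueueFull:
            dropped += 1     # drop rather than stall the realtime path
        await asyncio.sleep(0)  # yield so the consumer can drain
    await q.put(None)  # sentinel: tell the consumer to stop
    consumed = await task
    print(f"consumed={consumed} dropped={dropped}")
    return consumed, dropped

asyncio.run(main())
```

Every frame is either consumed or dropped, so the two counts always sum to the number produced; the bounded queue turns backpressure into frame drops instead of unbounded latency.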

Without dropping frames, capture_frame is still not fast enough to consume the frames I'm passing in real time.

Is there anything I can provide to help debug this further?

In the PR you linked, it seems like the Rust SDK is polling for frames at a given interval (10ms?) and feeding silence if it doesn't find one, right? Yeah, disabling that would be ideal IMHO.

I'll try to feed the frames in 10ms chunks instead, maybe that can help?

> I'll try to feed the frames in 10ms chunks instead, maybe that can help?

Nah, same thing. My queue in Python fills up quickly, seems like capture_frame cannot drain the buffer fast enough for some reason.

Hey, sorry for the late reply.
Since this PR, if you set queue_size_ms to zero, it'll skip the queue and capture the frames directly.

Though it requires the frames to be 10ms each.
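
Assuming raw 16-bit PCM, slicing arbitrary buffers into the required 10ms frames might look like this (`split_10ms` is a hypothetical helper, not an SDK API):

```python
# Hypothetical helper (not part of the SDK): slice raw 16-bit PCM into 10 ms
# chunks, so each can be wrapped in an AudioFrame when the queue is disabled.
def split_10ms(pcm: bytes, sample_rate: int, num_channels: int) -> list[bytes]:
    samples_per_chunk = sample_rate // 100              # 10 ms worth of samples
    chunk_bytes = samples_per_chunk * num_channels * 2  # int16 -> 2 bytes/sample
    return [pcm[i:i + chunk_bytes]
            for i in range(0, len(pcm) - chunk_bytes + 1, chunk_bytes)]

# A 20 ms mono frame at 48 kHz is 960 samples = 1920 bytes -> two 10 ms chunks.
chunks = split_10ms(b"\x00" * 1920, 48000, 1)
print(len(chunks), len(chunks[0]))  # 2 960
```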

Closing this since it's been released.

saghul commented

Thanks!