Server to client media playback with frame-based processing

Many of the examples in this repo show client-to-server media sinks (mic / video capture) with frame-based callback processing. I am looking to do the reverse: server-to-client media playback, also with frame-based callback processing. This would be useful for real-time audio playback where each frame is processed on the server before it is sent.

After searching through this discussion https://discuss.streamlit.io/t/new-component-streamlit-webrtc-a-new-way-to-deal-with-real-time-media-streams/8669, and the example pages in streamlit-webrtc, I have not been able to find an example of this.

To be specific, I am looking to do the following:

1. Load an audio file (server)
2. Start playback (from server to client), frame by frame
3. Process each frame before it is sent to the client via a callback (processing should occur on the server, for example ML inference)
4. Play back the processed audio frame on the client
5. Continue in real time
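To make the steps above concrete, here is a minimal, library-free sketch of the loop I have in mind. It is not streamlit-webrtc or aiortc code: the sample rate, the 20 ms frame size, and the names iter_frames, halve_volume and play are all assumptions made up for illustration, and the "send" step is just a function argument standing in for the WebRTC transport. Real-time pacing is left out on purpose.

```python
from array import array
from typing import Callable, Iterator

SAMPLE_RATE = 48_000  # assumed sample rate in Hz
FRAME_MS = 20         # assumed frame duration in milliseconds
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 960 samples per frame


def iter_frames(pcm: array, frame_samples: int = FRAME_SAMPLES) -> Iterator[array]:
    """Split mono 16-bit PCM into fixed-size frames, zero-padding the last one."""
    for start in range(0, len(pcm), frame_samples):
        frame = pcm[start:start + frame_samples]
        if len(frame) < frame_samples:
            frame.extend([0] * (frame_samples - len(frame)))
        yield frame


def halve_volume(frame: array) -> array:
    """Hypothetical per-frame callback; an ML inference call would go here."""
    return array("h", (sample // 2 for sample in frame))


def play(pcm: array,
         callback: Callable[[array], array],
         send: Callable[[array], None]) -> None:
    """Process each frame on the 'server' and hand the result to `send`."""
    for frame in iter_frames(pcm):
        send(callback(frame))
```

For example, calling play(pcm, halve_volume, sent.append) with an empty list sent collects the processed frames in order. In the real setup, send would be whatever pushes a frame to the client, and the loop would need to be paced to the frame duration.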

This example uses the MediaPlayer class from aiortc (streamlit-webrtc/pages/8_media_files_streaming.py, line 9 in ff697dc):

from aiortc.contrib.media import MediaPlayer

However, it does not seem to provide any sort of callback on the stream at the audio frame level.

Digging deeper, the MediaPlayer class exposes its audio and video streams as MediaStreamTrack instances (https://aiortc.readthedocs.io/en/latest/api.html#aiortc.MediaStreamTrack), whose recv method is awaited once per frame and returns the next frame.

Would the correct approach be to create a new subclass of MediaStreamTrack and write a custom recv for the required processing? I found this related thread: aiortc/aiortc#571
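A rough sketch of the wrapping idea, so it is clear what I mean. To keep it runnable without aiortc installed, it uses plain stand-in classes: ListSource plays the role of a source track such as the player's audio track, and ProcessedTrack is the wrapper. Both names, and the use of EOFError for an exhausted source, are my own assumptions. With aiortc, I would expect the wrapper to subclass MediaStreamTrack, set kind = "audio", and call super().__init__().

```python
import asyncio
from typing import Any, Callable, Iterable, List


class ListSource:
    """Stand-in for a source track; only the async recv() shape matters here."""

    def __init__(self, frames: Iterable[Any]) -> None:
        self._frames = list(frames)

    async def recv(self) -> Any:
        if not self._frames:
            # Assumption: a real track would signal end-of-stream differently.
            raise EOFError("source exhausted")
        return self._frames.pop(0)


class ProcessedTrack:
    """Wraps a source and applies `process` to every frame it yields."""

    def __init__(self, source: Any, process: Callable[[Any], Any]) -> None:
        self._source = source
        self._process = process

    async def recv(self) -> Any:
        frame = await self._source.recv()  # pull the next frame from the source
        return self._process(frame)        # per-frame processing happens here


async def drain(track: Any, count: int) -> List[Any]:
    """Pull `count` frames from a track, the way a consumer would."""
    return [await track.recv() for _ in range(count)]


# Integers stand in for audio frames in this toy run.
result = asyncio.run(drain(ProcessedTrack(ListSource([1, 2, 3]), lambda f: f * 10), 3))
```

One thing I am unsure about with this design: if the processing step is slow (for example ML inference), calling it directly inside recv would block the event loop, so it may need to be moved to an executor.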

Is this functionality supported currently? I would appreciate any guidance here.

Thanks heaps!

Interested in pointers to potential solutions. My use case is similar, generate some text-to-speech and play it back.

The example you mentioned (https://github.com/whitphx/streamlit-webrtc/blob/main/pages/8_media_files_streaming.py) uses a callback to process the video frames (video_frame_callback).
Would using audio_frame_callback there instead work for you?

The audio filter example may also be a useful reference for how to use the audio callback, although it is a client-to-server example.

@wenshutang I'm also interested in this issue for my text-to-speech streaming problem. I wonder if you've found a solution yet. If so, could you please share some references or suggestions? Thank you.