KoljaB/RealtimeSTT

Passing audio bytes (Frames) to the AudioToTextRecorder

Closed this issue · 10 comments

I'm building a simple SIP client that accepts user calls and streams the voice to STT. I searched a lot on GitHub for a sample or something related to my case, but I haven't found anything so far.

Python version: 3.9
STT: 0.1.16

All I need is a way to pass frames from SIP to the STT module so I can get the user's voice as text.

    def onFrameReceived(self, frame):
        # Process the incoming frame here
        print("frame_received")
        print(frame.size)
        byte_data = [frame.buf[i] for i in range(frame.buf.size())]
        # Convert pairs of bytes to signed 16-bit little-endian samples
        int_data = [struct.unpack('<h', bytes(byte_data[i:i+2]))[0] for i in range(0, len(byte_data), 2)]
        print(int_data)

        # convert frames to text here
I just found a method called feed_audio, but I don't know how to use it (I mean, how do I pass the byte_data from the code above to this method?).

Just pass the audio chunks one by one to the feed_audio method. Chunks need to be fed in real time and have to be 16000 Hz mono.
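
A minimal sketch of that (not from this thread; it assumes each incoming chunk is already 16000 Hz, mono, 16-bit little-endian PCM bytes, and the on_frame_received hook is hypothetical):

    from RealtimeSTT import AudioToTextRecorder

    # audio comes from the SIP stream, not a microphone
    recorder = AudioToTextRecorder(use_microphone=False, spinner=False)

    def on_frame_received(pcm_chunk: bytes):
        # hand each raw 16 kHz mono 16-bit PCM chunk to the recorder as it arrives
        recorder.feed_audio(pcm_chunk)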

Hi
Could you explain how to pass the data sent through feed_audio on to transcription? I can't find how/where it's done

Could you also tell me how to handle this? I tried a lot, but I didn't get any result.

I tried to use the code below to convert it to a valid chunk:

    # Needed imports for this snippet: os, struct, numpy as np,
    # scipy.signal.resample, and colorama's Fore / Style.

    full_sentences = []

    def process_text(self, text):
        self.full_sentences.append(text)
        self.text_detected("")

    def text_detected(self, text):
        global displayed_text  # module-level string holding the last printed text
        sentences_with_style = [
            f"{Fore.YELLOW + sentence + Style.RESET_ALL if i % 2 == 0 else Fore.CYAN + sentence + Style.RESET_ALL} "
            for i, sentence in enumerate(self.full_sentences)
        ]
        new_text = "".join(sentences_with_style).strip() + " " + text if len(sentences_with_style) > 0 else text

        if new_text != displayed_text:
            displayed_text = new_text
            self.clear_console()
            print(displayed_text, end="", flush=True)

    def clear_console(self):
        os.system('clear' if os.name == 'posix' else 'cls')

    def onFrameReceived(self, frame):
        # Process the incoming frame here
        print("frame_received")
        byte_data = [frame.buf[i] for i in range(frame.buf.size())]
        int_data = [struct.unpack('<h', bytes(byte_data[i:i+2]))[0] for i in range(0, len(byte_data), 2)]
        chunk = self.decode_and_resample(bytes(byte_data), 16000, 16000)
        print(chunk)

    def decode_and_resample(
            self,
            audio_data,
            original_sample_rate,
            target_sample_rate):

        # Decode 16-bit PCM data to a numpy array
        audio_np = np.frombuffer(audio_data, dtype=np.int16)

        # Calculate the number of samples after resampling
        num_original_samples = len(audio_np)
        num_target_samples = int(num_original_samples * target_sample_rate /
                                 original_sample_rate)

        # Resample the audio
        resampled_audio = resample(audio_np, num_target_samples)
        return resampled_audio.astype(np.int16).tobytes()

And this is a snippet from the terminal:

frame_received
b'\xf9\xff\xfe\xff\x07\x00\n\x00\x06\x00\x04\x00\t\x00\r\x00\x04\x00\xfc\xff\xfb\xff\x00\x00\x03\x00\x00\x00\xfb\xff\xfc\xff\x05\x00\x0b\x00\t\x00\x07\x00\x06\x00\x06\x00\t\x00\t\x00\x04\x00\x02\x00\xf7\xff\xf3\xff\xf9\xff\x02\x00\x05\x00\xfe\xff\xfb\xff\xfc\xff\x05\x00\n\x00\t\x00\x05\x00\x05\x00\t\x00\t\x00\x06\x00\x04\x00\x07\x00\t\x00\t\x00\x05\x00\x03\x00\x08\x00\x11\x00\x16\x00\x12\x00\t\x00\x03\x00\x05\x00\t\x00\x07\x00\x07\x00\x07\x00\x07\x00\x07\x00\x06\x00\t\x00\x07\x00\x05\x00\x05\x00\t\x00\x0b\x00\x04\x00\xfd\xff\xfd\xff\x00\x00\x04\x00\x00\x00\xfd\xff\xfd\xff\x03\x00\t\x00\r\x00\x08\x00\x04\x00\xfe\xff\xfb\xff\xfe\xff\x04\x00\x0b\x00\n\x00\x05\x00\x05\x00\t\x00\x07\x00\x07\x00\x06\x00\x06\x00\x08\x00\x06\x00\x06\x00\x06\x00\x05\x00\x07\x00\x08\x00\x07\x00\x05\x00\x00\x00\xfa\xff\xf6\xff\xf5\xff\xff\xff\t\x00\x0c\x00\x05\x00\xfc\xff\xfa\xff\x07\x00\x16\x00\x17\x00\x08\x00\xfa\xff\xf7\xff\x00\x00\x07\x00\x08\x00\x06\x00\x07\x00\x06\x00\x07\x00\x06\x00\x06\x00\x05\x00\x06\x00\x08\x00\t\x00\x04\x00\x04\x00\x08\x00\x0b\x00\x05\x00\xfd\xff\xfa\xff\x00\x00\x05\x00\x02\x00\xf9\xff\xf9\xff\x05\x00\x16\x00\x17\x00\x0f\x00\x07\x00\x04\x00\x05\x00\t\x00\t\x00\x06\x00\x04\x00\x04\x00\x08\x00\x0b\x00\x05\x00\xfb\xff\xf9\xff\x02\x00\x04\x00\x00\x00\xf8\xff\xfb\xff\t\x00\x0e\x00\x05\x00\xfb\xff\xfc\xff\x01\x00\x03\x00\xfc\xff\xfc\xff\x02\x00\x01\x00\xf8\xff\xec\xff\xf2\xff\x01\x00\r\x00\x0b\x00\x03\x00\x03\x00\x08\x00\t\x00\x06\x00\x04\x00\x06\x00\x06\x00\x07\x00\x07\x00\t\x00\x06\x00\x00\x00\xf9\xff\xf6\xff\xf3\xff\xf7\xff\xfc\xff\xff\xff\x03\x00\x08\x00\x0b\x00\x08\x00\x04\x00\x04\x00\x0b\x00\x14\x00\x14\x00\t\x00\xfa\xff\xf3\xff\xf4\xff\xfa\xff\xfb\xff\xf8\xff\xf3\xff\xf7\xff\xfb\xff\xfe\xff\x05\x00\x08\x00\x0b\x00\x01\x00\xf6\xff\xf4\xff\xf9\xff\x03\x00\x05\x00\xff\xff\xfa\xff\xf7\xff\xf5\xff\xf7\xff\xf9\xff\xff\xff\x06\x00\x03\x00\xf9\xff\xea\xff\xe8\xff\xf0\xff\xf8\xff\xf9\xff\xfa\xff\xf5\xff\xf8\xff\xf7\xff\xf9\xff\xf8\xff\xf7\xff\xf6\xff\xfa\xff\x00\x00\x04\x00\x08\x00\x08\x00\x07\x00\x05\x00\x08\x00\x08\x00\x07\x00\x05\x00\x06\x00\t\x00\x08\x00\x04\x00\x04\x00\x0b\x00\x0c\x00\x03\x00\xf4\xff\xea\xff\xf4\xff\x05\x00\r\x00\t\x00\x03\x00\x07\x00\t\x00\x07\x00\x05\x00\t\x00\x0b\x00\x05\x00\xfd\xff\xfa\xff\x00\x00\x03\x00\x02\x00\xfd\xff\xfc\xff\x04\x00\n\x00\x0c\x00\x07\x00\x03\x00\x04\x00\r\x00\x0b\x00\x02\x00\xfc\xff\xfc\xff\x07\x00\x14\x00\x14\x00\t\x00\x02\x00\x05\x00\x0b\x00\x08\x00\x04\x00\x07\x00\n\x00\x06\x00\xfe\xff\xf8\xff\xfd\xff'

But when I add these two lines at the end of the onFrameReceived method:

self.recorder.feed_audio(chunk)
self.recorder.text(self.process_text)

I get only the first chunk printed, and then it stops working because of this line:
self.recorder.text(self.process_text)

Do you have any idea about this scenario or what I need to do?

The feed_audio and text methods should run in different threads. See example implementation...
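
A hedged sketch of that split (the transcription_loop name and the use_microphone=False setup are assumptions, not the linked example):

    import threading
    from RealtimeSTT import AudioToTextRecorder

    recorder = AudioToTextRecorder(use_microphone=False, spinner=False)

    def transcription_loop():
        # recorder.text() blocks until the end of a sentence is detected,
        # so keep it off the thread that receives the SIP frames
        while True:
            print(recorder.text())

    threading.Thread(target=transcription_loop, daemon=True).start()

    def on_frame_received(pcm_chunk: bytes):
        # called from the SIP stack; only feeds audio, never blocks on text()
        recorder.feed_audio(pcm_chunk)

With this split, the frame handler never calls the blocking text() method, so frame delivery isn't stalled.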

I really want to thank you for your help, I did it 😃

@galgreshler yes, I think you can see this sample:

recorder_config = {

Another question please @KoljaB

Is it possible to get the text word by word using this approach? Right now I have to wait until the user finishes talking, and only then do I get the whole sentence.
For example, this is the current output (STT_sentence: So could you tell me what is the color of the sky?),
but I need to get it in real time, and fast, like this:

So
could
you
me
...

This is the config for STT:

    recorder_config = {
        'use_microphone': False,
        'spinner': False,
        'model': 'large-v2',
        'language': 'en',
        'silero_sensitivity': 0.4,
        'webrtc_sensitivity': 2,
        'post_speech_silence_duration': 0.4,
        'min_length_of_recording': 0,
        'min_gap_between_recordings': 0,
        'enable_realtime_transcription': True,
        'realtime_processing_pause': 0.2,   
        'realtime_model_type': 'tiny.en',
        }
    def recorder_thread(self):
        global recorder
        print("Initializing RealtimeSTT...")
        recorder = AudioToTextRecorder(**self.recorder_config)
        print("RealtimeSTT initialized")
        self.recorder_ready.set()
        while True:
            stt_sentence = recorder.text()
            print(f"\r STT_sentence: {stt_sentence}")

Whisper requires context to maintain transcription quality. It can't reliably transcribe individual words as they are spoken. Therefore RealtimeSTT provides the entire audio context to Whisper until it detects the end of a sentence (using voice activity detection).

You can approximate word-by-word updates using the on_realtime_transcription_update or on_realtime_transcription_stabilized callbacks. These receive the current running sentence, so you can extract the latest word from the ongoing transcription:

	if current_text and current_text != last_text:
		new_realtime_text_fragment = current_text[len(last_text):]
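
A hedged sketch of wiring that into the recorder (the handler name and the last_text bookkeeping are illustrative; the callback parameter itself is one of those named above):

    last_text = ""

    def on_stabilized_text(current_text):
        # print only the tail that wasn't part of the previously seen text
        global last_text
        if current_text and current_text != last_text:
            print(current_text[len(last_text):], flush=True)
            last_text = current_text

    recorder_config = {
        'use_microphone': False,
        'enable_realtime_transcription': True,
        'on_realtime_transcription_stabilized': on_stabilized_text,
    }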

This approach isn't entirely reliable: the initial characters of the current realtime text may change with each transcription update (they aren't guaranteed to stay stable across the multiple transcriptions that occur while the current sentence grows), which makes detecting new words unstable.

For true word-by-word transcription, you'd need to extract word timestamps (with stable-whisper or faster_whisper, which RealtimeSTT uses) and then trim the audio at the boundaries of already detected words. Precise word-level timestamp extraction demands significant GPU power, though, and the trimming process impacts transcription quality, so because of these trade-offs RealtimeSTT does not implement this.
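
For reference, a small sketch of word-timestamp extraction with faster_whisper (this runs outside RealtimeSTT; the model size and audio file are placeholders):

    from faster_whisper import WhisperModel

    model = WhisperModel("tiny.en")
    segments, _info = model.transcribe("utterance.wav", word_timestamps=True)

    for segment in segments:
        for word in segment.words:
            # each word carries its start/end time in seconds
            print(f"{word.start:.2f}-{word.end:.2f} {word.word}")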

Thanks for these details.