Only seems to work for me with `--no-vad` and `--allow-padding`
expenses opened this issue · 2 comments
Hey, thanks for making this! I was looking around for something that did live STT and this seems to work well!
Reading through the code, I'm very confused by the `allow_padding` variable. I couldn't get the code to work at all without `--allow-padding`. Maybe document what this code is doing?
whispering/whispering/transcriber.py
Lines 264 to 272 in 9123181
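For anyone else hitting this, the check seems to behave roughly like the sketch below. This is a guess reconstructed from the log output further down (mel of shape [80, 375] either triggering "Padding is not expected while speaking" or being padded up to a [80, 3000] segment), not the actual source; `maybe_pad_segment` is a hypothetical helper name.

```python
import numpy as np

N_FRAMES = 3000  # mel frames in one 30-second Whisper window

def maybe_pad_segment(mel: np.ndarray, seek: int, allow_padding: bool):
    """Sketch of the padding decision: return a full N_FRAMES segment,
    zero-padding if allowed, or None to signal "wait for more audio"."""
    remaining = mel.shape[-1] - seek
    if remaining < N_FRAMES:
        if not allow_padding:
            # Without --allow-padding, a short buffer is not transcribed yet
            return None
        # With --allow-padding, zero-pad up to a full 30-second window,
        # matching the log's segment.shape: torch.Size([80, 3000])
        return np.pad(mel[:, seek:], ((0, 0), (0, N_FRAMES - remaining)))
    return mel[:, seek:seek + N_FRAMES]
```

That would explain why nothing is transcribed without the flag until 30 seconds of audio have accumulated.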
Additionally, and maybe this is because my mic isn't loud enough, the VAD didn't seem to work super well. I got it working for a bit at the start of recording when I had `--allow-padding`, but then it reported 'No speech' no matter how loudly I spoke. I'll have to try adjusting my mic volume to see if I can fix that.
Logs
Here's a section of logging:
[2022-10-14 16:42:33,000] transcriber._deal_timestamp:227 DEBUG -> Length of buffer: 8
[2022-10-14 16:42:33,000] transcriber.transcribe:319 DEBUG -> new seek=3000, mel.shape: torch.Size([80, 375])
[2022-10-14 16:42:33,000] transcriber.transcribe:322 DEBUG -> ctx.buffer_mel is None (torch.Size([80, 375]), 3000)
[2022-10-14 16:42:35,730] cli.transcribe_from_mic:75 DEBUG -> Audio #: 2, The rest of queue: 0
[2022-10-14 16:42:35,730] cli.transcribe_from_mic:90 DEBUG -> Got. The rest of queue: 0
[2022-10-14 16:42:35,730] transcriber.transcribe:235 DEBUG -> 60000
[2022-10-14 16:42:35,733] transcriber.transcribe:252 DEBUG -> Incoming new_mel.shape: torch.Size([80, 375])
[2022-10-14 16:42:35,733] transcriber.transcribe:259 DEBUG -> mel.shape: torch.Size([80, 375])
[2022-10-14 16:42:35,733] transcriber.transcribe:263 DEBUG -> seek: 0
[2022-10-14 16:42:35,733] transcriber.transcribe:265 DEBUG -> mel.shape (375) - seek (0) < N_FRAMES (3000)
[2022-10-14 16:42:35,733] transcriber.transcribe:269 WARNING -> Padding is not expected while speaking
[2022-10-14 16:42:35,733] transcriber.transcribe:280 DEBUG -> seek=0, timestamp=24.0, mel.shape: torch.Size([80, 375]), segment.shape: torch.Size([80, 3000])
[2022-10-14 16:42:35,734] transcriber._decode_with_fallback:103 DEBUG -> DecodeOptions: DecodingOptions(task='transcribe', language='en', temperature=0.0, sample_len=None, best_of=None, beam_size=5, patience=None, length_penalty=None, prompt=[50363, 18435, 11, 3387, 670, 13, 50463, 50463], prefix=None, suppress_blank=True, suppress_tokens='-1', without_timestamps=False, max_initial_timestamp=1.0, fp16=False)
[2022-10-14 16:42:37,208] transcriber.transcribe:288 DEBUG -> Result: temperature=0.00, no_speech_prob=0.06, avg_logprob=-0.66
[2022-10-14 16:42:37,209] transcriber._deal_timestamp:201 DEBUG -> Length of consecutive: 0, timestamps: tensor([50363, 50713])
[2022-10-14 16:42:37,209] transcriber._deal_timestamp:212 DEBUG -> segment_duration: 30.0, Duration: 7.0
Environment
- OS: Arch Linux 5.19.13
- Python Version: 3.10.8
- Whispering version: 9123181
Hello, thank you for the report!
First, I fixed the documentation, because it was not clear (ae1dbd7).
Currently, 30-second speech segments are needed for Whisper analysis.
This means that 8 consecutive intervals of 3.75 seconds must each be judged by VAD to contain speech.
I will improve this behavior (#13).
Please use a larger number for `-n` if 3.75 seconds is too short an interval for VAD to analyze:
# VAD for every 7.5 seconds
whispering --language en --model tiny -n 40
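The relationship between `-n` and the VAD interval implied by the numbers in this thread can be sketched as follows. `BLOCK_SAMPLES` is an assumption inferred from the 60000-sample buffer in the log and the statement that `-n 40` gives 7.5-second intervals; it is not taken from the source.

```python
SAMPLE_RATE = 16_000   # Hz, Whisper's input sample rate
BLOCK_SAMPLES = 3_000  # assumed samples per block: 40 * 3000 / 16000 = 7.5 s

def vad_interval_seconds(n: int) -> float:
    """Seconds of audio judged by VAD per interval for a given -n value."""
    return n * BLOCK_SAMPLES / SAMPLE_RATE
```

Under this assumption, the 3.75-second interval the reporter saw corresponds to `-n 20`, and doubling `-n` doubles the audio each VAD decision sees.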
If you still have questions, please feel free to reopen this!