shirayu/whispering

Only seems to work for me with `--no-vad` and `--allow-padding`

expenses opened this issue · 2 comments

Hey, thanks for making this! I was looking around for something that did live STT and this seems to work well!

Reading through the code, I'm very confused by the `allow_padding` variable. I couldn't get the code to work at all without `--allow-padding`. Maybe document what this code is doing?

```python
if mel.shape[-1] - seek < N_FRAMES:
    logger.debug(
        f"mel.shape ({mel.shape[-1]}) - seek ({seek}) < N_FRAMES ({N_FRAMES})"
    )
    if ctx.allow_padding:
        logger.warning("Padding is not expected while speaking")
    else:
        logger.debug("No padding")
        break
```
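For what it's worth, my best guess from the logs below (where `segment.shape` ends up as `[80, 3000]`) is that `--allow-padding` zero-pads the leftover mel frames up to `N_FRAMES`, so Whisper always sees a full 30-second window. Something like this sketch, where the `F.pad` call and the stand-in values are my own guesses, not the actual whispering code:

```python
import torch
import torch.nn.functional as F

N_FRAMES = 3000  # Whisper's fixed 30-second mel window

# Stand-ins for the values from my log: an [80, 375] mel and seek=0.
mel = torch.zeros(80, 375)
seek = 0

# My guess at what --allow-padding does: take the mel frames left
# past `seek` and zero-pad them on the right to a full N_FRAMES.
segment = mel[:, seek:]
if segment.shape[-1] < N_FRAMES:
    segment = F.pad(segment, (0, N_FRAMES - segment.shape[-1]))

print(segment.shape)  # torch.Size([80, 3000]), matching the log
```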

Additionally (and maybe this is because my mic isn't loud enough), the VAD didn't seem to work super well. I got it working for a bit at the start of recording when I had `--allow-padding`, but then it seemed to report 'No speech' no matter how loudly I spoke. I'll have to try adjusting my mic volume to see if I can fix that.

Logs

Here's a section of logging:

[2022-10-14 16:42:33,000] transcriber._deal_timestamp:227 DEBUG -> Length of buffer: 8
[2022-10-14 16:42:33,000] transcriber.transcribe:319 DEBUG -> new seek=3000, mel.shape: torch.Size([80, 375])
[2022-10-14 16:42:33,000] transcriber.transcribe:322 DEBUG -> ctx.buffer_mel is None (torch.Size([80, 375]), 3000)
[2022-10-14 16:42:35,730] cli.transcribe_from_mic:75 DEBUG -> Audio #: 2, The rest of queue: 0
[2022-10-14 16:42:35,730] cli.transcribe_from_mic:90 DEBUG -> Got. The rest of queue: 0
[2022-10-14 16:42:35,730] transcriber.transcribe:235 DEBUG -> 60000
[2022-10-14 16:42:35,733] transcriber.transcribe:252 DEBUG -> Incoming new_mel.shape: torch.Size([80, 375])
[2022-10-14 16:42:35,733] transcriber.transcribe:259 DEBUG -> mel.shape: torch.Size([80, 375])
[2022-10-14 16:42:35,733] transcriber.transcribe:263 DEBUG -> seek: 0
[2022-10-14 16:42:35,733] transcriber.transcribe:265 DEBUG -> mel.shape (375) - seek (0) < N_FRAMES (3000)
[2022-10-14 16:42:35,733] transcriber.transcribe:269 WARNING -> Padding is not expected while speaking
[2022-10-14 16:42:35,733] transcriber.transcribe:280 DEBUG -> seek=0, timestamp=24.0, mel.shape: torch.Size([80, 375]), segment.shape: torch.Size([80, 3000])
[2022-10-14 16:42:35,734] transcriber._decode_with_fallback:103 DEBUG -> DecodeOptions: DecodingOptions(task='transcribe', language='en', temperature=0.0, sample_len=None, best_of=None, beam_size=5, patience=None, length_penalty=None, prompt=[50363, 18435, 11, 3387, 670, 13, 50463, 50463], prefix=None, suppress_blank=True, suppress_tokens='-1', without_timestamps=False, max_initial_timestamp=1.0, fp16=False)
[2022-10-14 16:42:37,208] transcriber.transcribe:288 DEBUG -> Result: temperature=0.00, no_speech_prob=0.06, avg_logprob=-0.66
[2022-10-14 16:42:37,209] transcriber._deal_timestamp:201 DEBUG -> Length of consecutive: 0, timestamps: tensor([50363, 50713])
[2022-10-14 16:42:37,209] transcriber._deal_timestamp:212 DEBUG -> segment_duration: 30.0, Duration: 7.0

Environment

  • OS: Arch Linux 5.19.13
  • Python Version: 3.10.8
  • Whispering version: 9123181

Hello, thank you for the report!

First, I fixed the documentation, since it was not clear (ae1dbd7).

Currently, 30-second speech segments are needed for Whisper analysis.
This means that 8 consecutive 3.75-second intervals must each be judged by the VAD to contain speech.
I will improve this behavior (#13).

Please use a larger number for `-n` if 3.75 seconds is too short for the VAD to analyze.

```
# VAD for every 7.5 seconds
whispering --language en --model tiny -n 40
```
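(For reference: if `-n 40` gives 7.5 seconds, each unit of `-n` corresponds to 0.1875 seconds of audio, so the default 3.75-second interval would be equivalent to `-n 20`, assuming the mapping is linear.)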

If you still have questions, please feel free to reopen this!

I will also add a VAD threshold option (86f38c6) in the next release.