no_speech_probability

Question

no_speech_probability

rizwanishaq opened this issue 16 days ago · 5 comments

The `pipeline` is designed to be a high-level wrapper that goes from audio inputs -> text outputs. Anytime we want something more granular than that, it's best to use the `model` + `processor` API:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
import torch

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = librispeech_dummy[0]["audio"]

input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

outputs = model.generate(
    input_features, output_scores=True, return_dict_in_generate=True, max_new_tokens=128
)

transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

pred_text = processor.batch_decode(outputs.sequences, skip_special_tokens=True)
pred_language = processor.batch_decode(outputs.sequences[:, 1:2], skip_special_tokens=False)
lang_prob = torch.exp(transition_scores[:, 0])

print(pred_text)
print(pred_language)
print(lang_prob)

Print Output:

[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']
['<|en|>']
tensor([1.])

Originally posted by @sanchit-gandhi in #25138 (comment)

How can we get the no-speech probability with this code?

Answer 1 · 2024-05-13T14:01:40.000Z

cc @sanchit-gandhi @ylacombe @kamilakesbi

Answer 2 · 2024-05-13T15:44:32.000Z

Hey @rizwanishaq, you can simply add the `no_speech_threshold` argument to the `generate` method:

outputs = model.generate(
    input_features, output_scores=True, return_dict_in_generate=True, max_new_tokens=128, no_speech_threshold=0.2
)

Let me know if that works!

Answer 3 · 2024-05-13T15:49:41.000Z

Hey @ylacombe, I get this warning: "Audio input consists of only 3000. Short-form transcription is activated.no_speech_threshold is set to 0.3, but will be ignored."

Answer 4 · 2024-05-13T16:23:36.000Z

Then it's related to a shortcoming of our Whisper implementation that we hope to fix soon: we're not applying some of the features used for long-form generation to short audios.

We should be resolving this issue quite soon. I'll keep you up to date.
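In the meantime, a possible workaround for short audio is to compute the value by hand. The sketch below mirrors how the original OpenAI Whisper code derives it: run a single decoder step from the start-of-transcript token, take a softmax over the vocabulary, and read the probability assigned to the no-speech token. It is not part of the `transformers` API. The helper names `softmax_prob` and `no_speech_probability` are hypothetical, and taking the no-speech token id as `no_timestamps_token_id - 1` is an assumption about the checkpoint's vocabulary layout that should be checked against the tokenizer.

```python
import math


def softmax_prob(logits, index):
    """Probability of position `index` under a softmax over `logits` (a list of floats)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[index] / sum(exps)


def no_speech_probability(model, input_features):
    """Hypothetical helper: one decoder step from <|startoftranscript|>, then P(no-speech).

    Returns one float per audio in the batch.
    """
    import torch  # imported here so softmax_prob stays usable without torch

    sot_id = model.config.decoder_start_token_id
    # Assumption: the no-speech token sits directly before <|notimestamps|>.
    no_speech_id = model.generation_config.no_timestamps_token_id - 1

    decoder_input_ids = torch.full(
        (input_features.shape[0], 1), sot_id,
        dtype=torch.long, device=input_features.device,
    )
    with torch.no_grad():
        logits = model(
            input_features=input_features, decoder_input_ids=decoder_input_ids
        ).logits  # shape: (batch, 1, vocab_size)

    # Position 0 holds the prediction made right after the start token.
    return [softmax_prob(row.float().tolist(), no_speech_id) for row in logits[:, 0]]
```

With the `model` and `input_features` from the snippet at the top of this thread, `no_speech_probability(model, input_features)` should give one probability per sample, which can then be compared against whatever threshold you would have passed as `no_speech_threshold`.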

Answer 5 · 2024-05-13T16:27:59.000Z

This is a duplicate of #29595, so I'll close this issue. Let's talk in #29595 if you have further questions!