huggingface/distil-whisper

Speculative Decoding: TypeError: list indices must be integers or slices, not tuple (Apple M1, macOS Sonoma 14.6.1)

solitaryangler opened this issue · 0 comments

Hi,

I am trying to run speculative decoding using the example from the model card: huggingface.co/distil-whisper/distil-large-v2#speculative-decoding. This is the code I am running:

from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

assistant_model_id = "distil-whisper/distil-large-v2"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

model_id = "openai/whisper-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample, return_timestamps=True)
print(result["text"])

My environment is Python 3.10.13 with the following packages (non-exhaustive list):

torch==2.6.0.dev20240925
torchaudio==2.5.0.dev20240925
torchvision==0.20.0.dev20240925
ffmpeg-python==0.2.0
future==1.0.0
librosa==0.10.2.post1
transformers==4.45.0
accelerate==0.34.2
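
In case it helps with reproducing, a version list like the one above can be gathered with a short script. This is only a sketch: the `PACKAGES` list and the `collect_versions` helper are names I made up for illustration, and any package that is not installed is reported as such rather than raising.

```python
import platform
from importlib.metadata import PackageNotFoundError, version

# Illustrative list: the packages named in this report that seem most relevant.
PACKAGES = ["torch", "torchaudio", "torchvision", "transformers", "accelerate", "librosa"]


def collect_versions(packages):
    """Return a {package: version} mapping, marking missing packages."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report


env = collect_versions(PACKAGES)

# Print the interpreter and OS first, then one pinned line per package.
print(f"python=={platform.python_version()} on {platform.platform()}")
for name, ver in env.items():
    print(f"{name}=={ver}")
```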

I am running everything on an Apple M1 chip with macOS Sonoma 14.6.1.

I am getting the following error:

miniconda3/envs/py3.10.13/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py:496: FutureWarning: The input name `inputs` is deprecated. Please make sure to use `input_features` instead.
  warnings.warn(
Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.
Traceback (most recent call last):
  File "test_specdec.py", line 41, in <module>
    result = pipe(sample, return_timestamps=True)
  File "miniconda3/envs/py3.10.13/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 284, in __call__
    return super().__call__(inputs, **kwargs)
  File "miniconda3/envs/py3.10.13/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1260, in __call__
    return next(
  File "miniconda3/envs/py3.10.13/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "miniconda3/envs/py3.10.13/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 269, in __next__
    processed = self.infer(next(self.iterator), **self.params)
  File "miniconda3/envs/py3.10.13/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1175, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "miniconda3/envs/py3.10.13/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 512, in _forward
    tokens = self.model.generate(
  File "miniconda3/envs/py3.10.13/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py", line 671, in generate
    ) = self.generate_with_fallback(
  File "miniconda3/envs/py3.10.13/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py", line 834, in generate_with_fallback
    seek_outputs = super().generate(
  File "miniconda3/envs/py3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "miniconda3/envs/py3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 1992, in generate
    result = self._assisted_decoding(
  File "miniconda3/envs/py3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 4015, in _assisted_decoding
    candidate_input_ids, candidate_logits = candidate_generator.get_candidates(input_ids)
  File "miniconda3/envs/py3.10.13/lib/python3.10/site-packages/transformers/generation/candidate_generator.py", line 207, in get_candidates
    self.assistant_kwargs["past_key_values"] = _crop_past_key_values(
  File "miniconda3/envs/py3.10.13/lib/python3.10/site-packages/transformers/generation/candidate_generator.py", line 404, in _crop_past_key_values
    past_key_values[idx][0][:, :, :max_length, :],
TypeError: list indices must be integers or slices, not tuple
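
The last frame indexes `past_key_values[idx][0]` with a multi-dimensional slice (`[:, :, :max_length, :]`). That pattern works on a tensor, but Python raises exactly this message when the object being indexed is a plain list. The sketch below reproduces the message outside of transformers. It is only an illustration of the failure mode, not a diagnosis of where the list comes from; `layer_cache` and `crop` are hypothetical names, not library code.

```python
# Hypothetical stand-in for one cache entry: a plain Python list where the
# cropping code expects a tensor. Values are arbitrary.
layer_cache = [[0.0, 1.0], [2.0, 3.0]]
max_length = 1


def crop(entry, max_length):
    # Same indexing pattern as the failing line in the traceback.
    # On a list, the comma-separated slices form a tuple index, which
    # list.__getitem__ rejects.
    return entry[:, :, :max_length, :]


try:
    crop(layer_cache, max_length)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
# prints: list indices must be integers or slices, not tuple
```

So my guess is that, somewhere along the assisted-decoding path, the cache entry reaching `_crop_past_key_values` is a list rather than a tensor, but I have not verified this.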

Kindly help!
Thanks.