Takes too much VRAM to transcribe audio files
Hi,
Thank you for your work.
I tested CrisperWhisper today with a 2-minute audio clip on an NVIDIA A100 GPU. The model's VRAM footprint is only 3.5 GB, which is great. However, when processing the 2-minute audio, I get a CUDA out-of-memory error as GPU usage climbs above 40 GB.
Is this something that will be fixed soon? If not, what would be the best way to handle long audio files?
How are you running this exactly? 40 GB should definitely be more than sufficient VRAM :)
Have you tried running the example code from the repo like this, replacing 'your_audio_path.mp3' with your actual audio file?
```python
import os
import sys
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
# from utils import adjust_pauses_for_hf_pipeline_output

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "nyrahealth/CrisperWhisper"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps='word',
    torch_dtype=torch_dtype,
    device=device,
)

hf_pipeline_output = pipe('your_audio_path.mp3')
print(hf_pipeline_output)
```
If this does not solve your issue, please send me some code so I can reproduce it :)
Thank you for your response. I have given my code below.
I am running it in Google Colab on an A100 GPU. I am using the same code you sent, after installing the required libraries and logging into Hugging Face. I get a CUDA OOM error when transcribing a 2-minute audio clip.
```python
!pip install torch torchaudio
!pip install transformers
!pip install accelerate
!huggingface-cli login

# From Laurin
import os
import sys
import torch
# from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
# from utils import adjust_pauses_for_hf_pipeline_output

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "nyrahealth/CrisperWhisper"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps='word',
    torch_dtype=torch_dtype,
    device=device,
)

hf_pipeline_output = pipe('/content/2min.wav')
print(hf_pipeline_output)
```
Same on A100 80G:
```
python transcribe.py --f audio.aac
```
```
An error occurred while transcribing the audio: CUDA out of memory. Tried to allocate 768.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 164.75 MiB is free. Including non-PyTorch memory, this process has 78.97 GiB memory in use. Of the allocated memory 73.51 GiB is allocated by PyTorch, and 4.96 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
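For what it's worth, the allocator hint from the error message can be tried by setting `PYTORCH_CUDA_ALLOC_CONF` before PyTorch makes its first CUDA allocation. A minimal sketch (this only reduces fragmentation; it won't fix a genuine memory shortfall):

```python
# Set the allocator hint before any CUDA allocation happens,
# e.g. at the very top of the transcription script, before the model is loaded.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```

Equivalently, it can be set on the command line: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python transcribe.py --f audio.aac`.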
Okay, lowering the batch size (for example, to 1 in the extreme case) so it fits your GPU and/or adjusting the beam size should resolve your issue. Could you try this out and let me know how it went?
You can adjust the number of beams via the generate_kwargs argument:
```python
hf_pipeline_output = pipe('/content/2min.wav', generate_kwargs={"num_beams": 1})
```
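For example, a minimal sketch combining both suggestions (assuming the same `model`, `processor`, `device`, and `torch_dtype` as in the example above; only the batch size is lowered and greedy decoding is requested):

```python
# model, processor, device, torch_dtype as defined in the example code above
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=1,  # lower this until it fits on your GPU
    return_timestamps='word',
    torch_dtype=torch_dtype,
    device=device,
)

# num_beams=1 disables beam search, which also reduces memory during decoding
hf_pipeline_output = pipe('/content/2min.wav', generate_kwargs={"num_beams": 1})
```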
I am getting OOM on an A100 40 GB with batch size 1 for 5-minute audio files. I took the example code and changed batch_size to 1.
With the code below, a batch size of 1 gives the following error:
```
RuntimeError: The expanded size of the tensor (4) must match the existing size (5) at non-singleton dimension 0. Target sizes: [4]. Tensor sizes: [5]
```
With a batch size of 2:
```
RuntimeError: The expanded size of the tensor (6) must match the existing size (7) at non-singleton dimension 0. Target sizes: [6]. Tensor sizes: [7]
```
With a batch size of 3:
```
RuntimeError: The expanded size of the tensor (7) must match the existing size (8) at non-singleton dimension 0. Target sizes: [7]. Tensor sizes: [8]
```
With a batch size of 4, it gives the out-of-memory error:
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 104.00 MiB. GPU 0 has a total capacity of 15.89 GiB of which 9.12 MiB is free. Process 20315 has 15.88 GiB memory in use. Of the allocated memory 14.53 GiB is allocated by PyTorch, and 1.06 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
Here's the code. It's as above, except with `generate_kwargs={"num_beams": 1}`:
```python
import os
import sys
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from utils import adjust_pauses_for_hf_pipeline_output

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "nyrahealth/CrisperWhisper"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=10,
    batch_size=4,
    return_timestamps='word',
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

hf_pipeline_output = pipe("/kaggle/input/idkuney/idkuney.aac", generate_kwargs={"num_beams": 1})
crisper_whisper_result = adjust_pauses_for_hf_pipeline_output(hf_pipeline_output)
print(crisper_whisper_result)
```