Why does the transcription speed significantly decrease when the WhisperModel instance is wrapped inside a class attribute?

Question

Closed this issue 2 months ago · 4 comments

Environment: Windows 10, Python 3.10, torch==2.1.2+cu121

In the following code, a and b should in theory behave identically. The only difference is that in a, the WhisperModel is held as an attribute of a wrapper class. In actual testing, however, the measured execution time of the two differs by more than 10 times. Why?

import time

from faster_whisper import WhisperModel


class Whisper:
    def __init__(self, model="whisper-medium"):
        # Run on GPU with FP16
        self.model = WhisperModel(model, device="cuda", compute_type="float16")

    def get_text(self, audio_path):
        segments, info = self.model.transcribe(audio_path, vad_filter=True)
        return "".join([segment.text for segment in segments])


model = r".\model\whisper-medium"
audio_path = r".\test.wav"
a = Whisper(model)
b = WhisperModel(model, device="cuda", compute_type="float16")

now = time.time()
b.transcribe(audio_path, vad_filter=True)
print(time.time() - now)

now = time.time()
a.get_text(audio_path)
print(time.time() - now)

Answer 1 · 2024-08-14T10:50:06.000Z

Try reversing the order, i.e. measure a.get_text() first and then b.transcribe(). My guess is that the second computation will always be faster due to low-level GPU caches.

Answer 2 · 2024-08-14T12:00:18.000Z

@dorinclisu Thanks, but it didn't work: the result is the same after changing the order. For a 2-hour audio file, method b takes only 49 seconds, while method a takes 831 seconds.

Answer 3 · 2024-08-14T13:29:15.000Z

This does not actually run the transcription:

now = time.time()
b.transcribe(audio_path, vad_filter=True)
print(time.time() - now)

This is the correct way to time the transcription:

now = time.time()
"".join(segment.text for segment in b.transcribe(audio_path, vad_filter=True)[0])
print(time.time() - now)

Transcription is done lazily: transcribe() returns a generator of segments, and the actual work only happens when you iterate over it. Please check the README.
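To make the lazy behaviour concrete without needing a GPU or a model, here is a minimal sketch. Note that fake_transcribe is a hypothetical stand-in, not part of faster_whisper: it only mimics the shape of the return value (a generator of segments plus an info object) and uses time.sleep to simulate per-segment decoding work.

```python
import time


def fake_transcribe(n_segments=5, delay=0.02):
    """Hypothetical stand-in for a lazy transcribe(): returns (generator, info)."""

    def segments():
        for i in range(n_segments):
            time.sleep(delay)  # simulates the decoding work for one segment
            yield f"segment {i} "

    return segments(), {"n_segments": n_segments}


# Timing only the call: the generator is created, but no segment is produced yet.
start = time.perf_counter()
segments, info = fake_transcribe()
call_only = time.perf_counter() - start

# Timing the call plus iteration: every segment is actually produced.
start = time.perf_counter()
segments, info = fake_transcribe()
text = "".join(segments)
full_run = time.perf_counter() - start

print(f"call only: {call_only:.4f}s, call + iteration: {full_run:.4f}s")
```

The same pattern would explain the numbers in the question: the b timing only measures creating the generator, while a.get_text() also consumes it, so the two measurements are not comparable until both iterate over the segments.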

Answer 4 · 2024-08-14T14:48:06.000Z

@MahmoudAshraf97 Yes, you are right!
