Why does the transcription speed significantly decrease when the WhisperModel instance is wrapped inside a class attribute?
Closed this issue · 4 comments
Environment: Windows 10, Python 3.10, torch==2.1.2+cu121
In the following code, the instances `a` and `b` should, in theory, behave identically; the only difference is that in `a` the WhisperModel is held as an attribute of a wrapper class. Yet in actual testing, the execution speed of the two approaches differs by more than 10x. Why?
```python
import time
from faster_whisper import WhisperModel

class Whisper:
    def __init__(self, model="whisper-medium"):
        # Run on GPU with FP16
        self.model = WhisperModel(model, device="cuda", compute_type="float16")

    def get_text(self, audio_path):
        segments, info = self.model.transcribe(audio_path, vad_filter=True)
        return "".join([segment.text for segment in segments])

model = r".\model\whisper-medium"
audio_path = r".\test.wav"

a = Whisper(model)
b = WhisperModel(model, device="cuda", compute_type="float16")

now = time.time()
b.transcribe(audio_path, vad_filter=True)
print(time.time() - now)

now = time.time()
a.get_text(audio_path)
print(time.time() - now)
```
Try reversing the order, i.e. measure `a.get_text()` first and then `b.transcribe()`. My guess is that the second computation will always be faster due to low-level GPU caches.
@dorinclisu Thanks, but that didn't help; the phenomenon is the same after swapping the order. For a 2-hour audio file, method `b` takes only 49 seconds, while method `a` takes 831 seconds.
This is not transcription:

```python
now = time.time()
b.transcribe(audio_path, vad_filter=True)
print(time.time() - now)
```

This is the correct way to time the transcription:

```python
now = time.time()
"".join(segment.text for segment in b.transcribe(audio_path, vad_filter=True)[0])
print(time.time() - now)
```

Transcription is done lazily; please check the README.
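To make the lazy behavior concrete, here is a minimal sketch in plain Python (not using faster_whisper; `fake_transcribe` is a stand-in) of why the first timing above measures almost nothing: calling a function that returns a generator is nearly instant, and the expensive per-segment work only runs when the generator is iterated.

```python
import time

def fake_transcribe():
    """Stand-in for WhisperModel.transcribe: returns (segments, info)
    where segments is a generator that does the real work lazily."""
    def segments():
        for i in range(3):
            time.sleep(0.2)  # stand-in for per-segment model inference
            yield f"segment {i}"
    return segments(), {"language": "en"}

start = time.time()
segments, info = fake_transcribe()  # returns almost immediately
setup_time = time.time() - start

start = time.time()
text = "".join(segments)            # iteration triggers the actual work
decode_time = time.time() - start

print(f"setup: {setup_time:.3f}s, decode: {decode_time:.3f}s")
```

So in the original benchmark, `b.transcribe(...)` alone never decoded any audio, while `a.get_text(...)` consumed the generator and did the full transcription, which explains the "10x" gap.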
@MahmoudAshraf97 Yes, you are right!