This script modifies methods of Whisper's model to gain access to the predicted timestamp tokens of each word (token) without needing additional inference. It also stabilizes the timestamps down to the word (token) level so that they stay in chronological order.
- Add function to stabilize with multiple inferences
- Add word timestamping (timestamps are currently token-based only)
git clone https://github.com/jianfch/stable-ts.git
cd stable-ts
import whisper
from stable_whisper import modify_model
model = whisper.load_model('base', 'cuda')
modify_model(model)
results = model.transcribe('audio.mp3')
stab_segments = results['segments']
first_segment_token_timestamps = stab_segments[0]['word_timestamps']
# or to get token timestamps that adhere more to the top prediction
from stable_whisper import stabilize_timestamps
stab_segments = stabilize_timestamps(results, top_focus=True)
# token-level
from stable_whisper import results_to_token_srt
# after you get results from the modified model
# this treats the token timestamps as the end time of the tokens
results_to_token_srt(results, 'audio.srt')  # will combine tokens if their timestamps overlap
# sentence-level
from stable_whisper import results_to_sentence_srt
# after you get results from the modified model
results_to_sentence_srt(results, 'audio.srt')
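The token-level comments say that each timestamp is treated as the end time of its token and that overlapping tokens are combined. The following is a minimal sketch of one way that could work, not the implementation behind `results_to_token_srt`. The input layout (a list of `(text, end_time)` pairs) and the helper names `format_srt_time` and `tokens_to_srt` are assumptions made for illustration.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def tokens_to_srt(tokens, segment_start=0.0):
    """Build SRT text from (text, end_time) pairs.

    Hypothetical helper: each token starts where the previous one ended,
    and a token whose end time does not advance is merged into the
    previous cue.
    """
    cues = []  # each cue is [start, end, text]
    prev_end = segment_start
    for text, end in tokens:
        if cues and end <= cues[-1][1]:
            # Timestamp does not advance: fold the token into the previous cue.
            cues[-1][2] += text
            continue
        cues.append([prev_end, end, text])
        prev_end = end

    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(tokens_to_srt([(" Hello", 0.5), (" wor", 0.9), ("ld", 0.9)]))
```

Here the tokens ` wor` and `ld` share the same end time, so they end up in a single cue. Whether end time is the right interpretation of the timestamps depends on the model and audio.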
- The "word" timestamps are actually token timestamps. Since token:word is not always 1:1 (varies by language), you may need to do some additional processing to get individual word timings.
- The timing can still be out of sync depending on the model and audio.
- No extensive testing has been done to determine how the word timestamps should be interpreted. Whether a timestamp marks the beginning, middle, or end of the word (token), it is up to you to decide how to use it.
- The `unstable_word_timestamps` are left in the results, so you can possibly find a better way to utilize them.
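Because the timestamps are per token, getting per-word timings needs a grouping step. The sketch below shows one possible heuristic for space-delimited languages: a token that begins with a space starts a new word, and any other token continues the current one. It assumes each entry is a dict with `word` (the token text) and `timestamp` (seconds) keys; that layout and the function name `group_tokens_into_words` are assumptions for illustration, so check the actual structure of your results before relying on it.

```python
def group_tokens_into_words(token_timestamps):
    """Merge sub-word tokens into words.

    Assumes each entry is a dict with 'word' (token text) and 'timestamp'
    (seconds) keys, and that a leading space marks the start of a new word.
    Both the first and last token timestamps are kept, since it is not
    settled which part of the token a timestamp refers to.
    """
    words = []
    for entry in token_timestamps:
        text, ts = entry["word"], entry["timestamp"]
        if words and not text.startswith(" "):
            # Continuation token (or punctuation): extend the current word.
            words[-1]["word"] += text
            words[-1]["last_timestamp"] = ts
        else:
            words.append(
                {"word": text.strip(), "first_timestamp": ts, "last_timestamp": ts}
            )
    return words


if __name__ == "__main__":
    tokens = [
        {"word": " time", "timestamp": 1.0},
        {"word": "st", "timestamp": 1.2},
        {"word": "amps", "timestamp": 1.4},
        {"word": " work", "timestamp": 1.8},
    ]
    for word in group_tokens_into_words(tokens):
        print(word)
```

This heuristic does not apply to languages written without spaces between words, where a language-specific segmenter would be needed instead.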
This project is licensed under the MIT License - see the LICENSE file for details
Slight modification of the original work: