A small Python library to find differences between audio and transcriptions.
Example (audio as mp4 to allow an embed):
sphere.mp4
from transcription_diff.text_diff import transcription_diff, render_text_diff
diff = transcription_diff("You can go pretty far in life if you're a perfect sphere in a vacuum", "sphere.mp4")
print(render_text_diff(diff))
! Well
You can go pretty far in life
! when
+ if
you're a perfect sphere in a vacuum
- The library relies on openai-whisper to perform Audio Speech Recognition unguided by the transcription
- It then compares the expected transcription to the output of Whisper, ignoring superfluous characters
- It returns the output in a simple structure, keeping the original text format of the transcription
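The comparison step described above can be sketched with the standard library alone. This is a minimal illustration, not the library's actual implementation: `normalize` and `word_diff` are hypothetical names, the Whisper output is passed in as a plain string rather than produced from audio, and "superfluous characters" is assumed here to mean case and punctuation.

```python
import difflib
import re


def normalize(word: str) -> str:
    """Lowercase a word and drop everything except letters, digits and apostrophes."""
    return re.sub(r"[^a-z0-9']", "", word.lower())


def word_diff(expected: str, asr_output: str) -> list[tuple[str, str, str]]:
    """Align two texts word by word on their normalized forms.

    Returns (tag, expected_words, asr_words) regions where tag is one of
    'equal', 'insert', 'delete' or 'replace'. The words in each region keep
    their original casing and punctuation.
    """
    exp_words = expected.split()
    asr_words = asr_output.split()
    matcher = difflib.SequenceMatcher(
        a=[normalize(w) for w in exp_words],
        b=[normalize(w) for w in asr_words],
        autojunk=False,
    )
    regions = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        regions.append((tag, " ".join(exp_words[i1:i2]), " ".join(asr_words[j1:j2])))
    return regions


if __name__ == "__main__":
    expected = "You can go pretty far in life if you're a perfect sphere in a vacuum"
    # Hypothetical ASR output for the example audio
    heard = "Well, you can go pretty far in life when you're a perfect sphere in a vacuum."
    for region in word_diff(expected, heard):
        print(region)
```

Matching on normalized words while reporting the original ones is what lets the result keep the transcription's own formatting.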
- Only a single hypothesis is considered for the ASR output, so an alternative hypothesis that would match the expected transcription may be missed
- The ASR output is not in the phoneme space, making homophones prone to false positives
- Rare words unknown to Whisper must be passed explicitly to the function, and even then are not guaranteed to be recognized correctly
- Currently only annotates up to 30 seconds of audio per sample
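To make the homophone limitation concrete, here is a small sketch of why a comparison in text space flags words that sound identical. The helper names are hypothetical and the position-by-position pairing is a simplification that assumes both texts have the same number of words; it is not how the library aligns text.

```python
import re


def normalize(word: str) -> str:
    """Lowercase a word and drop everything except letters, digits and apostrophes."""
    return re.sub(r"[^a-z0-9']", "", word.lower())


def mismatched_words(expected: str, asr_output: str) -> list[tuple[str, str]]:
    """Pair words by position and return the pairs whose normalized spelling differs."""
    pairs = zip(expected.split(), asr_output.split())
    return [(e, a) for e, a in pairs if normalize(e) != normalize(a)]


if __name__ == "__main__":
    # "their" and "there" are pronounced the same, so the audio may be correct,
    # yet a spelling-based comparison still reports a difference.
    print(mismatched_words(
        "They left their coats over there.",
        "They left there coats over there",
    ))
```

A comparison in phoneme space would treat both spellings as the same sound, which is the motivation for the phoneme-level item further down.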
pip install transcription-diff
- Phoneme-level comparison
- User handling of model cache
- Support for audio longer than 30 seconds
- More robust support for non-English languages
- Inverse normalization support for fewer false positives