transcription-diff

A small Python library to find differences between audio and transcriptions

Primary language: Python. License: MIT.

Example (the audio, sphere.mp4, is an mp4 so that it can be embedded):

from transcription_diff.text_diff import transcription_diff, render_text_diff

diff = transcription_diff("You can go pretty far in life if you're a perfect sphere in a vacuum", "sphere.mp4")
print(render_text_diff(diff))

Output:

! Well
You can go pretty far in life
! when
+ if
you're a perfect sphere in a vacuum

Mechanism

  • The library relies on openai-whisper to perform automatic speech recognition (ASR), unguided by the transcription
  • It then compares the expected transcription to the output of Whisper, ignoring superfluous characters
  • It returns the output in a simple structure, keeping the original text format of the transcription
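
The comparison step above can be illustrated with a minimal sketch built only on the standard library. This is not the library's implementation: the `normalize` and `word_diff` helpers are hypothetical names, the normalization rule (lowercasing and dropping punctuation) is an assumption about what "superfluous characters" might mean, and the `recognized` string stands in for an ASR output rather than being produced by Whisper.

```python
import difflib
import re


def normalize(word: str) -> str:
    # Assumed normalization: lowercase, keep only letters, digits and apostrophes.
    return re.sub(r"[^a-z0-9']", "", word.lower())


def word_diff(expected: str, recognized: str):
    """Align two texts word by word, comparing normalized forms
    but reporting the words exactly as they were written."""
    exp_words = expected.split()
    rec_words = recognized.split()
    matcher = difflib.SequenceMatcher(
        a=[normalize(w) for w in exp_words],
        b=[normalize(w) for w in rec_words],
        autojunk=False,
    )
    regions = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # tag is one of "equal", "replace", "insert", "delete".
        regions.append((tag, " ".join(exp_words[i1:i2]), " ".join(rec_words[j1:j2])))
    return regions


expected = "You can go pretty far in life if you're a perfect sphere in a vacuum"
# Hypothetical ASR output, written by hand for this illustration.
recognized = "Well, you can go pretty far in life when you're a perfect sphere in a vacuum."

for tag, exp_text, rec_text in word_diff(expected, recognized):
    print(f"{tag:8} expected={exp_text!r} recognized={rec_text!r}")
```

Comparing normalized words while reporting the original ones is what lets casing and punctuation differences pass as matches without altering the text shown to the user.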

Limitations

  • Only a single hypothesis is considered for the ASR output, leaving the possibility of missing a hypothesis that would satisfy the expected transcription
  • The ASR output is text rather than phonemes, so homophones are prone to false positives
  • Rare words unknown to Whisper must be passed explicitly to the function, and even then there is no guarantee that Whisper will recognize them
  • Currently only annotates up to 30 seconds of audio per sample
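
The homophone limitation can be seen with a small standalone sketch. It does not call the library; it only shows that a text-level comparison flags two words that sound identical. The `text_mismatches` helper and both sentences are made up for illustration.

```python
import difflib


def text_mismatches(expected: str, recognized: str):
    """Return (expected, recognized) word pairs that differ at the text level."""
    exp_words = expected.split()
    rec_words = recognized.split()
    matcher = difflib.SequenceMatcher(a=exp_words, b=rec_words, autojunk=False)
    return [
        (" ".join(exp_words[i1:i2]), " ".join(rec_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# "knight" and "night" are pronounced the same, so the audio could be correct,
# yet a text comparison still reports a mismatch.
mismatches = text_mismatches(
    "the knight rode through the night",
    "the night rode through the night",
)
print(mismatches)
```

A phoneme-level comparison would treat the two words as equal, which is why text-level output leaves this kind of false positive in place.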

Installation

pip install transcription-diff

Short term TODOs

  • Phoneme-level comparison
  • User handling of model cache
  • Support for audio longer than 30 seconds

Long shot TODOs

  • More robust support for non-English languages
  • Inverse normalization support for fewer false positives