
Experimental code: sound file preprocessing to optimize Whisper transcriptions without hallucinated texts


See this discussion: openai/whisper#679


  • Remove noise by extracting the voice with Facebook Demucs or Deezer Spleeter.
  • Remove silences and normalize loudness with ffmpeg.
  • Remove noise parts using Silero VAD.
  • Add voice markers.
  • Apply a speech compressor (requires ffmpeg 4.4; Google Colab ships 4.2, so ffmpeg has to be upgraded, see below).
  • Try to transcribe. If the markers are present in the output, the transcription is OK.
  • If not, try again with inverted markers. If the markers are present in the output, the transcription is OK.
  • If not, try without markers.
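
The last three steps of the list above form a retry chain. It can be sketched as below; this is a minimal illustration, not the code in transcribeHallu.py. The names `transcribe_with_fallback`, `has_marker`, `MARKER` and `INVERTED_MARKER` are hypothetical, and the `transcribe` callable stands in for "add the given marker to the audio and run Whisper".

```python
from typing import Callable, Optional, Tuple

# Hypothetical marker phrases, modeled on the "Whisper, Ok." / "Ok, Whisper."
# wording used in the prompt of the usage example.
MARKER = "Whisper, Ok."
INVERTED_MARKER = "Ok, Whisper."


def _normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    kept = "".join(c if c.isalnum() else " " for c in text.lower())
    return " ".join(kept.split())


def has_marker(text: str, marker: str) -> bool:
    """True if the marker words appear in the text as whole words."""
    return f" {_normalize(marker)} " in f" {_normalize(text)} "


def transcribe_with_fallback(
    transcribe: Callable[[str, Optional[str]], str],
    audio_path: str,
    marker: str = MARKER,
    inverted: str = INVERTED_MARKER,
) -> Tuple[str, str]:
    """Try markers, then inverted markers, then no markers.

    Returns (mode, text), where mode names the attempt that was kept.
    """
    for mode, current in (("markers", marker), ("inverted", inverted)):
        text = transcribe(audio_path, current)
        # A marker surviving in the output suggests Whisper really
        # transcribed the audio instead of hallucinating.
        if has_marker(text, current):
            return mode, text
    # Last resort: plain transcription with no marker to check against.
    return "no_markers", transcribe(audio_path, None)
```

Passing the transcription step as a callable keeps the sketch independent of whether standard Whisper or Faster Whisper is used underneath.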


May be used to produce "accurate transcriptions" for WhisperTimeSync:

May be tested using NeuroSpell Dictaphone:

WhisperHallu and WhisperTimeSync are used to extract vocals and lyrics in karaok-AI:

Google Colab

Standard Whisper:

Faster Whisper:


Upgrade ffmpeg to version 4.4 on Google Colab

! add-apt-repository -y ppa:savoury1/ffmpeg4
! apt-get -qq install -y ffmpeg

!ffmpeg -version

ffmpeg version 4.4.3-0ubuntu1~20.04.sav2 Copyright (c) 2000-2022 the FFmpeg developers
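
To check the version from Python rather than by eye, the first line of the `ffmpeg -version` output can be parsed. The sketch below is an illustration only; the helper names `parse_ffmpeg_version`, `supports_speech_compressor` and `installed_ffmpeg_banner` are hypothetical and not part of this repository, and the regular expression assumes a release-style banner such as the one shown above.

```python
import re
import subprocess
from typing import Optional, Tuple


def parse_ffmpeg_version(banner: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from an 'ffmpeg version X.Y...' banner.

    Returns None for banners without a numeric release version
    (for example some git snapshot builds).
    """
    match = re.search(r"ffmpeg version n?(\d+)\.(\d+)", banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def supports_speech_compressor(banner: str, minimum: Tuple[int, int] = (4, 4)) -> bool:
    """True if the banner reports at least the required ffmpeg version."""
    version = parse_ffmpeg_version(banner)
    return version is not None and version >= minimum


def installed_ffmpeg_banner() -> str:
    """Return the output of `ffmpeg -version` for the ffmpeg found on PATH."""
    result = subprocess.run(
        ["ffmpeg", "-version"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, `supports_speech_compressor(installed_ffmpeg_banner())` can be used to decide whether the upgrade commands above still need to be run.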

Demucs (if used)

pip install -U demucs

Spleeter (if used)

pip install spleeter

Standard Whisper (if used)

sudo apt update && sudo apt install ffmpeg

sudo apt install python3
sudo apt install python3-pip
sudo apt install virtualenv

virtualenv -p python3 ../venvWhisper
. ../venvWhisper/bin/activate

pip install -U openai-whisper

pip3 install torchaudio

Faster Whisper (if used)

sudo apt update && sudo apt install ffmpeg

sudo apt install python3
sudo apt install python3-pip
sudo apt install virtualenv

virtualenv -p python3 ../venvFasterWhisper
. ../venvFasterWhisper/bin/activate

git clone https://github.com/guillaumekln/faster-whisper.git
cd faster-whisper/

pip install -e .[conversion]
pip install -e .

cd ..

ct2-transformers-converter --model openai/whisper-medium --output_dir whisper-medium-ct2 --quantization float16
ct2-transformers-converter --model openai/whisper-large --output_dir whisper-large-ct2 --quantization float16

pip3 install torchaudio


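from transcribeHallu import loadModel
from transcribeHallu import transcribePrompt

##### The audio language may be different from the one for the output transcription.
##### Placeholder values, not taken from this text: set your own file path and input language code.
path = "/path/to/your/sound/file"
lngInput = "en"

##### Set this to True for a music file to get minimal processing.
isMusic = False

##### The prompt needs to be adapted for each language.
##### For prompt examples, see transcribeHallu.py getPrompt(lng:str)
lng = "en"
prompt= "Whisper, Ok. "\
	+"A pertinent sentence for your purpose in your language. "\
	+"Ok, Whisper. Whisper, Ok. "\
	+"Ok, Whisper. Whisper, Ok. "\
	+"Please find here, an unlikely ordinary sentence. "\
	+"This is to avoid a repetition to be deleted. "\
	+"Ok, Whisper. "

##### Model size to use
##### Assumed call: the device argument "0" and the modelSize keyword are not shown in this text; check loadModel in transcribeHallu.py.
modelSize = "medium"
loadModel("0", modelSize=modelSize)

result = transcribePrompt(path=path, lng=lng, prompt=prompt, lngInput=lngInput, isMusic=isMusic)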

This tool is a demonstration of our know-how.
If you are interested in a commercial/industrial AI linguistic project, contact us: