

Minimum Word Error Rate Training for Speech Separation

A minimum word error rate training approach using VoiceFilter and DeepSpeech.


  • Train dataset: 281,241 mixtures
  • Validation dataset: 80 mixtures
  • Test dataset: 5,567 mixtures
  • Batch size: 8
  • Hardware: NVIDIA Tesla V100 PCIe 32 GB


  • Results are based on 80,000 iterations.
  • The test dataset is fully mixed, so both the SDR and the WER of the original mixture are already poor.
  • To compare against the performance of the original VoiceFilter, a partially mixed dataset is needed.
| System            | Mean SDR (dB) | Mean WER (%) |
|-------------------|---------------|--------------|
| Before Separation | 0.10          | 92.7         |
| Ours              | 4.09          | 55.7         |
| VoiceFilter       | 3.99          | 59.2         |
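
For readers unfamiliar with the SDR column, the following is a minimal sketch of a simplified signal-to-distortion ratio, assuming time-aligned reference and estimated signals of equal length. The function name `sdr_db` is hypothetical, and the evaluation code in this repository may use a different definition (for example, one based on BSS Eval), so the numbers above should not be expected to match this formula exactly.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    `reference` and `estimate` are equal-length sequences of samples.
    This is an illustrative definition, not necessarily the one used
    to produce the results table.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")

    # Energy of the clean target signal.
    signal_energy = sum(r * r for r in reference)
    # Energy of the residual error between target and estimate.
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))

    if error_energy == 0:
        return math.inf  # perfect reconstruction
    if signal_energy == 0:
        return -math.inf  # silent reference, nothing to recover
    return 10.0 * math.log10(signal_energy / error_energy)
```

A higher value means the estimate is closer to the clean reference; a value near 0 dB means the error carries about as much energy as the target itself.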

  • The red line is ours; the pink line is VoiceFilter.
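
The WER figures reported above are word error rates. As a minimal sketch of how such a score is conventionally computed, assuming whitespace-tokenized transcripts, the function below divides the word-level edit distance by the number of reference words. The name `word_error_rate` is hypothetical and this is not taken from the repository's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat", "the dog sat")` counts one substitution over three reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.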



  • Download the relevant pretrained model
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.5.0/deepspeech-0.5.0-models.tar.gz
tar xvfz deepspeech-0.5.0-models.tar.gz
  • Install python packages
  pip install -r requirement.txt
  • Run the 8 client processes, each with its own port number (the ports shown run from 8080 to 8087)
  python client.py 8080
  python client.py 8081
  python client.py 8087
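
Starting eight clients by hand is repetitive, so the step above could be scripted. The following is a minimal sketch, assuming the clients use consecutive ports starting at 8080 and that `client.py` is in the current directory; the helper names `build_client_commands` and `launch_clients` are hypothetical and not part of this repository.

```python
import subprocess
import sys


def build_client_commands(first_port=8080, num_clients=8, script="client.py"):
    """Return one argv list per client, using consecutive port numbers."""
    return [
        [sys.executable, script, str(first_port + offset)]
        for offset in range(num_clients)
    ]


def launch_clients(commands):
    """Start every client as a background process and return the handles."""
    return [subprocess.Popen(command) for command in commands]
```

Calling `launch_clients(build_client_commands())` would start the eight processes; keep the returned handles so each client can be stopped with `terminate()` once training is finished.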


Training process

  • After preparing all the required audio data, generate the new ground truth.
# In deepspeech-client
(venv) ./generator.sh [PATH_OF_AUDIO_FILE]
  • Then start training in the same way as described in README.md.


Apache License 2.0

This repository contains code adapted or copied from the following: