Unofficial Keras implementation of Deep Speaker | Paper | Pretrained Models
Model name | Testing dataset | Num speakers | F-measure | TPR | ACC | EER | Training Logs | Download model |
---|---|---|---|---|---|---|---|---|
ResCNN Softmax trained | LibriSpeech all(*) | 2484 | 0.789 | 0.733 | 0.996 | 0.043 | Click | Click |
ResCNN Softmax+Triplet trained | LibriSpeech all(*) | 2484 | 0.843 | 0.825 | 0.997 | 0.025 | Click | Click |
(*) all includes: dev-clean, dev-other, test-clean, test-other, train-clean-100, train-clean-360, train-other-500.
Deep Speaker is a neural speaker embedding system that maps utterances to a hypersphere where speaker similarity is measured by cosine similarity. The embeddings generated by Deep Speaker can be used for many tasks, including speaker identification, verification, and clustering.
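Because the embeddings are L2-normalized onto the unit hypersphere, cosine similarity reduces to a dot product. Here is a minimal numpy sketch with random placeholder vectors (not this repo's `batch_cosine_similarity`):

```python
# Toy illustration: the cosine similarity of two unit-norm 512-d vectors is
# their dot product. Random vectors stand in for real embeddings here.
import numpy as np

a = np.random.rand(512)
b = np.random.rand(512)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)
print(float(np.dot(a, b)))  # in [-1, 1]; higher means "more similar"
```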
- tensorflow>=2.0
- keras>=2.3.1
pip install -r requirements.txt
If you see the error `libsndfile not found`, run: `sudo apt-get install libsndfile-dev`.
The code for training is available in this repository. Training the models takes a bit less than a week on a GTX 1070.
System requirements for a complete training are:
- At least 300GB of free disk space on a fast SSD (250GB just for all the uncompressed + processed data)
- 32GB of RAM and at least 32GB of swap (swap can be created on the SSD).
- An NVIDIA GPU such as the 1080Ti.
pip uninstall -y tensorflow && pip install tensorflow-gpu
./deep-speaker download_librispeech # if the download is too slow, consider replacing [wget] by [axel -n 10 -a] in download_librispeech.sh.
./deep-speaker build_mfcc # will build MFCC for softmax pre-training and triplet training (see the sketch after these commands).
./deep-speaker build_model_inputs # will build inputs for softmax pre-training.
./deep-speaker train_softmax # takes ~3 days.
./deep-speaker train_triplet # takes ~3 days.
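Conceptually, the `build_mfcc` step extracts and caches audio features to disk. The librosa sketch below illustrates the idea only; the repo's own `read_mfcc` implementation and its parameters may differ (the 16 kHz rate matches the `SAMPLE_RATE` used later, while `n_mfcc=64` is an assumption):

```python
# Conceptual sketch of feature extraction with librosa; not the repo's
# read_mfcc (n_mfcc=64 is an assumed parameter, not a repo constant).
import librosa

y, sr = librosa.load('samples/PhilippeRemy/PhilippeRemy_001.wav', sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=64)
print(mfcc.shape)  # (n_mfcc, num_frames)
```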
NOTE: If you want to use your own dataset, make sure you follow the directory structure of LibriSpeech. Audio files have to be in .flac format. If you have .wav files, you can use ffmpeg to convert them (see the sketch below). Both formats are lossless (FLAC is compressed WAV).
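A hypothetical batch converter, assuming ffmpeg is installed and on the PATH (the `wav_to_flac` helper and the `my_dataset/` path are made up for illustration):

```python
# Hypothetical helper, not part of this repository: converts every .wav
# under a directory tree to .flac using ffmpeg (assumed to be on the PATH).
import subprocess
from pathlib import Path

def wav_to_flac(root: str) -> None:
    for wav in Path(root).rglob('*.wav'):
        # ffmpeg infers the codecs from the extensions; -y overwrites outputs.
        subprocess.run(['ffmpeg', '-y', '-i', str(wav), str(wav.with_suffix('.flac'))], check=True)

wav_to_flac('my_dataset/')  # 'my_dataset/' is a placeholder path
```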
- Download the trained models
Model name | Used datasets for training | Num speakers | Model Link |
---|---|---|---|
ResCNN Softmax trained | LibriSpeech train-clean-360 | 921 | Click |
ResCNN Softmax+Triplet trained | LibriSpeech all | 2484 | Click |
- Run with a pretrained model
import random
import numpy as np
from audio import read_mfcc
from batcher import sample_from_mfcc
from constants import SAMPLE_RATE, NUM_FRAMES
from conv_models import DeepSpeakerModel
from test import batch_cosine_similarity
# Reproducible results.
np.random.seed(123)
random.seed(123)
# Define the model here.
model = DeepSpeakerModel()
# Load the checkpoint.
model.m.load_weights('ResCNN_triplet_training_checkpoint_265.h5', by_name=True)
# Sample some inputs from WAV/FLAC files of the same speaker.
mfcc_001 = sample_from_mfcc(read_mfcc('samples/PhilippeRemy/PhilippeRemy_001.wav', SAMPLE_RATE), NUM_FRAMES)
mfcc_002 = sample_from_mfcc(read_mfcc('samples/PhilippeRemy/PhilippeRemy_002.wav', SAMPLE_RATE), NUM_FRAMES)
# Call the model to get the embeddings of shape (1, 512) for each file.
predict_001 = model.m.predict(np.expand_dims(mfcc_001, axis=0))
predict_002 = model.m.predict(np.expand_dims(mfcc_002, axis=0))
# Do it again with a different speaker.
mfcc_003 = sample_from_mfcc(read_mfcc('samples/1255-90413-0001.flac', SAMPLE_RATE), NUM_FRAMES)
predict_003 = model.m.predict(np.expand_dims(mfcc_003, axis=0))
# Compute the cosine similarity and check that it is higher for the same speaker.
print('SAME SPEAKER', batch_cosine_similarity(predict_001, predict_002)) # SAME SPEAKER [0.81564593]
print('DIFF SPEAKER', batch_cosine_similarity(predict_001, predict_003)) # DIFF SPEAKER [0.1419204]
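To turn these scores into a same/different decision, you threshold the similarity. The 0.5 below is purely illustrative, not a value shipped with the repo; a principled threshold would be tuned on held-out pairs (e.g., at the EER operating point):

```python
# Illustrative only: 0.5 is an assumed threshold, not a repo constant.
threshold = 0.5
score = batch_cosine_similarity(predict_001, predict_002)[0]
print('same speaker' if score > threshold else 'different speaker')
```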
- Commands to reproduce the test results after training
$ export CUDA_VISIBLE_DEVICES=0; python cli.py test-model --working_dir ~/.deep-speaker-wd/triplet-training/ --checkpoint_file checkpoints-softmax/ResCNN_checkpoint_102.h5
f-measure = 0.789, true positive rate = 0.733, accuracy = 0.996, equal error rate = 0.043
$ export CUDA_VISIBLE_DEVICES=0; python cli.py test-model --working_dir ~/.deep-speaker-wd/triplet-training/ --checkpoint_file checkpoints-triplets/ResCNN_checkpoint_265.h5
f-measure = 0.849, true positive rate = 0.798, accuracy = 0.997, equal error rate = 0.025
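For reference, the equal error rate reported above is the operating point where the false acceptance rate and the false rejection rate meet. A small numpy sketch of how EER can be estimated from raw similarity scores (not the evaluation code behind cli.py; the sample scores are made up):

```python
# Toy EER estimate from genuine/impostor similarity scores; illustrative
# only, not the metric implementation used by cli.py test-model.
import numpy as np

def estimate_eer(genuine, impostor):
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    frr = np.array([(genuine < t).mean() for t in thresholds])    # rejected genuine pairs
    far = np.array([(impostor >= t).mean() for t in thresholds])  # accepted impostor pairs
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

genuine = np.array([0.82, 0.75, 0.91, 0.40])   # made-up same-speaker scores
impostor = np.array([0.14, 0.55, 0.35, 0.05])  # made-up different-speaker scores
print(f'EER ~ {estimate_eer(genuine, impostor):.3f}')
```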