Scribosermo

Train fast Speech-to-Text networks in different languages. An overview of the approach can be found in the paper Scribosermo: Fast Speech-to-Text models for German and other Languages.

Usage

Note: This repository is focused on training STT-networks, but you can find a short and experimental inference example here.

Requirements are:

Computer with a modern GPU and a working nvidia+docker setup
Basic knowledge of Python and deep learning
A lot of training data in your required language
(preferably >100h for fine-tuning and >1000h for new languages)

General information

File structure will look as follows:

my_speech2text_folder
    checkpoints
    corcua                 <- Library for datasets
    data_original
    data_prepared
    Scribosermo            <- This repository
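
The empty folders of this layout can be created up front. The following is a minimal sketch in Python; the function name `create_workspace` is hypothetical and not part of this repository, and the two repositories (corcua and Scribosermo) still have to be cloned into the same parent folder by hand.

```python
from pathlib import Path


def create_workspace(root: Path) -> list[Path]:
    """Create the data and checkpoint folders of the layout shown above.

    The cloned repositories (corcua, Scribosermo) are not created here.
    """
    created = []
    for name in ("checkpoints", "data_original", "data_prepared"):
        folder = root / name
        # exist_ok makes the call safe to repeat on an existing workspace
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_workspace(Path("my_speech2text_folder"))` creates the three folders below the example root used above.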

Clone corcua:

git clone https://gitlab.com/Jaco-Assistant/corcua.git

Build and run docker container:

docker build -f Scribosermo/Containerfile -t scribosermo ./Scribosermo/

./Scribosermo/run_container.sh

Download and prepare voice data

Follow the readme in the preprocessing directory to prepare the voice data.

Create the language model

Follow the readme in the langmodel directory to generate the language model.

Training

Follow the readme in the training directory to train your network.
For easier inference, follow the exporting readme in the extras/exporting directory.

Datasets and Networks

You can find more details about the currently used datasets here.


Language	DE	EN	ES	FR	IT	PL	Noise
Duration (hours)	2370	982	817	1028	360	169	152
Datasets	37	1	8	7	5	3	3

Implemented networks: DeepSpeech1, DeepSpeech2, QuartzNet, Jasper, ContextNet(simplified), Conformer(simplified), CitriNet

Notes on the networks:

Not every network is fully tested, but each one could be trained on a single audio file.
Some networks might differ slightly from their paper implementations.

Supported networks with their trainable parameter count (using English alphabet):


Network	DeepSpeech1	DeepSpeech2	QuartzNet	Jasper	ContextNetSimple	SimpleConformer	CitriNet
Config			5x5 / 15x5 / +LSTM		0.8	16x240x4	256 / 344 / +LSTM
Params	48.7M	120M	6.7M / 18.9M / 21.5M	323M	21.6M	21.7M	10.9M / 19.3M / 21.6M

Pretrained Checkpoints and Language Models

By default, the checkpoints are provided under the same licence as this repository, but many datasets have additional conditions (for example non-commercial use only) which also apply. The QuartzNet models are dual-licensed with Nvidia's NGC, because they use its pretrained weights. Please check this yourself for the models you want to use.

Mozilla's DeepSpeech:

You can find the old models later on this page, in the old experiments section.
The models below are no longer compatible with the DeepSpeech client!

German:

Quartznet15x5, CV only (WER: 7.5%): Link
Quartznet15x5, D37CV (WER: 6.6%): Link
Scorer: TCV, D37CV, PocoLg

English:

Quartznet5x5 (WER: 4.5%): Link
Quartznet15x5 (WER: 3.7%): Link
ContextNetSimple (WER: 4.9%): Link
Scorer: Link (to DeepSpeech)

Spanish:

Quartznet15x5, CV only (WER: 10.5%): Link
Quartznet15x5, D8CV (WER: 10.0%): Link
Scorer: KenSm, PocoLg

French:

Quartznet15x5, CV only (WER: 12.1%): Link
Quartznet15x5, D7CV (WER: 11.0%): Link
Scorer: KenSm, PocoLg

Italian:

Quartznet15x5, D5CV (WER: 11.5%): Link
Scorer: PocoLg

Citation

Please cite Scribosermo if you found it helpful for your research or business.

@article{
  scribosermo,
  title={Scribosermo: Fast Speech-to-Text models for German and other Languages},
  author={Bermuth, Daniel and Poeppel, Alexander and Reif, Wolfgang},
  journal={arXiv preprint arXiv:2110.07982},
  year={2021}
}

Contribution

You can contribute to this project in multiple ways:

Help to solve the open issues
Implement new networks or augmentation options
Train new models or improve the existing ones
(Requires a gpu and a lot of time, or multiple gpus and some time)
Experiment with the language models
Add a new language:
- Extend data/ directory with the alphabet and langdicts files
- Add speech datasets
- Find text corpora for the language model

Tests

See the readme in the tests directory for testing instructions.

Results

Language	Network	Additional Infos	Performance Results
EN	Quartznet5x5	Results from Nvidia-Nemo, using LS-dev-clean as test dataset	WER greedy: 0.0537
EN	Quartznet5x5	Converted model from Nvidia-Nemo, using LS-dev-clean as test dataset	Loss: 9.7666 CER greedy: 0.0268 CER with lm: 0.0202 WER greedy: 0.0809 WER with lm: 0.0506
EN	Quartznet5x5	Pretrained model from Nvidia-Nemo, one extra epoch on LibriSpeech to reduce the different spectrogram problem	Loss: 7.3253 CER greedy: 0.0202 CER with lm: 0.0163 WER greedy: 0.0654 WER with lm: 0.0446
EN	Quartznet5x5	above, using LS-dev-clean as test dataset (for better comparison with results from Nemo)	Loss: 6.9973 CER greedy: 0.0203 CER with lm: 0.0159 WER greedy: 0.0648 WER with lm: 0.0419

EN	Quartznet15x5	Results from Nvidia-Nemo, using LS-dev-clean as test dataset	WER greedy: 0.0379
EN	Quartznet15x5	Converted model from Nvidia-Nemo, using LS-dev-clean as test dataset	Loss: 5.8044 CER greedy: 0.0160 CER with lm: 0.0130 WER greedy: 0.0515 WER with lm: 0.0355
EN	Quartznet15x5	Pretrained model from Nvidia-Nemo, four extra epochs on LibriSpeech to reduce the different spectrogram problem	Loss: 5.3074 CER greedy: 0.0141 CER with lm: 0.0128 WER greedy: 0.0456 WER with lm: 0.0374
EN	Quartznet15x5	above, using LS-dev-clean as test dataset (for better comparison with results from Nemo)	Loss: 5.1035 CER greedy: 0.0132 CER with lm: 0.0108 WER greedy: 0.0435 WER with lm: 0.0308
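
The tables in this section report word error rate (WER) and character error rate (CER). As a minimal sketch, both can be computed from the edit distance between reference and hypothesis, divided by the reference length. The function names below are illustrative, and whether the repository averages per sample or over the whole test set is not stated here, so treat this as a per-sentence illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or keep a matching token
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of one sentence pair."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of one sentence pair."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For the reference "das ist ein test" and the hypothesis "das ist test", one of four words is missing, so `wer` returns 0.25.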

The following trainings were all done with the pretrained Quartznet15x5 network from the table above.

Language	Datasets	Additional Infos	Performance Results
DE	Tuda	Learning rate 0.0001; Training time on 2x1080Ti was about 16h	Loss: 61.3615 CER greedy: 0.1481 CER with lm: 0.0914 WER greedy: 0.5502 WER with lm: 0.2381
DE	Tuda	Learning rate 0.001	Loss: 59.3143 CER greedy: 0.1329 CER with lm: 0.0917 WER greedy: 0.4956 WER with lm: 0.2448
DE	CommonVoice	Training time on 2x1080Ti was about 70h; Reusing scorer from DeepSpeech-Polyglot trainings	Loss: 11.6188 CER greedy: 0.0528 CER with lm: 0.0319 WER greedy: 0.1853 WER with lm: 0.0774
DE	CommonVoice	Above network; tested on Tuda dataset	Loss: 25.5442 CER greedy: 0.0473 CER with lm: 0.0340 WER greedy: 0.1865 WER with lm: 0.1199
DE	MLS	Learning rate 0.0001; Test on CommonVoice	Loss: 38.5387 CER greedy: 0.1967 CER with lm: 0.1616 WER greedy: 0.5894 WER with lm: 0.2584
DE	MLS + CommonVoice	Above network, continuing training with CommonVoice; Learning rate 0.001; Test on CommonVoice	Loss: 12.3243 CER greedy: 0.0574 CER with lm: 0.0314 WER greedy: 0.2122 WER with lm: 0.0788
DE	D37	Continued training from CV-checkpoint with 0.077 WER; Learning rate 0.001; Test on CommonVoice	Loss: 10.4031 CER greedy: 0.0491 CER with lm: 0.0369 WER greedy: 0.1710 WER with lm: 0.0824
DE	D37	Above network; Test on Tuda	Loss: 16.6407 CER greedy: 0.0355 CER with lm: 0.0309 WER greedy: 0.1530 WER with lm: 0.1113
DE	D37 + CommonVoice	Fine-tuned above network on CommonVoice again; Learning rate 0.001; Test on CommonVoice	Loss: 9.9733 CER greedy: 0.0456 CER with lm: 0.0323 WER greedy: 0.1601 WER with lm: 0.0760
DE	D37 + CommonVoice	Above network; Scorer built with all new training transcriptions; Beam size 1024; Test on CommonVoice	CER with lm: 0.0279 WER with lm: 0.0718
DE	D37 + CommonVoice	Like above; Test on Tuda	Loss: 17.3551 CER greedy: 0.0346 CER with lm: 0.0262 WER greedy: 0.1432 WER with lm: 0.1070

ES	CommonVoice	Frozen with 4 epochs, then full training	Eval-Loss: 29.0722 Test-Loss: 31.2095 CER greedy: 0.1568 CER with lm: 0.1461 WER greedy: 0.5289 WER with lm: 0.3446
ES	CommonVoice	Additional Dropout layers after each block and end convolutions; Continuing above frozen checkpoint	Eval-Loss: 30.5518 Test-Loss: 32.7240 CER greedy: 0.1643 CER with lm: 0.1519 WER greedy: 0.5523 WER with lm: 0.3538
FR	CommonVoice	Frozen with 4 epochs, then full training	Eval-Loss: 26.9454 Test-Loss: 30.6238 CER greedy: 0.1585 CER with lm: 0.1821 WER greedy: 0.4570 WER with lm: 0.4220

ES	CommonVoice	Updated augmentations; Continuing above frozen checkpoint	Eval-Loss: 28.0187
ES	CommonVoice	Like above, but lower augmentation strength	Eval-Loss: 26.5313
ES	CommonVoice	Like above, but higher augmentation strength	Eval-Loss: 34.6475
ES	CommonVoice	Only spectrogram cut/mask augmentations	Eval-Loss: 27.2635
ES	CommonVoice	Random speed and pitch augmentation, only spectrogram cutout	Eval-Loss: 25.9359

ES	CommonVoice	Improved transfer-learning with alphabet extension	Eval-Loss: 13.0415 Test-Loss: 14.8321 CER greedy: 0.0742 CER with lm: 0.0579 WER greedy: 0.2568 WER with lm: 0.1410
ES	CommonVoice	Like above, with esrp-delta=0.1 and lr=0.002; Training for 29 epochs	Eval-Loss: 10.3032 Test-Loss: 12.1623 CER greedy: 0.0533 CER with lm: 0.0460 WER greedy: 0.1713 WER with lm: 0.1149
ES	CommonVoice	Above model; Extended scorer with Europarl+News dataset	CER with lm: 0.0439 WER with lm: 0.1074
ES	CommonVoice	Like above; Beam size 1024 instead of 256	CER with lm: 0.0422 WER with lm: 0.1053
ES	D8	Continued training from above CV-checkpoint; Learning rate reduced to 0.0002; Test on CommonVoice	Eval-Loss: 9.3886 Test-Loss: 11.1205 CER greedy: 0.0529 CER with lm: 0.0456 WER greedy: 0.1690 WER with lm: 0.1075
ES	D8 + CommonVoice	Fine-tuned above network on CommonVoice again; Learning rate 0.0002; Test on CommonVoice	Eval-Loss: 9.6201 Test-Loss: 11.3245 CER greedy: 0.0507 CER with lm: 0.0421 WER greedy: 0.1632 WER with lm: 0.1025
ES	D8 + CommonVoice	Like above; Beam size 1024 instead of 256	CER with lm: 0.0404 WER with lm: 0.1003

FR	CommonVoice	Similar to Spanish CV training above; Training for 26 epochs	Eval-Loss: 10.4081 Test-Loss: 13.6226 CER greedy: 0.0642 CER with lm: 0.0544 WER greedy: 0.1907 WER with lm: 0.1248
FR	CommonVoice	Like above; Beam size 1024 instead of 256	CER with lm: 0.0511 WER with lm: 0.1209
FR	D7	Continued training from above CV-checkpoint	Eval-Loss: 9.8695 Test-Loss: 12.7798 CER greedy: 0.0604 CER with lm: 0.0528 WER greedy: 0.1790 WER with lm: 0.1208
FR	D7 + CommonVoice	Fine-tuned above network on CommonVoice again	Eval-Loss: 9.8874 Test-Loss: 12.9053 CER greedy: 0.0613 CER with lm: 0.0536 WER greedy: 0.1811 WER with lm: 0.1208
FR	D7 + CommonVoice	Like above; Beam size 1024 instead of 256	CER with lm: 0.0501 WER with lm: 0.1167

DE	CommonVoice	Updated augmentations/pipeline to above Spanish training; Reduced esrp-delta 1.1 -> 0.1 compared to last DE run	Eval-Loss: 10.3421 Test-Loss: 11.4755 CER greedy: 0.0512 CER with lm: 0.0330 WER greedy: 0.1749 WER with lm: 0.0790
DE	D37	Continued training from above CV-checkpoint; Learning rate reduced by factor 10; Test on CommonVoice	Eval-Loss: 9.6293 Test-Loss: 10.7855 CER greedy: 0.0503 CER with lm: 0.0347 WER greedy: 0.1705 WER with lm: 0.0793
DE	D37 + CommonVoice	Fine-tuned above network on CommonVoice again; Test on CommonVoice	Eval-Loss: 9.3287 Test-Loss: 10.4325 CER greedy: 0.0468 CER with lm: 0.0309 WER greedy: 0.1599 WER with lm: 0.0741

Running some experiments with different language models:

Language	Datasets	Additional Infos	Performance Results
DE	D37 + CommonVoice	Use PocoLM instead of KenLM (similar LM size); Checkpoint from D37+CV training with WER=0.0718; Test on CommonVoice	CER with lm: 0.0285 WER with lm: 0.0701
DE	D37 + CommonVoice	Like above; Test on Tuda	CER with lm: 0.0265 WER with lm: 0.1037
DE	D37 + CommonVoice	Use unpruned language model (1.5GB instead of 250MB); Rest similar to above; Test on CommonVoice	CER with lm: 0.0276 WER with lm: 0.0673
DE	D37 + CommonVoice	Like above; Test on Tuda	CER with lm: 0.0261 WER with lm: 0.1026
DE	D37 + CommonVoice	Use pruned language model with similar size to English model (850MB); Rest similar to above; Test on CommonVoice	CER with lm: 0.0277 WER with lm: 0.0672
DE	D37 + CommonVoice	Like above; Test on Tuda	CER with lm: 0.0260 WER with lm: 0.1024
DE	D37 + CommonVoice	Checkpoint from D37+CV training with WER=0.0741; with large (850MB) scorer; Test on CommonVoice	CER with lm: 0.0299 WER with lm: 0.0712
DE	D37 + CommonVoice	Like above; Test on Tuda; Small and full scorers were behind above model with both testsets, too	CER with lm: 0.0280 WER with lm: 0.1066
DE	CommonVoice	Test above checkpoint from CV training with WER=0.0774 with PocoLM large	Test-Loss: 11.6184 CER greedy: 0.0528 CER with lm: 0.0312 WER greedy: 0.1853 WER with lm: 0.0748

ES	D8 + CommonVoice	Use PocoLM instead of KenLM (similar LM size); Checkpoint from D8+CV training with WER=0.1003; Test on CommonVoice	CER with lm: 0.0407 WER with lm: 0.1011
ES	D8 + CommonVoice	Like above; Large scorer (790MB)	CER with lm: 0.0402 WER with lm: 0.1002
ES	D8 + CommonVoice	Like above; Full scorer (1.2GB)	CER with lm: 0.0403 WER with lm: 0.1000

Experimenting with new architectures on LibriSpeech dataset:

Network	Additional Infos	Performance Results
ContextNetSimple (0.8)	Run with multiple full restarts, ~3:30h/epoch; increased LR from 0.001 to 0.01 since iteration 5; set LR to 0.02 in It8a and 0.005 in It8b	Eval-Loss-1: 64.2793 Eval-Loss-2: 24.5743 Eval-Loss-3: 19.4896 Eval-Loss-4: 18.4973 Eval-Loss-5: 9.3007 Eval-Loss-6: 8.1340 Eval-Loss-7: 7.5170 Eval-Loss-8a: 7.6870 Eval-Loss-8b: 7.2683
ContextNetSimple (0.8)	Test above checkpoint after iteration 8b	Test-Loss: 7.7407 CER greedy: 0.0237 CER with lm: 0.0179 WER greedy: 0.0767 WER with lm: 0.0492
SimpleConformer (16x240x4)	Completed training after 28 epochs (~3:20h/epoch), without any augmentations	Eval-Loss: 70.6178
Citrinet (344)	Completed training after 6 epochs (~4h/epoch), didn't learn anything	Eval-Loss: 289.7605
QuartzNet (15x5)	Continued old checkpoint	Eval-Loss: 5.0922 Test-Loss: 5.3353 CER greedy: 0.0139 CER with lm: 0.0124 WER greedy: 0.0457 WER with lm: 0.0368
QuartzNet (15x5+LSTM)	Frozen+Full training onto old checkpoint	Eval-Loss: 4.9105 Test-Loss: 5.3112 CER greedy: 0.0143 CER with lm: 0.0125 WER greedy: 0.0477 WER with lm: 0.0370

Tests with reduced dataset size and with multiple restarts:

Language	Datasets	Additional Infos	Performance Results
DE	CV short (314h)	Test with PocoLM large; about 18h on 2xNvidia-V100; Iteration 1	Eval-Loss: 12.5308 Test-Loss: 13.9343 CER greedy: 0.0654 CER with lm: 0.0347 WER greedy: 0.2391 WER with lm: 0.0834
DE	CV short (314h)	Iteration 2; about 22h	Eval-Loss: 11.3072 Test-Loss: 12.6970 CER greedy: 0.0556 CER with lm: 0.0315 WER greedy: 0.1986 WER with lm: 0.0776
DE	CV short (314h)	Iteration 3; about 13h	Eval-Loss: 11.2485 Test-Loss: 12.5631 CER greedy: 0.0532 CER with lm: 0.0309 WER greedy: 0.1885 WER with lm: 0.0766
DE	CV short (314h)	Test Iteration 3 on Tuda	Test-Loss: 27.5804 CER greedy: 0.0478 CER with lm: 0.0326 WER greedy: 0.1913 WER with lm: 0.1166
DE	D37 + CommonVoice	Additional training iteration on CV using checkpoint from D37+CV training with WER=0.0718	Eval-Loss: 8.7156 Test-Loss: 9.8192 CER greedy: 0.0443 CER with lm: 0.0268 WER greedy: 0.1544 WER with lm: 0.0664
DE	D37 + CommonVoice	Above, test on Tuda	Test-Loss: 19.2681 CER greedy: 0.0358 CER with lm: 0.0270 WER greedy: 0.1454 WER with lm: 0.1023

ES	CV short (203h)	A fifth iteration with lr=0.01 did not converge; about 19h on 2xNvidia-V100 for first iteration	Eval-Loss-1: 10.8212 Eval-Loss-2: 10.7791 Eval-Loss-3: 10.7649 Eval-Loss-4: 10.7918
ES	CV short (203h)	Above, test of first iteration	Test-Loss: 12.5954 CER greedy: 0.0591 CER with lm: 0.0443 WER greedy: 0.1959 WER with lm: 0.1105
ES	CV short (203h)	Above, test of third iteration	Test-Loss: 12.6006 CER greedy: 0.0572 CER with lm: 0.0436 WER greedy: 0.1884 WER with lm: 0.1093
ES	D8 + CommonVoice	Additional training iterations on CV using checkpoint from D8+CV training with WER=0.1003; test of first iteration with PocoLM large; a second iteration with lr=0.01 did not converge	Eval-Loss-1: 9.5202 Eval-Loss-2: 9.6056 Test-Loss: 11.2326 CER greedy: 0.0501 CER with lm: 0.0398 WER greedy: 0.1606 WER with lm: 0.0995
ES	CV short (203h)	Two step frozen training, about 13h+18h	Eval-Loss-1: 61.5673 Eval-Loss-2: 10.9956 Test-Loss: 12.7028 CER greedy: 0.0604 CER with lm: 0.0451 WER greedy: 0.2015 WER with lm: 0.1111
ES	CV short (203h)	Single step with last layer reinitialization, about 18h	Eval-Loss: 11.6488 Test-Loss: 13.4355 CER greedy: 0.0643 CER with lm: 0.0478 WER greedy: 0.2163 WER with lm: 0.1166

FR	CV short (364h)	The fourth iteration with lr=0.01; about 25h on 2xNvidia-V100 for first iteration; test of third iteration	Eval-Loss-1: 12.6529 Eval-Loss-2: 11.7833 Eval-Loss-3: 11.7141 Eval-Loss-4: 12.6193
FR	CV short (364h)	Above, test of third iteration	Test-Loss: 14.8373 CER greedy: 0.0711 CER with lm: 0.0530 WER greedy: 0.2142 WER with lm: 0.1248
FR	D7 + CommonVoice	Additional training iterations on CV using checkpoint from D7+CV training with WER=0.1167; test of first iteration with PocoLM large; a second iteration with lr=0.01 did not converge	Eval-Loss-1: 9.5452 Eval-Loss-2: 9.5860 Test-Loss: 12.5477 CER greedy: 0.0590 CER with lm: 0.0466 WER greedy: 0.1747 WER with lm: 0.1104

IT	CommonVoice	Transfer from English	Eval-Loss: 12.7120 Test-Loss: 14.4017 CER greedy: 0.0710 CER with lm: 0.0465 WER greedy: 0.2766 WER with lm: 0.1378
IT	CommonVoice	Transfer from Spanish with alphabet shrinking	Eval-Loss: 10.7151 Test-Loss: 12.3298 CER greedy: 0.0585 CER with lm: 0.0408 WER greedy: 0.2208 WER with lm: 0.1216
IT	D5 + CommonVoice	Continuing above from Spanish	Eval-Loss: 9.3055 Test-Loss: 10.8521 CER greedy: 0.0543 CER with lm: 0.0403 WER greedy: 0.2000 WER with lm: 0.1170
IT	D5 + CommonVoice	Fine-tuned above checkpoint on CommonVoice again (lr=0.0001)	Eval-Loss: 9.3318 Test-Loss: 10.8453 CER greedy: 0.0533 CER with lm: 0.0395 WER greedy: 0.1967 WER with lm: 0.1153

Old experiments

The following experiments were run with an old version of this repository, then named DeepSpeech-Polyglot, using the DeepSpeech1 network from Mozilla-DeepSpeech.
While they are outdated, some of them might still provide helpful information for training the new networks.

Old checkpoints and scorers:

German (D17S5 training and some older checkpoints, WER: 0.128, Train: ~1582h, Test: ~41h): Link
Spanish (CCLMTV training, WER: 0.165, Train: ~660h, Test: ~25h): Link
French (CCLMTV training, WER: 0.195, Train: ~787h, Test: ~25h): Link
Italian (CLMV training, WER: 0.248, Train: ~257h, Test: ~21h): Link
Polish (CLM training, WER: 0.034, Train: ~157h, Test: ~6h): Link

First experiments:
(Default dropout is 0.4, learning rate 0.0005):

Dataset	Additional Infos	Performance Results
Voxforge		WER: 0.676611 CER: 0.403916 loss: 82.185226
Voxforge	with augmentation	WER: 0.624573 CER: 0.348618 loss: 74.403786
Voxforge	without "äöü"	WER: 0.646702 CER: 0.364471 loss: 82.567413
Voxforge	cleaned data, without "äöü"	WER: 0.634828 CER: 0.353037 loss: 81.905258
Voxforge	above checkpoint, tested on not cleaned data	WER: 0.634556 CER: 0.352879 loss: 81.849220
Voxforge	checkpoint from english deepspeech, without "äöü"	WER: 0.394064 CER: 0.190184 loss: 49.066357
Voxforge	checkpoint from english deepspeech, with augmentation, without "äöü", dropout 0.25, learning rate 0.0001	WER: 0.338685 CER: 0.150972 loss: 42.031754
Voxforge	reduce learning rate on plateau, with noise and standard augmentation, checkpoint from english deepspeech, cleaned data, without "äöü", dropout 0.25, learning rate 0.0001, batch size 48	WER: 0.320507 CER: 0.131948 loss: 39.923031
Voxforge	above with learning rate 0.00001	WER: 0.350903 CER: 0.147837 loss: 43.451263
Voxforge	above with learning rate 0.001	WER: 0.518670 CER: 0.252510 loss: 62.927200
Tuda + Voxforge	without "äöü", checkpoint from english deepspeech, cleaned train and dev data	WER: 0.740130 CER: 0.462036 loss: 156.115921
Tuda + Voxforge	first Tuda then Voxforge, without "äöü", cleaned train and dev data, dropout 0.25, learning rate 0.0001	WER: 0.653841 CER: 0.384577 loss: 159.509476
Tuda + Voxforge + SWC + Mailabs + CommonVoice	checkpoint from english deepspeech, with augmentation, without "äöü", cleaned data, dropout 0.25, learning rate 0.0001	WER: 0.306061 CER: 0.151266 loss: 33.218510

Some results with some older code version:
(Default values: batch size 12, dropout 0.25, learning rate 0.0001, without "äöü", cleaned data, checkpoint from english deepspeech, early stopping, reduce learning rate on plateau, evaluation with scorer and top-500k words)

Dataset	Additional Infos	Losses	Training epochs of best model	Performance Results
Tuda + Voxforge + SWC + Mailabs + CommonVoice	test only with Tuda + CommonVoice others completely for training, language model with training transcriptions, with augmentation	Test: 29.363405 Validation: 23.509546	55	WER: 0.190189 CER: 0.091737
Tuda + Voxforge + SWC + Mailabs + CommonVoice	above checkpoint tested with 3-gram language model	Test: 29.363405		WER: 0.199709 CER: 0.095318
Tuda + Voxforge + SWC + Mailabs + CommonVoice	above checkpoint tested on Tuda only	Test: 87.074394		WER: 0.378379 CER: 0.167380
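
The defaults above mention a scorer restricted to the top-500k words. As a rough sketch of what such a vocabulary restriction could look like, the snippet below counts words in a text corpus and keeps the most frequent ones. The function name and its parameters are hypothetical; the actual scorer generation is described in the langmodel readme and may differ.

```python
from collections import Counter


def build_vocabulary(sentences, top_k=None, min_count=1):
    """Return the most frequent words of a corpus, most frequent first.

    top_k:     keep at most this many words (None keeps all)
    min_count: drop words that occur less often than this
    """
    counts = Counter(word for sentence in sentences for word in sentence.split())
    frequent = [word for word, count in counts.most_common() if count >= min_count]
    return frequent[:top_k]
```

With `top_k=500000` this would correspond to a "top-500k words" vocabulary, while `min_count=2` keeps only words occurring at least twice.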

Some results with some older code version:
(Default values: batch size 36, dropout 0.25, learning rate 0.0001, without "äöü", cleaned data, checkpoint from english deepspeech, early stopping, reduce learning rate on plateau, evaluation with scorer and top-500k words, data augmentation)

Dataset	Additional Infos	Losses	Training epochs of best model	Performance Results
Voxforge	training from scratch	Test: 79.124008 Validation: 81.982976	29	WER: 0.603879 CER: 0.298139
Voxforge		Test: 44.312195 Validation: 47.915317	21	WER: 0.343973 CER: 0.140119
Voxforge	without reduce learning rate on plateau	Test: 46.160049 Validation: 48.926518	13	WER: 0.367125 CER: 0.163931
Voxforge	dropped last layer	Test: 49.844028 Validation: 52.722362	21	WER: 0.389327 CER: 0.170563
Voxforge	5 cycled training	Test: 42.973358		WER: 0.353841 CER: 0.158554

Tuda	training from scratch, correct train/dev/test splitting	Test: 149.653427 Validation: 137.645307	9	WER: 0.606629 CER: 0.296630
Tuda	correct train/dev/test splitting	Test: 103.179092 Validation: 132.243965	3	WER: 0.436074 CER: 0.208135
Tuda	dropped last layer, correct train/dev/test splitting	Test: 107.047821 Validation: 101.219325	6	WER: 0.431361 CER: 0.195361
Tuda	dropped last two layers, correct train/dev/test splitting	Test: 110.523621 Validation: 103.844562	5	WER: 0.442421 CER: 0.204504
Tuda	checkpoint from Voxforge with WER 0.344, correct train/dev/test splitting	Test: 100.846367 Validation: 95.410456	3	WER: 0.416950 CER: 0.198177
Tuda	10 cycled training, checkpoint from Voxforge with WER 0.344, correct train/dev/test splitting	Test: 98.007607		WER: 0.410520 CER: 0.194091
Tuda	random dataset splitting, checkpoint from Voxforge with WER 0.344 Important Note: These results are not meaningful, because same transcriptions can occur in train and test set, only recorded with different microphones	Test: 23.322618 Validation: 23.094230	27	WER: 0.090285 CER: 0.036212

CommonVoice	checkpoint from Tuda with WER 0.417	Test: 24.688297 Validation: 17.460029	35	WER: 0.217124 CER: 0.085427
CommonVoice	above tested with reduced testset where transcripts occurring in trainset were removed	Test: 33.376812		WER: 0.211668 CER: 0.079157
CommonVoice + GoogleWavenet	above tested with GoogleWavenet	Test: 17.653290		WER: 0.035807 CER: 0.007342
CommonVoice	checkpoint from Voxforge with WER 0.344	Test: 23.460932 Validation: 16.641201	35	WER: 0.215584 CER: 0.084932
CommonVoice	dropped last layer	Test: 24.480028 Validation: 17.505738	36	WER: 0.220435 CER: 0.086921

Tuda + GoogleWavenet	added GoogleWavenet to train data, dev/test from Tuda, checkpoint from Voxforge with WER 0.344	Test: 95.555939 Validation: 90.392490	3	WER: 0.390291 CER: 0.178549
Tuda + GoogleWavenet	GoogleWavenet as train data, dev/test from Tuda	Test: 346.486420 Validation: 326.615474	0	WER: 0.865683 CER: 0.517528
Tuda + GoogleWavenet	GoogleWavenet as train/dev data, test from Tuda	Test: 477.049591 Validation: 3.320163	23	WER: 0.923973 CER: 0.601015
Tuda + GoogleWavenet	above checkpoint tested with GoogleWavenet	Test: 3.406022		WER: 0.012919 CER: 0.001724
Tuda + GoogleWavenet	checkpoint from english deepspeech tested with Tuda	Test: 402.102661		WER: 0.985554 CER: 0.752787
Voxforge + GoogleWavenet	added all of GoogleWavenet to train data, dev/test from Voxforge	Test: 45.643063 Validation: 49.620488	28	WER: 0.349552 CER: 0.143108
CommonVoice + GoogleWavenet	added all of GoogleWavenet to train data, dev/test from CommonVoice	Test: 25.029057 Validation: 17.511973	35	WER: 0.214689 CER: 0.084206
CommonVoice + GoogleWavenet	above tested with reduced testset	Test: 34.191067		WER: 0.213164 CER: 0.079121
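
The note on random Tuda splitting above points at a general pitfall: if the same sentence is recorded several times, a random split can put identical transcriptions into both train and test. One possible way to avoid this, sketched below, is to derive the split from a hash of the normalized transcription, so that all recordings of one sentence land in the same split. This is an illustrative assumption, not the splitting used in these experiments (Tuda ships its own official split), and the function names are hypothetical.

```python
import hashlib


def assign_split(transcription: str, dev_percent: int = 10, test_percent: int = 10) -> str:
    """Map a transcription to a split; identical texts always share one split."""
    # Normalize case and whitespace so trivial variants hash identically
    normalized = " ".join(transcription.lower().split())
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_percent:
        return "test"
    if bucket < test_percent + dev_percent:
        return "dev"
    return "train"


def split_samples(samples):
    """Group (audio_path, transcription) pairs into train/dev/test lists."""
    splits = {"train": [], "dev": [], "test": []}
    for audio_path, transcription in samples:
        splits[assign_split(transcription)].append((audio_path, transcription))
    return splits
```

Because the assignment depends only on the text, the split is reproducible across runs, at the cost of split sizes that only approximate the requested percentages.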

Updated to DeepSpeech v0.7.3 and new english checkpoint:
(Default values: See flags.txt in releases, scorer with kaldi-tuda sentences only) (Testing with noise and speech overlay is done with older noiseaugmaster branch, which implemented this functionality)

Dataset	Additional Infos	Losses	Training epochs of best model	Performance Results
Voxforge		Test: 32.844025 Validation: 36.912005	14	WER: 0.240091 CER: 0.087971
Voxforge	without freq_and_time_masking augmentation	Test: 33.698494 Validation: 38.071722	10	WER: 0.244600 CER: 0.094577
Voxforge	using new audio augmentation options	Test: 29.280865 Validation: 33.294815	21	WER: 0.220538 CER: 0.079463

Voxforge	updated augmentations again	Test: 28.846869 Validation: 32.680268	16	WER: 0.225360 CER: 0.083504
Voxforge	test above with older noiseaugmaster branch	Test: 28.831675		WER: 0.238961 CER: 0.081555
Voxforge	test with speech overlay	Test: 89.661995		WER: 0.570903 CER: 0.301745
Voxforge	test with noise overlay	Test: 53.461609		WER: 0.438126 CER: 0.213890
Voxforge	test with speech and noise overlay	Test: 79.736122		WER: 0.581259 CER: 0.310365
Voxforge	second test with speech and noise to check random influence	Test: 81.241333		WER: 0.595410 CER: 0.319077

Voxforge	add speech overlay augmentation	Test: 28.843914 Validation: 32.341234	27	WER: 0.222024 CER: 0.083036
Voxforge	change snr=50:20~~9m to snr=30:15~~9	Test: 28.502413 Validation: 32.236247	28	WER: 0.226005 CER: 0.085475
Voxforge	test above with older noiseaugmaster branch	Test: 28.488537		WER: 0.239530 CER: 0.083855
Voxforge	test with speech overlay	Test: 47.783081		WER: 0.383612 CER: 0.175735
Voxforge	test with noise overlay	Test: 51.682060		WER: 0.428566 CER: 0.209789
Voxforge	test with speech and noise overlay	Test: 60.275940		WER: 0.487709 CER: 0.255167

Voxforge	add noise overlay augmentation	Test: 27.940659 Validation: 31.988175	28	WER: 0.219143 CER: 0.076050
Voxforge	change snr=50:20~~6 to snr=24:12~~6	Test: 26.588453 Validation: 31.151855	34	WER: 0.206141 CER: 0.072018
Voxforge	change to snr=18:9~6	Test: 26.311581 Validation: 30.531299	30	WER: 0.211865 CER: 0.074281
Voxforge	test above with older noiseaugmaster branch	Test: 26.300938		WER: 0.227466 CER: 0.073827
Voxforge	test with speech overlay	Test: 76.401451		WER: 0.499962 CER: 0.254203
Voxforge	test with noise overlay	Test: 44.011471		WER: 0.376783 CER: 0.165329
Voxforge	test with speech and noise overlay	Test: 65.408264		WER: 0.496168 CER: 0.246516

Voxforge	speech and noise overlay	Test: 27.101889 Validation: 31.407527	44	WER: 0.220243 CER: 0.082179
Voxforge	test above with older noiseaugmaster branch	Test: 27.087360		WER: 0.232094 CER: 0.080319
Voxforge	test with speech overlay	Test: 46.012951		WER: 0.362291 CER: 0.164134
Voxforge	test with noise overlay	Test: 44.035809		WER: 0.377276 CER: 0.171528
Voxforge	test with speech and noise overlay	Test: 53.832214		WER: 0.441768 CER: 0.218798

Tuda + Voxforge + SWC + Mailabs + CommonVoice	test with Voxforge + Tuda + CommonVoice others completely for training, with noise and speech overlay	Test: 22.055849 Validation: 17.613633	46	WER: 0.208809 CER: 0.087215
Tuda + Voxforge + SWC + Mailabs + CommonVoice	above tested on Voxforge devdata	Test: 16.395443		WER: 0.163827 CER: 0.056596
Tuda + Voxforge + SWC + Mailabs + CommonVoice	optimized scorer alpha and beta on Voxforge devdata	Test: 16.395443		WER: 0.162842
Tuda + Voxforge + SWC + Mailabs + CommonVoice	test with Voxforge + Tuda + CommonVoice, optimized scorer alpha and beta	Test: 22.055849		WER: 0.206960 CER: 0.086306
Tuda + Voxforge + SWC + Mailabs + CommonVoice	scorer (kaldi-tuda) with train transcriptions, optimized scorer alpha and beta	Test: 22.055849		WER: 0.134221 CER: 0.064267
Tuda + Voxforge + SWC + Mailabs + CommonVoice	scorer only out of train transcriptions, optimized scorer alpha and beta	Test: 22.055849		WER: 0.142880 CER: 0.064958
Tuda + Voxforge + SWC + Mailabs + CommonVoice	scorer (kaldi-tuda + europarl + news) with train transcriptions, optimized scorer alpha and beta	Test: 22.055849		WER: 0.135759 CER: 0.064773
Tuda + Voxforge + SWC + Mailabs + CommonVoice	above scorer with 1m instead of 500k top words, optimized scorer alpha and beta	Test: 22.055849		WER: 0.136650 CER: 0.066470
Tuda + Voxforge + SWC + Mailabs + CommonVoice	test with Tuda only	Test: 54.977085		WER: 0.250665 CER: 0.103428

Voxforge FR	speech and noise overlay	Test: 5.341695 Validation: 12.736551	49	WER: 0.175954 CER: 0.045416
CommonVoice + Css10 + Mailabs + Tatoeba + Voxforge FR	test with Voxforge + CommonVoice others completely for training, with speech and noise overlay	Test: 20.404339 Validation: 21.920289	62	WER: 0.302113 CER: 0.121300
CommonVoice + Css10 + Mailabs + Tatoeba + Voxforge ES	test with Voxforge + CommonVoice others completely for training, with speech and noise overlay	Test: 14.521997 Validation: 22.408368	51	WER: 0.154061 CER: 0.055357
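
Several rows above use noise or speech overlay with a given signal-to-noise ratio (snr). The following is a minimal sketch of mixing a noise signal into speech at a fixed SNR in dB, using plain Python lists of samples. It does not reproduce the range syntax of the augmentation flags, and the function name `mix_at_snr` is hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio is snr_db."""
    if not speech:
        raise ValueError("speech must not be empty")
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        # Silent noise cannot be scaled to any SNR; return the speech unchanged
        return list(speech)
    # Solve speech_power / (gain^2 * noise_power) = 10^(snr_db / 10) for gain
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

A lower `snr_db` means louder noise relative to the speech, which matches the observation above that stronger overlays make the test sets harder.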

Using new CommonVoice v5 releases:
(Default values: See flags.txt in released checkpoints; using correct instead of random splits of CommonVoice; Old german scorer alpha+beta for all)

Language	Dataset	Additional Infos	Losses	Training epochs of best model	Performance Results
DE	CommonVoice + CssTen + LinguaLibre + Mailabs + SWC + Tatoeba + Tuda + Voxforge + ZamiaSpeech	test with CommonVoice + Tuda + Voxforge, others completely for training; with speech and noise overlay; top-488538 scorer (words occurring at least five times)	Test: 29.286192 Validation: 26.864552	30	WER: 0.182088 CER: 0.081321
DE	CommonVoice + CssTen + LinguaLibre + Mailabs + SWC + Tatoeba + Tuda + Voxforge + ZamiaSpeech	like above, but using each file 10x with different augmentations	Test: 25.694464 Validation: 23.128045	16	WER: 0.166629 CER: 0.071999
DE	CommonVoice + CssTen + LinguaLibre + Mailabs + SWC + Tatoeba + Tuda + Voxforge + ZamiaSpeech	above checkpoint, tested on Tuda only	Test: 57.932476		WER: 0.260319 CER: 0.109301
DE	CommonVoice + CssTen + LinguaLibre + Mailabs + SWC + Tatoeba + Tuda + Voxforge + ZamiaSpeech	optimized scorer alpha+beta	Test: 25.694464		WER: 0.166330 CER: 0.070268
ES	CommonVoice + CssTen + LinguaLibre + Mailabs + Tatoeba + Voxforge	test with CommonVoice, others completely for training; with speech and noise overlay; top-303450 scorer (words occurring at least twice)	Test: 25.443010 Validation: 22.686161	42	WER: 0.193316 CER: 0.093000
ES	CommonVoice + CssTen + LinguaLibre + Mailabs + Tatoeba + Voxforge	optimized scorer alpha+beta	Test: 25.443010		WER: 0.187535 CER: 0.083490
FR	CommonVoice + CssTen + LinguaLibre + Mailabs + Tatoeba + Voxforge	test with CommonVoice, others completely for training; with speech and noise overlay; top-316458 scorer (words occurring at least twice)	Test: 29.761099 Validation: 24.691544	52	WER: 0.231981 CER: 0.116503
FR	CommonVoice + CssTen + LinguaLibre + Mailabs + Tatoeba + Voxforge	optimized scorer alpha+beta	Test: 29.761099		WER: 0.228851 CER: 0.109247
IT	CommonVoice + LinguaLibre + Mailabs + Voxforge	test with CommonVoice, others completely for training; with speech and noise overlay; top-51216 scorer out of train transcriptions (words occurring at least twice)	Test: 25.536196 Validation: 23.048596	46	WER: 0.249197 CER: 0.093717
IT	CommonVoice + LinguaLibre + Mailabs + Voxforge	optimized scorer alpha+beta	Test: 25.536196		WER: 0.247785 CER: 0.096247
PL	CommonVoice + LinguaLibre + Mailabs	test with CommonVoice, others completely for training; with speech and noise overlay; top-39175 scorer out of train transcriptions (words occurring at least twice)	Test: 14.902746 Validation: 15.508280	53	WER: 0.040128 CER: 0.022947
PL	CommonVoice + LinguaLibre + Mailabs	optimized scorer alpha+beta	Test: 14.902746		WER: 0.034115 CER: 0.020230
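
The rows above repeatedly mention optimizing the scorer's alpha and beta. In DeepSpeech-style decoding, alpha is commonly described as the language model weight and beta as a word insertion bonus. The sketch below shows how such a combined score could rank candidate transcriptions; the scoring formula is a simplified assumption, the function names are hypothetical, and the numbers in the usage note are made up for illustration.

```python
def combined_score(acoustic_logprob, lm_logprob, word_count, alpha, beta):
    """Acoustic score plus weighted language model score plus word bonus."""
    return acoustic_logprob + alpha * lm_logprob + beta * word_count


def best_hypothesis(candidates, alpha, beta):
    """Pick the best text from (text, acoustic_logprob, lm_logprob) tuples."""
    best = max(
        candidates,
        key=lambda c: combined_score(c[1], c[2], len(c[0].split()), alpha, beta),
    )
    return best[0]
```

With the made-up candidates `("ich habe hunger", -4.0, -6.0)` and `("ich abe hunger", -3.5, -12.0)`, the second one wins for `alpha=0`, while `alpha=0.5` lets the language model pull the decision towards the first. Tuning alpha and beta on a dev set searches for the weighting that minimizes WER.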

Dataset	Additional Infos	Losses	Training epochs of best model	Total training duration	WER
Voxforge	updated rlrop; frozen transfer-learning; no augmentation; es_min_delta=0.9	Test: 37.707958 Validation: 41.832220	12 + 3	42 min
Voxforge	like above; without frozen transfer-learning;	Test: 36.630890 Validation: 41.208125	7	28 min
Voxforge	dropped last layer	Test: 42.516270 Validation: 47.105518	8	28 min
Voxforge	dropped last layer; with frozen transfer-learning in two steps	Test: 36.600590 Validation: 40.640134	14 + 8	42 min
Voxforge	updated rlrop; with augmentation; es_min_delta=0.9	Test: 35.540062 Validation: 39.974685	6	46 min
Voxforge	updated rlrop; with old augmentations; es_min_delta=0.1	Test: 30.655203 Validation: 33.655750	9	48 min
TerraX + Voxforge + YKollektiv	Voxforge only for dev+test but not in train; rest like above	Test: 32.936977 Validation: 36.828410	19	4:53 h
Voxforge	layer normalization; updated rlrop; with old augmentations; es_min_delta=0.1	Test: 57.330410 Validation: 61.025009	45	2:37 h
Voxforge	dropout=0.3; updated rlrop; with old augmentations; es_min_delta=0.1	Test: 30.353968 Validation: 33.144178	20	1:15 h
Voxforge	es_min_delta=0.05; updated rlrop; with old augmentations	Test: 29.884317 Validation: 32.944382	12	54 min
Voxforge	fixed updated rlrop; es_min_delta=0.05; with old augmentations	Test: 28.903509 Validation: 32.322064	34	1:40 h
Voxforge	from scratch; no augmentations; fixed updated rlrop; es_min_delta=0.05	Test: 74.347054 Validation: 79.838900	28	1:26 h	0.38
Voxforge	wav2letter; stopped by hand after one/two overnight runs; from scratch; no augmentations; single gpu;		18/37	16/33 h	0.61/0.61
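
The table above varies es_min_delta and the rlrop behaviour (presumably "reduce learning rate on plateau", as named in the earlier default values). As a hedged sketch of how these two mechanisms typically interact, the function below replays a list of per-epoch evaluation losses. Apart from `es_min_delta`, all parameter names and default values are assumptions and do not reflect the flags of the old repository.

```python
def replay_training(eval_losses, es_min_delta=0.1, es_patience=3,
                    rlrop_patience=2, rlrop_factor=0.1, learning_rate=0.001):
    """Return (last_epoch, best_epoch, final_learning_rate) for the given losses.

    An epoch only counts as an improvement if the loss drops by more than
    es_min_delta below the best loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    stale_epochs = 0
    for epoch, loss in enumerate(eval_losses):
        if best_loss - loss > es_min_delta:
            best_loss, best_epoch = loss, epoch
            stale_epochs = 0
            continue
        stale_epochs += 1
        if stale_epochs % rlrop_patience == 0:
            # Plateau: reduce the learning rate
            learning_rate *= rlrop_factor
        if stale_epochs >= es_patience:
            # Early stopping: no sufficient improvement for too long
            return epoch, best_epoch, learning_rate
    return len(eval_losses) - 1, best_epoch, learning_rate
```

A larger `es_min_delta` stops training earlier because small improvements no longer reset the counter, which is in line with the shorter runs for es_min_delta=0.9 compared to 0.05 in the table above.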

Language	Datasets	Additional Infos	Training epochs of best model Total training duration	Losses Performance Results
DE	BasFormtask + BasSprecherinnen + CommonVoice + CssTen + Gothic + LinguaLibre + Kurzgesagt + Mailabs + MussteWissen + PulsReportage + SWC + Tatoeba + TerraX + Tuda + Voxforge + YKollektiv + ZamiaSpeech + 5x CV-SingleWords (D17S5)	test with Voxforge + Tuda + CommonVoice others completely for training; files 10x with different augmentations; noise overlay; fixed updated rlrop; optimized german scorer; updated dataset cleaning algorithm -> include more short files; added the CV-SingleWords dataset five times because the last checkpoint had problems detecting short speech commands -> a bit more focus on training shorter words	24 7d8h (7x V100-GPU)	Test: 25.082140 Validation: 23.345149 WER: 0.161870 CER: 0.068542
DE	D17S5	test above on CommonVoice only		Test: 18.922359 WER: 0.127766 CER: 0.056331
DE	D17S5	test above on Tuda only, using all test files and full dataset cleaning		Test: 54.675545 WER: 0.245862 CER: 0.101032
DE	D17S5	test above on Tuda only, using official split (excluding Realtek recordings), only text replacements		Test: 39.755287 WER: 0.186023 CER: 0.064182
FR	CommonVoice + CssTen + LinguaLibre + Mailabs + Tatoeba + Voxforge	test with CommonVoice, others completely for training; two step frozen transfer learning; augmentations only in second step; files 10x with different augmentations; noise overlay; fixed updated rlrop; optimized scorer; updated dataset cleaning algorithm -> include more short files	14 + 34 5h + 5d13h (7x V100-GPU)	Test: 24.771702 Validation: 20.062641 WER: 0.194813 CER: 0.092049
ES	CommonVoice + CssTen + LinguaLibre + Mailabs + Tatoeba + Voxforge	like above	15 + 27 5h + 3d1h (7x V100-GPU)	Test: 21.235971 Validation: 18.722595 WER: 0.165126 CER: 0.075567
