Introduction

This repository contains a Kaldi-compatible implementation of the mixup technique presented in the Interspeech 2018 paper "An Investigation of Mixup Training Strategies for Acoustic Models in ASR".

If you use this code for your research, please cite our paper:

@inproceedings{Medennikov_mixup2018,
  author={Ivan Medennikov and Yuri Khokhlov and Aleksei Romanenko and Dmitry Popov and Natalia Tomashenko and Ivan Sorokin and Alexander Zatvornitskiy},
  title={An Investigation of Mixup Training Strategies for Acoustic Models in ASR},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2903--2907},
  doi={10.21437/Interspeech.2018-2191},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2191}
}

If you have any questions about the paper or this implementation, please contact the corresponding author, Ivan Medennikov (medennikov@speechpro.com).

Licence

Apache 2.0

How to use

The nnet3-mixup-egs and nnet3-chain-mixup-egs utilities are intended to be used instead of nnet3-copy-egs and nnet3-chain-copy-egs in Kaldi training scripts. To use the mixup utilities, replace nnet3-copy-egs and/or nnet3-chain-copy-egs here

common.py, rev. eacf34a85ab7ece6a76bd73b9443bc2fe62ac6f1

method train_new_models(), line ~122

ark,bg:nnet3-copy-egs {frame_opts} {multitask_egs_opts}

with

ark,bg:nnet3-mixup-egs {frame_opts} {multitask_egs_opts}

and here

acoustic_model.py, rev. bba22b58407a3243e3fa847986753266e122d015

method train_new_models(), line ~199

ark,bg:nnet3-chain-copy-egs {multitask_egs_opts}

with

ark,bg:nnet3-chain-mixup-egs {multitask_egs_opts}

respectively.
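The two substitutions above can also be scripted instead of edited by hand. The following is a minimal sketch, not part of this repository or of Kaldi: the helper name use_mixup_utilities is hypothetical, and it assumes the command templates contain the utility names exactly as shown above.

```python
import re

# Mapping taken from the instructions above:
# original Kaldi utility -> mixup replacement.
REPLACEMENTS = {
    "nnet3-copy-egs": "nnet3-mixup-egs",
    "nnet3-chain-copy-egs": "nnet3-chain-mixup-egs",
}

# Match whole utility names only, so that longer names which merely
# contain one of the keys are left alone.
_PATTERN = re.compile(
    r"(?<![\w-])("
    + "|".join(re.escape(name) for name in sorted(REPLACEMENTS, key=len, reverse=True))
    + r")(?![\w-])"
)


def use_mixup_utilities(command_template: str) -> str:
    """Return the template with the copy-egs utilities swapped for mixup ones."""
    return _PATTERN.sub(lambda match: REPLACEMENTS[match.group(1)], command_template)


if __name__ == "__main__":
    print(use_mixup_utilities("ark,bg:nnet3-copy-egs {frame_opts} {multitask_egs_opts}"))
    print(use_mixup_utilities("ark,bg:nnet3-chain-copy-egs {multitask_egs_opts}"))
```

Placeholders such as {frame_opts} are left untouched, so the rewritten template can still be formatted by the training script as before.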

Installation guide

Prerequisites

Install boost

$ sudo apt-get install libboost-all-dev

Install CMake

$ sudo apt-get install cmake

Install git

$ sudo apt-get install git

Building project

Clone mixup project repository

$ git clone https://github.com/speechpro/mixup.git

$ cd mixup

Clone Kaldi submodule

$ git submodule init

$ git submodule update

Build Kaldi dependencies

$ cd kaldi/tools

$ make

or, to speed up the build, run:

$ make -j $(nproc)

In case of errors, or to check the prerequisites for Kaldi, see the INSTALL file.

Build Kaldi

$ cd ../src

$ ./configure --shared

$ make depend -j $(nproc)

$ make -j $(nproc)

In case of errors, or for additional build options, see the INSTALL file.

Generate mixup project

$ cd ../..

$ mkdir build

$ cd build

$ cmake ..

Build mixup modules

$ make -j $(nproc)

Install mixup modules

$ make install

This operation places the mixup modules into the corresponding Kaldi binary folders.

You may need to add the line

export LD_LIBRARY_PATH=$KALDI_ROOT/src/lib:$KALDI_ROOT/tools/openfst/lib:$LD_LIBRARY_PATH

to your path.sh.
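After installation, it can be useful to confirm that the two utilities are actually visible from the environment your recipe runs in. The sketch below is an assumption-laden check, not part of this repository: it presumes that path.sh has been sourced and that the Kaldi binary folders are on PATH, and the function name locate_mixup_utilities is hypothetical.

```python
import shutil

# Utility names as given in this README.
MIXUP_UTILITIES = ("nnet3-mixup-egs", "nnet3-chain-mixup-egs")


def locate_mixup_utilities(path=None):
    """Map each utility name to its resolved location, or None if not found.

    `path` is a search path string in PATH format; None means use the
    PATH of the current process.
    """
    return {name: shutil.which(name, path=path) for name in MIXUP_UTILITIES}


if __name__ == "__main__":
    for name, location in locate_mixup_utilities().items():
        print(f"{name}: {location or 'not found on PATH'}")
```

This only checks that the executables can be found; it does not verify that their shared libraries load, which is what the LD_LIBRARY_PATH line above addresses.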

Program options

The mixup utilities have a number of parameters and modes of operation. To simplify embedding them into existing scripts, every parameter can be passed in two equivalent ways: as a command-line option or as an environment variable.

A detailed explanation of the parameters, and an investigation of mixup effectiveness in the various operation modes, can be found in [1].
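The option tables in this section follow a regular naming pattern: each command-line option --some-name has an environment variable MIXUP_SOME_NAME. The sketch below relies on that observed pattern to express one set of settings in either form. The helper names are hypothetical, the alpha value in the demo is an arbitrary placeholder, and nothing here is taken from the utilities' source code.

```python
def option_to_env_name(option: str) -> str:
    """'--mix-mode' -> 'MIXUP_MIX_MODE', following the pattern in the tables."""
    return "MIXUP_" + option.lstrip("-").replace("-", "_").upper()


def as_cli_args(options: dict) -> list:
    """Render settings as '--name=value' command-line arguments."""
    return [f"{name}={value}" for name, value in options.items()]


def as_environment(options: dict) -> dict:
    """Render the same settings as environment variable assignments."""
    return {option_to_env_name(name): str(value) for name, value in options.items()}


if __name__ == "__main__":
    # Placeholder settings for illustration only.
    settings = {"--mix-mode": "local", "--distrib": "beta:0.5"}
    print(as_cli_args(settings))
    print(as_environment(settings))
```

The environment form is convenient when the training script cannot be edited to add arguments: the dictionary from as_environment can be merged into a copy of os.environ and passed as env to subprocess.run.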

nnet3-mixup-egs

Command line	Environment variable	Allowable values	Default	Meaning
--mix-mode	MIXUP_MIX_MODE	local, global, class, shift	global	Mixup mode
--distrib	MIXUP_DISTRIB	uniform:min,max, beta:alpha, beta2:alpha	uniform:0.0,0.5	Mixup scaling factors distribution
--transform	MIXUP_TRANSFORM	"", sigmoid:k	""	Mixup scaling factor transform function for labels
--min-num	MIXUP_MIN_NUM	integer > 0	1	Minimum number of admixtures
--max-num	MIXUP_MAX_NUM	integer >= min-num	1	Maximum number of admixtures
--min-shift	MIXUP_MIN_SHIFT	integer > 0	1	Minimum sequence shift size (shift mode)
--max-shift	MIXUP_MAX_SHIFT	integer >= min-shift	3	Maximum sequence shift size (shift mode)
--fixed-egs	MIXUP_FIXED_EGS	float in the range [0, 1]	0.1	Portion of examples to leave untouched
--fixed-frames	MIXUP_FIXED_FRAMES	float in the range [0, 1]	0.1	Portion of frames to leave untouched
--left-range	MIXUP_LEFT_RANGE	integer > 0	3	Left range to pick an admixture frame (local mode)
--right-range	MIXUP_RIGHT_RANGE	integer > 0	3	Right range to pick an admixture frame (local mode)
--buff-size	MIXUP_BUFF_SIZE	integer > 0	500	Buffer size for data shuffling (global mode)
--compress	MIXUP_COMPRESS	0, 1	0	Compress features and i-vectors
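To make the roles of the scaling factor and the admixture concrete, here is a minimal sketch of mixing one feature frame with one admixture frame. It assumes the standard mixup formulation, in which the admixture is weighted by the scaling factor s and the original example by (1 - s); the utilities' actual code may differ, and label mixing, multiple admixtures and the mode-specific frame selection are not covered. All function names are hypothetical.

```python
import random


def sample_uniform_scale(low=0.0, high=0.5, rng=random):
    """Draw a scaling factor as in --distrib=uniform:min,max (default 0.0,0.5)."""
    return rng.uniform(low, high)


def mix_frames(example, admixture, scale):
    """Blend two equal-length feature frames.

    Assumption: result = (1 - scale) * example + scale * admixture,
    i.e. the usual mixup interpolation.
    """
    if not 0.0 <= scale <= 1.0:
        raise ValueError("scale must lie in [0, 1]")
    if len(example) != len(admixture):
        raise ValueError("frames must have the same dimension")
    return [(1.0 - scale) * x + scale * a for x, a in zip(example, admixture)]


if __name__ == "__main__":
    scale = sample_uniform_scale()
    print(scale, mix_frames([1.0, 2.0], [3.0, 4.0], scale))
```

Under this formulation, the default uniform:0.0,0.5 range keeps the original example as the dominant component of every mixed frame.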

nnet3-chain-mixup-egs

Command line	Environment variable	Allowable values	Default	Meaning
--mix-mode	MIXUP_MIX_MODE	global, shift	global	Mixup mode
--distrib	MIXUP_DISTRIB	uniform:min,max, beta:alpha, beta2:alpha	uniform:0.0,0.5	Mixup scaling factors distribution*
--scale-fst-algo	MIXUP_SCALE_FST_ALGO	"", default[:scale[,eps]], balanced[:scale[,eps]]	""	Scale supervision FSTs algorithm**
--swap-scales	MIXUP_SWAP_SCALES	true, false	false	Swap supervision FST scales
--max-super	MIXUP_MAX_SUPER	true, false	false	Get supervision from example with maximum scale
--min-shift	MIXUP_MIN_SHIFT	integer > 0	1	Minimum sequence shift size (shift mode)
--max-shift	MIXUP_MAX_SHIFT	integer >= min-shift	3	Maximum sequence shift size (shift mode)
--fixed	MIXUP_FIXED	float in the range [0, 1]	0.1	The portion of the data to leave untouched
--buff-size	MIXUP_BUFF_SIZE	integer > 0	500	Buffer size for data shuffling (global mode)
--frame-shift	MIXUP_FRAME_SHIFT	integer >= 0	0	Shifts time values in the supervision data (excluding iVector data), which is useful for data augmentation. Note that the outputs will remain at the closest exact multiples of the frame subsampling factor
--compress	MIXUP_COMPRESS	0, 1	0	Compress features and i-vectors
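The --distrib value is a small specification string of the form name:parameters. The sketch below shows one way a wrapper script might validate such a string before launching a long training run. It is based only on the three forms listed in the tables (uniform:min,max, beta:alpha, beta2:alpha); the function name parse_distrib and the specific checks are assumptions, not the utilities' own parser.

```python
# Number of numeric parameters expected for each distribution name,
# as listed in the "Allowable values" column.
_EXPECTED_PARAMS = {"uniform": 2, "beta": 1, "beta2": 1}


def parse_distrib(spec: str):
    """Parse 'uniform:min,max', 'beta:alpha' or 'beta2:alpha'.

    Returns (name, params) where params is a tuple of floats.
    Raises ValueError for anything that does not match those forms.
    """
    name, separator, arguments = spec.partition(":")
    if not separator or name not in _EXPECTED_PARAMS:
        raise ValueError(f"unknown distribution spec: {spec!r}")
    params = tuple(float(part) for part in arguments.split(","))
    if len(params) != _EXPECTED_PARAMS[name]:
        raise ValueError(f"wrong number of parameters in {spec!r}")
    if name == "uniform" and params[0] > params[1]:
        raise ValueError(f"min exceeds max in {spec!r}")
    return name, params


if __name__ == "__main__":
    print(parse_distrib("uniform:0.0,0.5"))
```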

* Mixup scaling factors distribution. With --distrib=beta:alpha, the scaling factor is drawn from the standard beta distribution with symmetric shape (β = α). With --distrib=beta2:alpha, a modified beta distribution is used: if the sampled value ρ is greater than 0.5, (1 - ρ) is used instead.

float RandomScaleBeta2::Value() {
    const float value = (*distrib)(rand_gen);
    if (value <= 0.5) {
        return value;
    } else {
        return (1.0 - value);
    }
}
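The same folding rule can be tried out quickly in Python to see how beta2 differs from beta. This is an illustrative re-statement of the C++ snippet above using the standard library's random.betavariate; the function names are hypothetical and the alpha value in the demo is an arbitrary placeholder.

```python
import random


def sample_beta(alpha, rng=random):
    """--distrib=beta:alpha: symmetric beta distribution (beta equals alpha)."""
    return rng.betavariate(alpha, alpha)


def sample_beta2(alpha, rng=random):
    """--distrib=beta2:alpha: values above 0.5 are folded back as 1 - value."""
    value = rng.betavariate(alpha, alpha)
    return value if value <= 0.5 else 1.0 - value


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [sample_beta2(0.4, rng) for _ in range(5)]
    print(samples)  # every value lies in [0, 0.5]
```

Because of the fold, beta2 scaling factors never exceed 0.5, whereas plain beta samples cover the whole [0, 1] range.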

** Scale supervision FSTs algorithm. When merging supervision FSTs, we apply an epsilon restriction as follows. If the scaling factor is less than eps, the example FST is left unchanged. If 1.0 minus the scaling factor is less than eps, the admixture FST is used instead of the fusion. The default value of eps is 0.001.

void ExampleMixer::FuseGraphs(const fst_t& _admixture, float _admx_scale, fst_t& _example) const {
    if (_admx_scale < scale_eps) {
        return;
    } else if ((1.0 - _admx_scale) < scale_eps) {
        _example = _admixture;
        return;
    }
    ...
    ...
}
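The epsilon restriction amounts to a three-way decision on the admixture scale. The sketch below reproduces only that decision, mirroring the two early returns in the snippet above; the fusion itself is elided in the original and is not reconstructed here. The function name and the returned labels are hypothetical.

```python
DEFAULT_SCALE_EPS = 0.001  # default eps stated above


def fusion_action(admixture_scale, scale_eps=DEFAULT_SCALE_EPS):
    """Decide what happens to the example FST for a given admixture scale."""
    if admixture_scale < scale_eps:
        # Admixture weight is negligible: leave the example FST unchanged.
        return "keep_example"
    if (1.0 - admixture_scale) < scale_eps:
        # Example weight is negligible: use the admixture FST as is.
        return "use_admixture"
    # Otherwise the two FSTs are actually fused (not shown here).
    return "fuse"


if __name__ == "__main__":
    for scale in (0.0005, 0.3, 0.9995):
        print(scale, fusion_action(scale))
```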

References

[1] Ivan Medennikov, Yuri Khokhlov, Aleksei Romanenko, Dmitry Popov, Natalia Tomashenko, Ivan Sorokin, Alexander Zatvornitskiy, "An investigation of mixup training strategies for acoustic models in ASR", Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2018

[2] Natalia Tomashenko, Yuri Khokhlov, Yannick Estève, "Speaker Adaptive Training and Mixup Regularization for Neural Network Acoustic Models in Automatic Speech Recognition", Proc. Interspeech 2018, pp. 2414-2418, DOI: 10.21437/Interspeech.2018-2209