This recipe trains acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) on paired data consisting of word labels (given by their character sequences) and spoken word segments.
The training objective is based on the multiview triplet loss functions of Wanjia et al., 2016. Hard negative sampling was added in Settle et al., 2019 to speed up training (similar to src/multiview_triplet_loss_old.py). The current version (see src/multiview_triplet_loss.py) uses semi-hard negative sampling (Schroff et al., 2015) instead of hard negative sampling, and includes obj1 from Wanjia et al., 2016 in the loss.
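Semi-hard sampling keeps, for each anchor, negatives that are farther away than the positive but still inside the margin, which avoids the collapsed solutions that the very hardest negatives can cause. A minimal numpy sketch of the selection rule (the function name and margin value are illustrative, not taken from this recipe's code):

```python
import numpy as np

def semi_hard_negative(dist_row, pos_idx, labels, margin=0.4):
    """Pick a semi-hard negative index for one anchor.

    dist_row: distances from the anchor to all candidates
    pos_idx:  index of the anchor's positive match
    labels:   word labels of the candidates
    Semi-hard condition: d(a, p) < d(a, n) < d(a, p) + margin.
    """
    d_pos = dist_row[pos_idx]
    neg_mask = labels != labels[pos_idx]
    semi = neg_mask & (dist_row > d_pos) & (dist_row < d_pos + margin)
    if semi.any():
        # hardest among the semi-hard candidates: smallest distance
        cand = np.where(semi)[0]
        return cand[np.argmin(dist_row[cand])]
    # fall back to the easiest negative if none are semi-hard
    cand = np.where(neg_mask)[0]
    return cand[np.argmax(dist_row[cand])]
```

In the recipe itself this selection is applied inside the multiview triplet loss over a training batch; the sketch above only shows the per-anchor rule.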
Dependencies: python 3, pytorch 1.4, h5py, numpy, scipy
Use this link to download the dataset.
To train, edit train_config.json and run train.sh:
./train.sh
To evaluate, edit eval_config.json and run eval.sh:
./eval.sh
With the default train_config.json you should obtain the following results:
acoustic_ap = 0.79
crossview_ap = 0.75
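Both numbers are average-precision (AP) retrieval metrics: acoustic AP ranks acoustic-acoustic pairs, while cross-view AP ranks acoustic-text pairs, with pairs sharing a word label counted as relevant. A small numpy sketch of the cross-view computation (function names and the use of cosine distance are illustrative; the recipe's eval code may differ in details):

```python
import numpy as np

def average_precision(distances, relevant):
    """AP over pairs ranked by increasing distance.

    distances: 1-D array of pair distances
    relevant:  boolean array, True where the pair shares a word label
    """
    order = np.argsort(distances)
    rel = relevant[order]
    hits = np.cumsum(rel)
    # precision at each rank where a relevant pair appears
    precision_at_hit = hits[rel] / (np.where(rel)[0] + 1)
    return precision_at_hit.mean()

def crossview_ap(acoustic_emb, text_emb, acoustic_labels, text_labels):
    """AP over all acoustic/text embedding pairs under cosine distance."""
    a = acoustic_emb / np.linalg.norm(acoustic_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    dist = 1.0 - a @ t.T                      # cosine distance matrix
    rel = acoustic_labels[:, None] == text_labels[None, :]
    return average_precision(dist.ravel(), rel.ravel())
```

A cross-view AP of 1.0 would mean every acoustic embedding is closer to its own word's text embedding than to any other word's.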
This repo is forked from Shane Settle's agwe-recipe repo.