- Clone the repository
- Create a conda environment with the dependencies:
conda env create -f environment.yaml -n [env-name] && conda activate [env-name]
- Create a folder for the pretrained models:
mkdir -p pretrained_models/audio_encoder
- Go to
and download the pretrained ResNet38 audio encoder model:
gdown "https://zenodo.org/records/3987831/files/ResNet38_mAP%3D0.434.pth?download=1" -O ResNet38.pth
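If `gdown` is not available, the same download can be sketched with the Python standard library. This is a minimal sketch, not part of the repository: the helper name `download_checkpoint` is hypothetical, and the default destination folder is an assumption based on the folder created in the previous step. The URL and the output file name come from the command above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL and output file name are taken from the gdown command in this README.
CHECKPOINT_URL = (
    "https://zenodo.org/records/3987831/files/"
    "ResNet38_mAP%3D0.434.pth?download=1"
)


def download_checkpoint(
    url: str = CHECKPOINT_URL,
    dest_dir: str = "pretrained_models/audio_encoder",  # assumed destination
    filename: str = "ResNet38.pth",
) -> Path:
    """Download `url` to dest_dir/filename unless the file already exists."""
    target = Path(dest_dir) / filename
    # Same effect as `mkdir -p`: create missing parent folders.
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(url, str(target))
    return target
```

Calling `download_checkpoint()` from the repository root would fetch the file once and skip the download on later calls.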
- Download the AudioCaps and Clotho datasets. AudioCaps can be downloaded at link, and Clotho can be downloaded at link.
- Unzip the datasets and put the wave files under
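The unzip step can be sketched in Python as follows. This is an illustration under stated assumptions, not the repository's own tooling: the helper name `extract_wavs` is hypothetical, it assumes the dataset arrives as a `.zip` archive (other archive formats need a different tool), and it assumes the wave files should end up in a single flat folder. The target folder is left as a parameter because it depends on where the training code expects the audio.

```python
import shutil
import zipfile
from pathlib import Path
from typing import List


def extract_wavs(archive: str, wav_dir: str) -> List[Path]:
    """Copy every .wav member of `archive` into `wav_dir`, dropping subfolders."""
    out = Path(wav_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # Skip directories and anything that is not a wave file.
            if info.is_dir() or not info.filename.lower().endswith(".wav"):
                continue
            # Keep only the file name so the output folder stays flat.
            target = out / Path(info.filename).name
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            copied.append(target)
    return sorted(copied)
```

Note that flattening overwrites files that share a name across subfolders, so the returned list is worth checking against the expected clip count.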
- The training config is in the setting folder
- Set the value of the dataset parameter in the config file to either "AudioCaps" or "Clotho" to train the model on the corresponding dataset.
- Run experiments:
python train.py -n [exp_name] -c m-ltm-settings
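When launching runs from a script, the training command can be assembled programmatically. The sketch below is hypothetical helper code, not part of the repository: `check_dataset` and `train_command` are illustrative names, and the only facts taken from this README are the two accepted dataset values and the `-n` and `-c` arguments of `train.py`.

```python
from typing import List

# The two dataset values this README says the config accepts.
VALID_DATASETS = ("AudioCaps", "Clotho")


def check_dataset(name: str) -> str:
    """Return `name` unchanged if it is a supported dataset, else raise."""
    if name not in VALID_DATASETS:
        raise ValueError(
            f"dataset must be one of {VALID_DATASETS}, got {name!r}"
        )
    return name


def train_command(exp_name: str, config: str = "m-ltm-settings") -> List[str]:
    """Build the argument list for `python train.py -n <exp_name> -c <config>`."""
    return ["python", "train.py", "-n", exp_name, "-c", config]
```

A run could then be started with `subprocess.run(train_command("my_exp"), check=True)` from the repository root, after confirming the config value with `check_dataset`.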
- Download the ESC50 test data from the link
- Run the evaluation:
python trainer/eval_esc50.py -c m-ltm-settings -p [pretrained model's folder]
title={Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation},
author={Manh Luong and Khai Nguyen and Nhat Ho and Reza Haf and Dinh Phung and Lizhen Qu},
booktitle={The Twelfth International Conference on Learning Representations},
- We use the model and training code from the On Metric Learning for Audio-Text Cross-Modal Retrieval GitHub repository, with some modifications.