- Clone the repository:
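A minimal sketch of this step; the repository URL and folder name are placeholders:
```bash
# Replace the placeholders with this repository's actual URL and folder name.
git clone [repo-url]
cd [repo-folder]
```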
- Create a conda environment with the dependencies:
```bash
conda env create -f environment.yaml -n [env-name] && conda activate [env-name]
```
- Create a folder for the pretrained audio encoder:
```bash
mkdir -p pretrained_models/audio_encoder
```
- Go to `pretrained_models/audio_encoder` and download the pretrained ResNet38 audio encoder model:
```bash
gdown "https://zenodo.org/records/3987831/files/ResNet38_mAP%3D0.434.pth?download=1" -O ResNet38.pth
```
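If `gdown` is not available, the same checkpoint can be fetched with plain `wget` (an equivalent alternative to the command above):
```bash
wget "https://zenodo.org/records/3987831/files/ResNet38_mAP%3D0.434.pth?download=1" -O ResNet38.pth
```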
- Download the AudioCaps and Clotho datasets. The AudioCaps dataset can be downloaded at link, and the Clotho dataset can be downloaded at link.
- Unzip the datasets and put the waveform files under `data/AudioCaps/waveforms` or `data/Clotho/waveforms`, as sketched below.
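A sketch of preparing the waveform folders; the archive names below are placeholders for whatever files the dataset links actually provide:
```bash
# Archive names are illustrative; use the files obtained from the dataset links above.
mkdir -p data/AudioCaps/waveforms data/Clotho/waveforms
unzip audiocaps_waveforms.zip -d data/AudioCaps/waveforms
unzip clotho_audio_files.zip -d data/Clotho/waveforms
```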
- The training config is in the settings folder: `settings/m-ltm-settings.yaml`.
- Set the value of the `dataset` parameter in the config file to either "AudioCaps" or "Clotho" to train the model on the AudioCaps or Clotho dataset (see the sketch below).
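A minimal sketch of that change, assuming the parameter is a top-level `dataset:` key in the YAML file (the actual layout of the config may differ):
```bash
# Assumes a top-level "dataset:" key; adjust the pattern if the config nests it differently.
sed -i 's/^dataset:.*/dataset: "Clotho"/' settings/m-ltm-settings.yaml
```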
- Run experiments:
```bash
python train.py -n [exp_name] -c m-ltm-settings
```
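For example, a training run on AudioCaps could be launched as follows (the experiment name here is arbitrary):
```bash
python train.py -n mltm_audiocaps -c m-ltm-settings
```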
- Download the ESC50 test data from the link.
- Run the evaluation:
```bash
python trainer/eval_esc50.py -c m-ltm-settings -p [pretrained model's folder]
```
If you find this work useful, please cite the paper:
```bibtex
@inproceedings{
luong2024revisiting,
  title={Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation},
  author={Manh Luong and Khai Nguyen and Nhat Ho and Reza Haf and Dinh Phung and Lizhen Qu},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=l60EM8md3t}
}
```
- Our model and training code are adapted, with some modifications, from the On Metric Learning for Audio-Text Cross-Modal Retrieval GitHub repository.