
Audio Retrieval with Natural Language Queries

This repository contains the PyTorch implementation of "Audio Retrieval with Natural Language Queries" (INTERSPEECH 2021). It builds on the Use What You Have: Video retrieval using representations from collaborative experts repository. The datasets used in the paper are AudioCaps, CLOTHO, Activity-Net, and QuerYD.

More information can be found at our project page: https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/


❗ An extension of this work along with the new SoundDescs dataset for audio retrieval can be found here. ❗

Requirements

We used PyTorch 1.7.1, CUDA 10.1, and Python 3.7 to generate the results and models. The libraries required to run this code are listed in requirements/requirements.txt.

conda create --name audio-retrieval python=3.7
conda activate audio-retrieval
pip install -r requirements/requirements.txt
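
To confirm that the installed versions match the ones above, you can query PyTorch directly (torch.__version__ and torch.version.cuda are part of the standard PyTorch API):

python -c "import torch; print(torch.__version__, torch.version.cuda)"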

To run the code below, features extracted from the various datasets need to be downloaded. If there is not enough space in your working location to store some of these features (the AudioCaps file is 6GB, while the others are under 1GB each), create a folder called data inside this repository as a symlink to a folder with enough space. For example, run the following from the audio-experts code base:

ln -s <path-where-data-can-be-saved> data
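
You can check that the symlink resolves to the intended location with the standard readlink utility:

readlink data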

To download the features for each dataset, follow the steps here.

Evaluating a pretrained model on multiple seeds and reproducing results

To reproduce the results in the tables below, multiple models trained with different seeds need to be downloaded and evaluated on the test sets.

The steps needed to reproduce the results are:

  1. Select the experiment to be reproduced, which has the form <dataset-name>-<config-file-name>. Tables mapping experiments to names of this form can be found in misc/exps-names.md.
  2. Download the features and splits corresponding to the dataset for which the experiment is run. For example, for AudioCaps run:
# fetch the pretrained experts for AudioCaps 
python3 misc/sync_experts.py --dataset AudioCaps

Additional examples for the datasets used in this paper can be found in misc/exps-names.md.

  3. Run the eval.py script.

For example, to reproduce the experiments for AudioCaps with all visual and audio experts, run the following line:

python eval.py --experiment audiocaps-train-full-ce-r2p1d-inst-vggish-vggsound

If the --experiment flag is not provided, the eval.py script will download and evaluate all models on the test set.
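For instance, the following invocation (with no --experiment flag) downloads and evaluates every model:

python eval.py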

Training a new model

Training a new audio-text embedding requires:

  1. The pretrained experts for the dataset used for training, which should be located in <root>/data/<dataset-name>/symlinked-feats (this is handled automatically by the misc/sync_experts.py utility script, or can be done manually). Examples can be found in misc/exps-names.md.
  2. A config.json file. You can define your own, or use one of the provided configs in the configs directory.

Training is then performed with the following command:

python3 train.py --config <path-to-config.json> --device <gpu-id>

where <gpu-id> is the index of the GPU to train on. This option can be omitted to train on the CPU.
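
For instance, omitting --device runs the same training on the CPU, using one of the provided configs:

python3 train.py --config configs/clotho/train-vggish-vggsound.json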

For example, to train a new embedding for the CLOTHO dataset, run the following sequence of commands:

# fetch the pretrained experts for CLOTHO 
python3 misc/sync_experts.py --dataset CLOTHO

# Train the model
python3 train.py --config configs/clotho/train-vggish-vggsound.json --device 0

AudioCaps

These are the retrieval results obtained for the AudioCaps dataset when using only audio experts. Values are reported as mean (standard deviation) across the training seeds. R@K is recall at rank K (higher is better), MdR and MnR are the median and mean rank of the correct item (lower is better), and Geom is the geometric mean of R@1, R@5, and R@10. The t2v and v2t tasks denote retrieval of audio(/video) content from text queries and retrieval of text from audio(/video) queries, respectively.

| Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE - VGGish | t2v | 18.0(0.2) | 46.8(0.2) | 62.0(0.5) | 88.5(0.2) | 6.0(0.0) | 23.6(1.3) | 37.4(0.2) | 7.39M | config, model, log |
| CE - VGGish | v2t | 21.0(0.8) | 48.3(1.8) | 62.7(1.6) | 87.3(0.4) | 6.0(0.0) | 27.4(1.2) | 39.9(0.6) | 7.39M | config, model, log |
| CE - VGGSound | t2v | 20.5(0.6) | 52.1(0.4) | 67.0(1.0) | 91.1(1.6) | 5.0(0.0) | 20.6(2.8) | 41.5(0.7) | 12.12M | config, model, log |
| CE - VGGSound | v2t | 24.6(0.9) | 55.9(0.3) | 70.4(0.4) | 92.4(0.6) | 4.3(0.6) | 19.9(1.4) | 45.9(0.6) | 12.12M | config, model, log |
| CE - VGGish + VGGSound | t2v | 23.1(0.8) | 55.1(0.9) | 70.7(0.7) | 92.9(0.5) | 4.7(0.6) | 16.5(0.6) | 44.8(0.8) | 21.86M | config, model, log |
| CE - VGGish + VGGSound | v2t | 25.1(0.9) | 57.1(1.0) | 73.2(1.6) | 92.5(0.2) | 4.0(0.0) | 17.0(0.1) | 47.2(1.1) | 21.86M | config, model, log |
| MoEE - VGGish + VGGSound | t2v | 22.5(0.3) | 54.4(0.6) | 69.5(0.9) | 92.4(0.4) | 5.0(0.0) | 17.8(1.1) | 44.0(0.4) | 8.9M | config, model, log |
| MoEE - VGGish + VGGSound | v2t | 25.1(0.8) | 57.5(1.4) | 72.9(1.2) | 93.2(0.8) | 4.0(0.0) | 15.6(0.5) | 47.2(1.0) | 8.9M | config, model, log |
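
For reference, the ranking metrics reported in the tables of this section can be computed from a similarity matrix between text and audio embeddings. The following is a minimal PyTorch sketch, assuming one ground-truth item per query; the evaluation code in this repository handles details such as multiple captions per audio clip, so treat this purely as an illustration:

import torch

def retrieval_metrics(sims):
    # sims: (num_queries, num_items) similarity matrix, where query i is
    # assumed to match item i (an illustrative simplification)
    num_queries = sims.size(0)
    # sort items for each query from most to least similar
    order = sims.argsort(dim=1, descending=True)
    ground_truth = torch.arange(num_queries).unsqueeze(1)
    # position of the ground-truth item in each sorted row (0 = rank 1)
    ranks = (order == ground_truth).float().argmax(dim=1).float()
    metrics = {f"R@{k}": 100.0 * (ranks < k).float().mean().item() for k in (1, 5, 10, 50)}
    metrics["MdR"] = (ranks.median() + 1).item()  # median rank, 1-indexed
    metrics["MnR"] = (ranks.mean() + 1).item()    # mean rank, 1-indexed
    # geometric mean of R@1, R@5 and R@10, as reported in the Geom column
    metrics["Geom"] = (metrics["R@1"] * metrics["R@5"] * metrics["R@10"]) ** (1.0 / 3.0)
    return metrics

# example: metrics for random similarities over 100 queries
print(retrieval_metrics(torch.randn(100, 100)))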

Using only visual experts for AudioCaps:

| Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE - Scene | t2v | 6.1(0.4) | 22.6(0.9) | 35.8(0.6) | 69.8(0.4) | 19.3(0.6) | 69.3(5.7) | 17.0(0.5) | 7.51M | config, model, log |
| CE - Scene | v2t | 6.5(0.8) | 21.8(1.2) | 31.3(1.6) | 63.5(2.1) | 26.1(2.6) | 121.1(3.1) | 16.4(1.0) | 7.51M | config, model, log |
| CE - R2P1D | t2v | 8.2(0.5) | 28.9(0.8) | 44.7(0.9) | 76.6(1.3) | 12.7(0.6) | 58.3(9.2) | 22.0(0.8) | 6.21M | config, model, log |
| CE - R2P1D | v2t | 10.3(0.4) | 28.7(1.5) | 41.8(3.1) | 75.6(1.3) | 15.4(1.5) | 82.0(7.9) | 23.1(0.9) | 6.21M | config, model, log |
| CE - Inst | t2v | 7.7(0.2) | 29.4(1.3) | 46.7(1.3) | 79.3(0.6) | 11.7(0.6) | 50.8(3.2) | 21.9(0.7) | 7.38M | config, model, log |
| CE - Inst | v2t | 9.8(0.9) | 28.0(0.7) | 40.6(0.7) | 74.2(2.1) | 16.3(0.6) | 89.4(3.4) | 22.3(0.7) | 7.38M | config, model, log |
| CE - Scene + R2P1D | t2v | 8.8(0.1) | 31.5(0.5) | 46.8(0.1) | 77.1(2.4) | 12.0(0.0) | 57.8(8.5) | 23.5(0.2) | 16.07M | config, model, log |
| CE - Scene + R2P1D | v2t | 11.0(0.6) | 31.3(1.7) | 45.1(1.7) | 75.9(0.9) | 13.0(1.0) | 73.0(5.2) | 25.0(1.2) | 16.07M | config, model, log |
| CE - Scene + Inst | t2v | 8.7(0.5) | 30.4(0.9) | 47.4(0.5) | 78.8(1.4) | 11.7(0.6) | 53.0(6.4) | 23.2(0.7) | 17.25M | config, model, log |
| CE - Scene + Inst | v2t | 10.6(0.6) | 28.0(1.6) | 41.4(1.5) | 74.6(1.0) | 15.3(1.2) | 85.1(0.6) | 23.1(1.2) | 17.25M | config, model, log |
| CE - R2P1D + Inst | t2v | 10.1(0.2) | 33.2(0.7) | 49.6(1.1) | 77.9(2.3) | 10.7(0.6) | 57.8(8.1) | 25.5(0.2) | 15.95M | config, model, log |
| CE - R2P1D + Inst | v2t | 12.1(0.4) | 32.2(0.7) | 46.1(1.3) | 78.0(0.8) | 12.8(0.7) | 71.8(4.5) | 26.2(0.5) | 15.95M | config, model, log |

Visual and audio experts for AudioCaps:

| Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE - R2P1D + Inst + VGGish | t2v | 23.9(0.7) | 58.8(0.2) | 74.4(0.2) | 94.5(0.2) | 4.0(0.0) | 14.0(0.7) | 47.1(0.5) | 23.32M | config, model, log |
| CE - R2P1D + Inst + VGGish | v2t | 29.0(2.0) | 63.5(2.5) | 77.2(1.9) | 95.0(0.1) | 3.0(0.0) | 12.7(0.1) | 52.2(2.2) | 23.32M | config, model, log |
| CE - R2P1D + Inst + VGGSound | t2v | 27.4(0.7) | 62.8(0.7) | 78.2(0.3) | 94.9(0.3) | 3.0(0.0) | 13.1(0.6) | 51.3(0.5) | 28.05M | config, model, log |
| CE - R2P1D + Inst + VGGSound | v2t | 34.0(1.5) | 68.5(1.3) | 82.5(1.2) | 97.3(0.4) | 2.7(0.6) | 9.1(0.3) | 57.7(1.3) | 28.05M | config, model, log |
| CE - R2P1D + Inst + VGGish + VGGSound | t2v | 28.1(0.6) | 64.0(0.5) | 79.0(0.5) | 95.4(0.6) | 3.0(0.0) | 12.1(1.1) | 52.2(0.4) | 35.43M | config, model, log |
| CE - R2P1D + Inst + VGGish + VGGSound | v2t | 33.7(1.6) | 70.2(0.8) | 83.7(0.4) | 97.5(0.1) | 2.7(0.3) | 8.1(0.4) | 58.3(1.2) | 35.43M | config, model, log |

CLOTHO

| Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE - VGGish | t2v | 4.0(0.2) | 15.0(0.9) | 25.4(0.5) | 61.4(1.1) | 31.7(1.5) | 78.2(2.2) | 11.5(0.5) | 7.39M | config, model, log |
| CE - VGGish | v2t | 4.8(0.4) | 15.9(1.8) | 25.8(1.7) | 57.5(2.5) | 35.7(2.5) | 106.6(5.7) | 12.5(1.0) | 7.39M | config, model, log |
| CE - VGGish + VGGSound | t2v | 6.7(0.4) | 21.6(0.6) | 33.2(0.3) | 69.8(0.3) | 22.3(0.6) | 58.3(1.1) | 16.9(0.2) | 21.86M | config, model, log |
| CE - VGGish + VGGSound | v2t | 7.1(0.3) | 22.7(0.6) | 34.6(0.5) | 67.9(2.3) | 21.3(0.6) | 72.6(3.4) | 17.7(0.4) | 21.86M | config, model, log |
| MoEE - VGGish + VGGSound | t2v | 6.0(0.1) | 20.8(0.7) | 32.3(0.3) | 68.5(0.5) | 23.0(0.0) | 60.2(0.8) | 16.0(0.3) | 8.9M | config, model, log |
| MoEE - VGGish + VGGSound | v2t | 7.2(0.5) | 22.1(0.7) | 33.2(1.1) | 67.4(0.3) | 22.7(0.6) | 71.8(2.3) | 17.4(0.7) | 8.9M | config, model, log |

Pretraining on AudioCaps, finetuning on CLOTHO

| Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE - VGGish + VGGSound | t2v | 9.6(0.3) | 27.7(0.5) | 40.1(0.7) | 75.0(0.8) | 17.0(1.0) | 48.4(0.7) | 22.0(0.3) | 21.86M | config, model, log |
| CE - VGGish + VGGSound | v2t | 10.7(0.6) | 29.0(1.9) | 40.8(1.4) | 73.5(2.5) | 16.0(1.7) | 58.9(3.8) | 23.3(1.1) | 21.86M | config, model, log |
| MoEE - VGGish + VGGSound | t2v | 8.6(0.4) | 27.0(0.5) | 39.3(0.7) | 74.4(0.5) | 17.3(0.6) | 49.0(1.0) | 20.9(0.5) | 8.9M | config, model, log |
| MoEE - VGGish + VGGSound | v2t | 10.0(0.3) | 27.7(0.9) | 40.1(1.3) | 73.5(1.0) | 16.0(1.0) | 55.9(1.8) | 22.3(0.0) | 8.9M | config, model, log |

Visual-centric datasets

| Experts | Dataset | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE - VGGish | QuerYD | t2v | 3.7(0.2) | 11.7(0.4) | 17.3(0.6) | 36.3(0.3) | 115.5(5.2) | 273.5(6.7) | 9.0(0.0) | 7.39M | config, model, log |
| CE - VGGish | QuerYD | v2t | 3.8(0.2) | 11.5(0.4) | 16.8(0.2) | 35.2(0.4) | 116.3(2.1) | 271.9(5.8) | 9.0(0.3) | 7.39M | config, model, log |
| CE - VGGish | Activity-Net | t2v | 1.5(0.1) | 5.6(0.2) | 9.2(0.3) | 22.1(1.2) | 373.0(46.5) | 907.8(56.2) | 4.0(0.1) | 7.39M | config, model, log |
| CE - VGGish | Activity-Net | v2t | 1.4(0.1) | 5.3(0.1) | 8.5(0.3) | 21.9(1.3) | 370.0(40.5) | 912.1(51.6) | 4.3(0.1) | 7.39M | config, model, log |


References

[1] If you find this code useful, please consider citing:

@inproceedings{Oncescu21a,
  author    = {Oncescu, A.-M. and Koepke, A.S. and Henriques, J. and Akata, Z. and Albanie, S.},
  title     = {Audio Retrieval with Natural Language Queries},
  booktitle = {INTERSPEECH},
  year      = {2021},
}

[2] This code builds on the collaborative experts repository; please also consider citing:

@article{Liu2019a,
  author  = {Liu, Y. and Albanie, S. and Nagrani, A. and Zisserman, A.},
  title   = {Use What You Have: Video retrieval using representations from collaborative experts},
  journal = {arXiv preprint arXiv:1907.13487},
  year    = {2019},
}