In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, a task known as Automated Audio Captioning (AAC). However, collecting a sufficient number of paired audio clips and captions is labor-intensive and time-consuming. Motivated by recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages the similarity between audio and text embeddings in CLAP. During training, we learn to reconstruct the text from the CLAP text embedding, and during inference, we decode using the audio embeddings. To mitigate the modality gap between the audio and text embeddings, we employ strategies to bridge the gap during both the training and inference stages. We evaluate our proposed method on the Clotho and AudioCaps datasets, demonstrating its ability to achieve up to 80% of the performance attained by fully supervised approaches trained on paired target data.
Clone the repository, create the environment, and install the dependencies:
git clone && cd wsac
conda create --name wsac --file requirements.txt
conda activate wsac
Prepare JSON files with captions from the Clotho and AudioCaps datasets:
cd data
python3 /path/to/clotho/train.csv /path/to/audiocaps/train.csv
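To make the caption-preparation step concrete, here is a minimal sketch of how captions might be pulled out of a caption CSV and serialized as JSON. It is not the repository's script: the helper name `collect_captions`, the rule of reading every column whose name starts with `caption`, and the flat-list JSON layout are all assumptions made for illustration, and the sample rows are invented.

```python
import csv
import io
import json


def collect_captions(csv_file):
    """Return every non-empty value from columns whose name starts with 'caption'.

    Hypothetical helper: the column-name rule is an assumption, not the
    repository's actual logic.
    """
    reader = csv.DictReader(csv_file)
    captions = []
    for row in reader:
        for column, value in row.items():
            # Skip overflow fields (column is None) and empty cells.
            if column and column.startswith("caption") and value and value.strip():
                captions.append(value.strip())
    return captions


# Tiny in-memory stand-in for a caption CSV (invented rows).
sample = io.StringIO(
    "file_name,caption_1,caption_2\n"
    "a.wav,A dog barks,A dog is barking loudly\n"
    "b.wav,Rain falls on a roof,\n"
)
captions = collect_captions(sample)

# Assumed output layout: a flat JSON list of caption strings.
captions_json = json.dumps(captions, indent=2)
```

With a real file, `sample` would be replaced by `open(path, newline="", encoding="utf-8")`; the JSON layout expected by the training script may differ from the flat list assumed here.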
Download the CLAP model trained on the WavCaps dataset from link.
Train the model using script:
python3 --data data/clotho.json --out_dir trained_models/clotho
The configurations / hyperparameters are the following:
--data Path to training data captions (clotho.json / audiocaps.json)
--clap_path Path to CLAP model weights
--out_dir Dir to save trained models
--prefix Prefix for saved filenames
--modality_gap_path Path to pickled modality gap vector
--epochs Number of epochs to train
--bs Batch size
--lr Learning rate
--warmup Number of warm-up steps
--wd Weight decay factor
--noise Noise variance
- For Noise Injection training as described in [1], set --noise equal to the desired noise variance.
- For Embedding Shift training, set --modality_gap_path to the path of a pickled modality gap vector.
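The two training-time strategies can be pictured with a small, dependency-free sketch. It rests on several assumptions that the text above does not confirm: that noise injection adds zero-mean Gaussian noise with the given variance to the text embedding, that the modality gap vector is the difference between the mean audio embedding and the mean text embedding, that embedding shift adds this vector to the text embedding, and that embeddings are re-normalized afterwards. All function names are hypothetical.

```python
import math
import pickle
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def inject_noise(text_emb, variance, rng):
    """Assumed noise injection: add N(0, variance) noise, then re-normalize."""
    std = math.sqrt(variance)
    return l2_normalize([x + rng.gauss(0.0, std) for x in text_emb])


def modality_gap(audio_embs, text_embs):
    """Assumed gap definition: mean audio embedding minus mean text embedding."""
    dim = len(audio_embs[0])
    mean_audio = [sum(e[i] for e in audio_embs) / len(audio_embs) for i in range(dim)]
    mean_text = [sum(e[i] for e in text_embs) / len(text_embs) for i in range(dim)]
    return [a - t for a, t in zip(mean_audio, mean_text)]


def shift_embedding(text_emb, gap):
    """Assumed embedding shift: move the text embedding along the gap vector."""
    return l2_normalize([x + g for x, g in zip(text_emb, gap)])


rng = random.Random(0)
text_emb = l2_normalize([1.0, 2.0, 2.0])

# Noise injection with variance 0.01 (illustrative value only).
noisy = inject_noise(text_emb, 0.01, rng)

# Toy gap vector, round-tripped through pickle as --modality_gap_path suggests.
gap = modality_gap([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
                   [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gap_bytes = pickle.dumps(gap)
shifted = shift_embedding(text_emb, pickle.loads(gap_bytes))
```

In both cases the aim is the same: the decoder sees text embeddings during training that look more like the audio embeddings it will receive at inference time.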
Evaluate the trained models using script --model_path path/to/trained/ --clap_path path/to/clap/ --dataset clotho --eval_dir path/to/testset/waveforms
The arguments of the evaluation script are the following:
--model_path Path to trained model
--clap_path Path to CLAP model
--dataset Dataset name (clotho/audiocaps)
--eval_dir Path to directory with waveforms from the test set
--method Which inference strategy to use for decoding (ad/nnd/pd)
--mem Path to JSON file with the text used for memory construction
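As a rough illustration of how a text memory can bridge the modality gap at inference time, the sketch below shows two memory-based lookups. It assumes, without confirmation from the text above, that `nnd` denotes nearest-neighbour decoding (replace the audio embedding with the most similar memory text embedding) and that `pd` denotes projection-based decoding in the style of DeCap (a softmax-weighted combination of memory embeddings). The function names and the `temperature` value are hypothetical, and the `ad` option is not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor_embedding(audio_emb, memory):
    """Pick the memory text embedding most similar to the audio embedding."""
    return max(memory, key=lambda text_emb: cosine(audio_emb, text_emb))


def projected_embedding(audio_emb, memory, temperature=0.1):
    """Softmax-weighted average of memory embeddings, re-normalized to unit length."""
    scores = [cosine(audio_emb, text_emb) / temperature for text_emb in memory]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(memory[0])
    combined = [
        sum(w * text_emb[i] for w, text_emb in zip(weights, memory)) / total
        for i in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in combined))
    return [x / norm for x in combined]


# Toy memory of two text embeddings and one audio embedding (invented values).
memory = [[1.0, 0.0], [0.0, 1.0]]
audio_emb = [0.9, 0.1]

nearest = nearest_neighbor_embedding(audio_emb, memory)
projected = projected_embedding(audio_emb, memory)
```

Either result would then be passed to the text decoder in place of the raw audio embedding, so that the decoder receives an input lying in the text region of the embedding space it was trained on.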
The code used in this repo is heavily based on DeCap and CapDec.