This is the official repository for the technical report "A Whisper transformer for audio captioning trained with synthetic captions and transfer learning".
This repository serves to train and evaluate the Whisper model for general audio-scene captioning. The input is a short audio clip, and the output is a brief text description of what is happening.
You can find our checkpoints on Huggingface:
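As a minimal usage sketch (assuming the transformers and librosa packages; the checkpoint identifier below is a placeholder, substitute one of the published checkpoints):

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# NOTE: placeholder id -- substitute one of the published captioning checkpoints
checkpoint = "your-org/whisper-audio-captioning"

processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
model.eval()

# load a short clip and resample to the 16 kHz expected by Whisper
audio, sr = librosa.load("example.wav", sr=16_000)
features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features

with torch.no_grad():
    generated = model.generate(features, max_length=100)

print(processor.batch_decode(generated, skip_special_tokens=True)[0])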
If you find our work useful, cite us as follows:
@misc{kadlčík2023whisper,
title={A Whisper transformer for audio captioning trained with synthetic captions and transfer learning},
author={Marek Kadlčík and Adam Hájek and Jürgen Kieslich and Radosław Winiecki},
year={2023},
eprint={2305.09690},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
Start by creating a conda environment:
git clone --recursive ... # recursive because there is `evaluation_tools` as git submodule
cd audio-captioning
conda create -n malach23 python=3.8
conda activate malach23
pip install -r requirements.txt
pip install -e .
If the last command fails, update your pip first, e.g. pip install --upgrade pip
After you have the environment ready, run the script inside audiocap/evaluation_tools:
chmod +x audiocap/evaluation_tools/coco_caption/get_stanford_models.sh
./audiocap/evaluation_tools/coco_caption/get_stanford_models.sh
This will download the data necessary for computing evaluation metrics.
We train on multiple datasets: AudioSet (our selected subset), AudioCaps, and finally Clotho. To make it simple to work with multiple datasets, after downloading them we convert them into a file structure that is as uniform as possible. We call it AudioFolder, because it is inspired by HuggingFace's AudioFolder and ImageFolder.
While the datasets are not completely compatible (e.g. one caption vs. multiple captions per audio clip), the AudioFolder structure and the Python class audiocap.data.AudioFolder help us work with them in a systematic way. The following sections explain how to get the data and prepare AudioFolders from them.
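For intuition only: this is how Hugging Face's generic AudioFolder is loaded with the datasets library. Our audiocap.data.AudioFolder is a separate class with its own loading logic, but the directory-plus-metadata idea is similar; the path below is a placeholder.

from datasets import load_dataset

# Hugging Face's generic AudioFolder: a directory of audio files, optionally
# accompanied by a metadata CSV. Replace the path with any such directory.
dataset = load_dataset("audiofolder", data_dir="path/to/some/audiofolder")
print(dataset)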
Getting the data
mkdir -p data/clotho_v2.1/audiofolder
Download the data from https://zenodo.org/record/4783391, then extract the CSV files into data/clotho_v2.1 and the audio files into the data/clotho_v2.1/audiofolder folder. Your tree structure should look like this:
audio-captioning/
├── audiocap
│ ...
...
|
├── data
│ └── clotho_v2.1
│ ├── audiofolder
│ │ ├─ development
│ │ ├─ evaluation
│ │ ├─ test
│ │ └─ validation
│ ├── clotho_captions_development.csv
│ ├── clotho_captions_evaluation.csv
│ ├── clotho_captions_validation.csv
│ ├── clotho_metadata_development.csv
│ ├── clotho_metadata_evaluation.csv
│ ├── clotho_metadata_test.csv
│ └── clotho_metadata_validation.csv
...
Creating AudioFolder
Now, prepare the AudioFolder:
python audiocap/prepare_audiofolder.py prepare-clotho-audiofolder data/clotho_v2.1/
This will organize the folder into a format that is easy to load.
To limit the size of a split (such as validation or evaluation), run:
python audiocap/prepare_audiofolder.py limit-clotho-split data/clotho_v2.1/audiofolder/ validation --limit 200
python audiocap/prepare_audiofolder.py limit-clotho-split data/clotho_v2.1/audiofolder/ evaluation --limit 400
This will sample (with a fixed seed) a subset of the desired size and move the remaining examples to the development split.
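Conceptually, the limiting step does something like the following sketch (illustrative only; use the command above, this is just to show the idea):

import pathlib
import random
import shutil

def limit_split(audiofolder: str, split: str, limit: int, seed: int = 42) -> None:
    """Keep a seeded random sample of `limit` files in `split`; move the rest
    to the development split (audio files only, for illustration)."""
    split_dir = pathlib.Path(audiofolder) / split
    files = sorted(split_dir.glob("*"))
    keep = set(random.Random(seed).sample(files, limit))
    for file in files:
        if file not in keep:
            shutil.move(str(file), str(pathlib.Path(audiofolder) / "development" / file.name))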
Getting AudioSet
AudioSet is a large multi-label classification dataset. In our repository, we use information from the AudioSet ontology to construct keyword-based synthetic captions. This makes it possible to pretrain a seq2seq captioning model (like Whisper) on AudioSet using an end-to-end supervised training pipeline.
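For intuition, a minimal sketch of keyword-caption construction from the ontology is shown below. The ontology path and the exact string formatting are assumptions for the sketch; the real conversion happens inside the training pipeline.

import json

# Assumed path to the official AudioSet ontology file -- adjust to where you keep it
with open("data/audioset_full/ontology.json") as f:
    ontology = json.load(f)

# map machine ids ("MIDs") like "/m/019jd" to human-readable label names
mid_to_name = {node["id"]: node["name"] for node in ontology}

def labels_to_keyword_caption(positive_labels: str) -> str:
    """Turn the comma-separated MIDs from an AudioSet csv row into a keyword caption.
    Commas inside a single label name are replaced so that ", " can separate keywords;
    the exact formatting used in our pipeline may differ."""
    names = [mid_to_name[mid].lower().replace(", ", " - ") for mid in positive_labels.split(",")]
    return ", ".join(names)

# e.g. labels_to_keyword_caption(row["positive_labels"])
# -> "boat - water vehicle, motorboat - speedboat, sounds of things, vehicle"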
AudioSet annotations are copied into this repository, but the audio files must be scraped from YouTube.
You can use the scripts/download_audioset.sh script, which uses all CPU cores to download and convert audios based on their YouTube IDs.
Make the script executable
chmod +x ./scripts/download_audioset.sh
Download the audio files
SPLIT='train_unbalanced' # run again with 'train_balanced' or 'eval'
mkdir -p logs/download_audioset
./scripts/download_audioset.sh \
"data/audioset_full/csvs/${SPLIT}.csv" \
"data/audioset_full/audios/${SPLIT}/" 2>&1 \
| tee >( sed 's/.*\r//' > "logs/download_audioset/${SPLIT}.txt" )
(The sed command is there to delete output lines that just update the progress.)
Please note that scraping AudioSet is best-effort only: videos may have been deleted from YouTube.

Next, you should select a subset of AudioSet that suits your needs. AudioSet is heavily imbalanced, with music and speech occurring in a vast majority of examples. In our case, we selected around 130k instances that cover as many of the underrepresented classes as possible. However, before selecting the subset, we prepare AudioCaps, a different dataset we use for pretraining. This prevents leakage between the two datasets, because they have audio files in common.
Getting AudioCaps
AudioCaps is a captioning dataset with many more audio clips than Clotho (but arguably of lower quality).
AudioCaps annotations are also part of this repository. Furthermore, AudioCaps is a subset of AudioSet, so you have all AudioCaps audios prepared once you download AudioSet.
Creating AudioCaps AudioFolder
Run:
python audiocap/prepare_audiofolder.py prepare-audiocaps-audiofolder \
--audiocaps-path data/audiocaps \
--audioset-path data/audioset_full \
--audio-format mp3
This will copy the files from AudioSet and prepare the AudioFolder structure and annotations, dropping records for audios that are listed in the AudioCaps CSVs but whose files are missing (i.e. were unavailable when you scraped AudioSet).
Creating a balanced AudioSet subset
This part is the most intricate. We want, at the same time:
- a diverse subset
- a balanced subset
- a large subset
- no leakage with AudioCaps
This is difficult and has no optimal solution. Balancing a dataset is especially difficult when each example has multiple labels.
This repository contains some utilities that help with the selection. If you want to select your own subset, have a look at notebooks/select_audioset_subset.ipynb. However, the subset we selected is also available in this repository, in data/audioset_small.
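As an illustration of the general idea (not our exact selection procedure), a greedy heuristic that repeatedly picks the example whose labels are currently rarest in the subset could look like this; the function name and data layout are made up for the sketch:

import random
from collections import Counter

def greedy_balanced_subset(examples, target_size, seed=0):
    """Pick `target_size` examples, repeatedly choosing the one whose labels are
    currently rarest in the subset. `examples` is a list of (example_id, label_set)
    pairs, with AudioCaps overlaps already removed. Illustration only."""
    pool = list(examples)
    random.Random(seed).shuffle(pool)       # random tie-breaking
    counts = Counter()                      # label -> occurrences in the chosen subset
    chosen = []
    while pool and len(chosen) < target_size:
        best = min(pool, key=lambda example: sum(counts[label] for label in example[1]))
        pool.remove(best)
        chosen.append(best[0])
        counts.update(best[1])
    return chosen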
Creating AudioSet-small AudioFolder
Run:
python audiocap/prepare_audiofolder.py prepare-audioset-small-audiofolder \
--audioset-small-path data/audioset_small \
--audioset-full-path data/audioset_full \
--audio-format mp3
Congrats. Now you have all three datasets prepared for training.
During training, corrupted audio files (not loadable by librosa) are skipped.
However, if you want to check for corrupted files yourself, you can use the audiocap.data.find_corrupted_audios function.
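If you prefer a quick standalone check, a minimal sketch using librosa directly (not the repository's own function, whose exact signature may differ) could look like this:

import pathlib
import librosa

def find_unloadable(folder: str):
    """Yield paths of audio files that librosa fails to decode."""
    for path in sorted(pathlib.Path(folder).rglob("*.mp3")):
        try:
            librosa.load(str(path), sr=None, duration=1.0)  # decode just the first second
        except Exception:
            yield path

for bad_file in find_unloadable("data/audioset_small/audiofolder"):
    print(bad_file)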
We train in two phases. We pretrain on a mixture of AudioCaps and AudioSet small, and then finetune on Clotho.
We monitor metrics (logged to wandb) on each dataset separately and also log some predictions so that one can inspect the outputs the model generates.
Because pretraining uses the same audio-to-text objective as finetuning, a single configurable training script is enough for both.
AudioSet is originally a classification dataset. During training, we convert the labels on the fly into keyword-based synthetic captions.
CUDA_VISIBLE_DEVICES="..." python \
audiocap/train_whisper_supervised.py \
--checkpoint-dir-root="./checkpoints" \
--audioset-dir="./data/audioset_small/audiofolder" \
--audiocaps-dir="./data/audiocaps/audiofolder" \
--training-config="./configs/pretrain_1on1_large_config.yaml" \
--wandb-group="pretraining"
The --training-config argument is the most important one: it specifies everything about the training run. We experimented with different setups; you can find the corresponding configs inside the configs/ folder.
To run finetuning, use the following command:
CUDA_VISIBLE_DEVICES="..." python \
audiocap/train_whisper_supervised.py \
--checkpoint-dir-root="./checkpoints" \
--clotho-dir="./data/clotho_v2.1/audiofolder" \
--training-config="./configs/finetune_large_config.yaml" \
--load-checkpoint="..." \
--wandb-group="finetuning"
--load-checkpoint is an optional argument that allows initializing the model with weights from a local file.
To train effectively on multiple datasets, we put dataset and task identifiers into the captions.
Example:
- clotho > caption: Fair kind music is being played at the circus grounds.
- audiocaps > caption: The wind is blowing, insects are singing, and rustling occurs
- audioset > keywords: boat - water vehicle, motorboat - speedboat, sounds of things, vehicle
The prefix informs the model about the style of caption to produce. During inference, the prefix is forced into the decoder, which makes the model generate output in the desired style. This is a trick inspired by multilingual generative language models, where a prefix specifies the output language.
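As a rough sketch of how such a prefix can be forced with the transformers generation API (our predict.py handles this internally, and the exact token handling may differ; the checkpoint path is a placeholder):

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

checkpoint = "path/to/checkpoint"   # placeholder: a trained captioning checkpoint
processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

audio, sr = librosa.load("example.wav", sr=16_000)
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features

# Encode the style prefix and force it as the start of the decoder sequence,
# so that generation continues in the "clotho > caption:" style.
prefix_ids = processor.tokenizer("clotho > caption:", add_special_tokens=False).input_ids
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id, *prefix_ids]])

generated = model.generate(input_features, decoder_input_ids=decoder_input_ids, max_length=100)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])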
If you have a trained model, you can run the inference script:
CUDA_VISIBLE_DEVICES="..." python \
audiocap/predict.py \
--checkpoint path/to_checkpoint \
--data path/to/folder/with/audio/files \
--output-file foo.csv \
--config-file configs/predict_config.yaml \
--take-first-n 10 # optional, for debugging purposes
The inference script generates outputs using the model, prints raw outputs to stdout, and writes clean outputs into a CSV file with two columns, file_name and caption_predicted. Raw outputs include invisible tokens such as <|startoftranscript|>, the forced prefix, padding, etc. Clean outputs contain only the content.
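For example, the cleaned predictions can be loaded back for inspection; foo.csv matches the output file from the command above:

import pandas as pd

predictions = pd.read_csv("foo.csv")   # columns: file_name, caption_predicted
print(predictions.head())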
The config file specifies both inference hyperparameters (such as the number of beams) and technical settings such as batch size, floating-point precision, and the number of data-loader processes.
For all the code in this repository, the licence in the LICENSE file applies.
For the files in the data directory, specific licences apply:
- AudioSet labels: CC BY 4.0
- source of data: https://research.google.com/audioset/
- AudioSet ontology: CC BY-SA 4.0
- source of data: https://research.google.com/audioset/
- AudioCaps labels: MIT
- source of data: https://github.com/cdjkim/audiocaps