Unofficial PyTorch source code for the Audio Captioning datasets AudioCaps [1], Clotho [2], MACS [3], and WavCaps [4].
pip install aac-datasets
To check whether the package is installed and which version you have, use the following command:
aac-datasets-info
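Alternatively, you can query the version from Python (a minimal sketch; it assumes the package exposes a standard __version__ attribute):

import aac_datasets

# Print the installed package version.
print(aac_datasets.__version__)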
from aac_datasets import Clotho
dataset = Clotho(root=".", download=True)
item = dataset[0]
audio, captions = item["audio"], item["captions"]
# audio: Tensor of shape (n_channels=1, audio_max_size)
# captions: list of str
from torch.utils.data.dataloader import DataLoader
from aac_datasets import Clotho
from aac_datasets.utils import BasicCollate
dataset = Clotho(root=".", download=True)
dataloader = DataLoader(dataset, batch_size=4, collate_fn=BasicCollate())
for batch in dataloader:
    # batch["audio"]: list of 4 tensors of shape (n_channels, audio_size)
    # batch["captions"]: list of 4 lists of str
    ...
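BasicCollate keeps each batch as a list of variable-length tensors. If you prefer a single zero-padded audio tensor per batch, a custom collate function along these lines could be used (a minimal sketch, not part of the package; pad_audio_collate is a hypothetical helper):

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data.dataloader import DataLoader

from aac_datasets import Clotho


def pad_audio_collate(items):
    # Zero-pad variable-length audio clips to the longest one in the batch.
    audios = [item["audio"].transpose(0, 1) for item in items]  # each (time, n_channels)
    padded = pad_sequence(audios, batch_first=True)             # (batch, max_time, n_channels)
    return {
        "audio": padded.transpose(1, 2),                        # (batch, n_channels, max_time)
        "captions": [item["captions"] for item in items],
    }


dataset = Clotho(root=".", subset="dev", download=True)
dataloader = DataLoader(dataset, batch_size=4, collate_fn=pad_audio_collate)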
To download a dataset, you can use the download argument in the dataset constructor:
dataset = Clotho(root=".", subset="dev", download=True)
However, if you want to download datasets from the command line, you can also use the following command:
aac-datasets-download --root "." clotho --subsets "dev"
Here are the statistics for each dataset:
| Dataset | Sampling rate (kHz) | Estimated size (GB) | Source | Subsets |
|---|---|---|---|---|
| AudioCaps | 32 | 43 | AudioSet | train, val, test, train_v2 |
| Clotho | 44.1 | 53 | Freesound | dev, val, eval, dcase_aac_test, dcase_aac_analysis, dcase_t2a_audio, dcase_t2a_captions |
| MACS | 48 | 13 | TAU Urban Acoustic Scenes 2019 | full |
| WavCaps | 32 | 941 | AudioSet, BBC Sound Effects, FreeSound, SoundBible | as, as_noac, bbc, fsd, fsd_nocl, sb |
For Clotho, the dev subset should be used for training, val for validation and eval for testing.
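For example, the three Clotho splits can be instantiated as follows (a minimal sketch reusing the constructor arguments shown above):

from aac_datasets import Clotho

# Clotho naming: "dev" is the training split, "val" the validation split, "eval" the test split.
train_dataset = Clotho(root=".", subset="dev", download=True)
val_dataset = Clotho(root=".", subset="val", download=True)
test_dataset = Clotho(root=".", subset="eval", download=True)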
Here are additional statistics on the training subsets of AudioCaps, Clotho, MACS, and WavCaps:
| | AudioCaps/train | Clotho/dev | MACS/full | WavCaps/full |
|---|---|---|---|---|
| Nb audios | 49,838 | 3,840 | 3,930 | 403,050 |
| Total audio duration (h) | 136.6¹ | 24.0 | 10.9 | 7563.3 |
| Audio duration range (s) | 0.5-10 | 15-30 | 10 | 1-67,109 |
| Nb captions per audio | 1 | 5 | 2-5 | 1 |
| Nb captions | 49,838 | 19,195 | 17,275 | 403,050 |
| Total nb words² | 402,482 | 217,362 | 160,006 | 3,161,823 |
| Sentence size² | 2-52 | 8-20 | 5-40 | 2-38 |
| Vocabulary² | 4724 | 4369 | 2721 | 24600 |
¹ This duration is estimated from the 46,230 of the 49,838 files that could be retrieved, which have a total duration of 126.7 h.
² Captions are cleaned (lowercased, punctuation removed) and tokenized with the spaCy tokenizer to count the words.
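As an illustration, the word counts above could be reproduced with a cleaning and tokenization step along these lines (a minimal sketch assuming spaCy and its blank English pipeline; the exact preprocessing used for the table may differ):

import re
import spacy

# Blank English pipeline: tokenizer only, no pretrained model download required.
nlp = spacy.blank("en")

def count_words(caption: str) -> int:
    # Lowercase and strip punctuation, then tokenize with the spaCy tokenizer.
    cleaned = re.sub(r"[^\w\s]", "", caption.lower())
    return len([token for token in nlp(cleaned) if not token.is_space])

print(count_words("A man speaks, then a dog barks twice."))  # 8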
This package has been developed on Ubuntu 20.04 and is expected to work on most Linux-based distributions.
The following Python requirements are installed automatically when the package is installed with pip:
torch >= 1.10.1
torchaudio >= 0.10.1
py7zr >= 0.17.2
pyyaml >= 6.0
tqdm >= 4.64.0
huggingface-hub >= 0.15.1
numpy >= 1.21.2
The external requirements needed to download AudioCaps are ffmpeg and yt-dlp.
On Ubuntu, ffmpeg can be installed with sudo apt install ffmpeg,
and yt-dlp can be installed from its official repository.
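Before starting a download, you can verify that both tools are reachable from your PATH (a minimal sketch using only the Python standard library; not part of the package):

import shutil

# Check that the external tools required to download AudioCaps are on the PATH.
for tool in ("ffmpeg", "yt-dlp"):
    path = shutil.which(tool)
    print(f"{tool}: {path if path is not None else 'NOT FOUND'}")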
You can also override their paths for AudioCaps:
from aac_datasets import AudioCaps

dataset = AudioCaps(
    download=True,
    ffmpeg_path="/my/path/to/ffmpeg",
    ytdl_path="/my/path/to/ytdlp",
)
If you want to use the audiocaps-download 1.0 package to download AudioCaps, you will have to respect the AudioCaps folder tree:
from audiocaps_download import Downloader
root = "your/path/to/root"
downloader = Downloader(root_path=f"{root}/AUDIOCAPS/audio_32000Hz/", n_jobs=16)
downloader.download(format="wav")
Then disable the audio download and set the correct audio format before initializing AudioCaps:
from aac_datasets import AudioCaps

dataset = AudioCaps(
    root=root,
    subset="train",
    download=True,
    audio_format="wav",
    download_audio=False,  # this will only download labels and metadata files
)
[1] C. D. Kim, B. Kim, H. Lee, and G. Kim, “AudioCaps: Generating Captions for Audios in the Wild,” in NAACL-HLT, 2019. Available: https://aclanthology.org/N19-1011/
[2] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An Audio Captioning Dataset,” arXiv:1910.09387 [cs, eess], Oct. 2019. Available: http://arxiv.org/abs/1910.09387
[3] F. Font, A. Mesaros, D. P. W. Ellis, E. Fonseca, M. Fuentes, and B. Elizalde, Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021). Barcelona, Spain: Music Technology Group - Universitat Pompeu Fabra, Nov. 2021. Available: https://doi.org/10.5281/zenodo.5770113
[4] X. Mei et al., “WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research,” arXiv preprint arXiv:2303.17395, 2023. Available: https://arxiv.org/pdf/2303.17395.pdf
If you use this software, please consider citing it as "Labbé, E. (2024). aac-datasets: Audio Captioning datasets for PyTorch.", or use the following BibTeX entry:
@software{Labbe_aac_datasets_2024,
    author = {Labbé, Etienne},
    license = {MIT},
    month = {01},
    title = {{aac-datasets}},
    url = {https://github.com/Labbeti/aac-datasets/},
    version = {0.5.0},
    year = {2024}
}
Maintainer:
- Etienne Labbé "Labbeti": labbeti.pub@gmail.com