Unofficial PyTorch source code for the audio captioning datasets AudioCaps [1], Clotho [2], MACS [3], and WavCaps [4].
pip install aac-datasets
To check that the package is installed and to see its version, run:
aac-datasets-info
from aac_datasets import Clotho
dataset = Clotho(root=".", download=True)
item = dataset[0]
audio, captions = item["audio"], item["captions"]
# audio: Tensor of shape (n_channels=1, audio_max_size)
# captions: list of str
from torch.utils.data.dataloader import DataLoader
from aac_datasets import Clotho
from aac_datasets.utils import BasicCollate
dataset = Clotho(root=".", download=True)
dataloader = DataLoader(dataset, batch_size=4, collate_fn=BasicCollate())
for batch in dataloader:
    # batch["audio"]: list of 4 tensors of shape (n_channels, audio_size)
    # batch["captions"]: list of 4 lists of str
    ...
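The comments above suggest that each batch field stays a plain list rather than a stacked tensor, since audio lengths differ between items. A minimal pure-Python sketch of that layout follows. `collate_as_lists` is a hypothetical name and this is not the implementation of `BasicCollate`; plain lists stand in for audio tensors.

```python
from typing import Any, Dict, List


def collate_as_lists(items: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of item dicts into a dict of lists, without padding or stacking."""
    if not items:
        return {}
    # Every item is assumed to expose the same keys as the first one.
    keys = items[0].keys()
    return {key: [item[key] for item in items] for key in keys}


# Plain lists stand in for audio tensors of different lengths (illustrative data).
items = [
    {"audio": [0.0, 0.1, 0.2], "captions": ["a dog barks", "barking dog"]},
    {"audio": [0.3, 0.4], "captions": ["rain falls on a roof"]},
]
batch = collate_as_lists(items)
print(len(batch["audio"]), len(batch["captions"]))
```

Keeping lists avoids choosing a padding strategy inside the collate function; padding or cropping can then be applied later, where the model needs a fixed shape.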
Here are the statistics for each dataset:
| | AudioCaps | Clotho | MACS | WavCaps |
|---|---|---|---|---|
Subsets | train, val, test | dev, val, eval, dcase_aac_test, dcase_aac_analysis, dcase_t2a_audio, dcase_t2a_captions | full | as, as_noac, bbc, fsd, fsd_nocl, sb |
Sample rate (kHz) | 32 | 44.1 | 48 | 32 |
Estimated size (GB) | 43 | 53 | 13 | 941 |
Audio source | AudioSet | FreeSound | TAU Urban Acoustic Scenes 2019 | AudioSet, BBC Sound Effects, FreeSound, SoundBible |
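
Given the estimated sizes in the table, it can be worth checking the free disk space before starting a download. The sketch below uses only the standard library. The lowercase dataset keys and the `has_enough_space` helper are hypothetical, and 1 GB is assumed to mean 10^9 bytes.

```python
import shutil

# Estimated sizes in GB, copied from the table above.
ESTIMATED_SIZE_GB = {"audiocaps": 43, "clotho": 53, "macs": 13, "wavcaps": 941}


def has_enough_space(dataset: str, root: str = ".") -> bool:
    """Return True if the free space under `root` covers the estimated dataset size."""
    # Assumption: 1 GB = 1e9 bytes.
    free_gb = shutil.disk_usage(root).free / 1e9
    return free_gb >= ESTIMATED_SIZE_GB[dataset]


for name in ESTIMATED_SIZE_GB:
    print(name, has_enough_space(name))
```
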
For Clotho, the dev subset should be used for training, val for validation and eval for testing.
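Because the Clotho subset names differ from the usual train/val/test naming, a small lookup can make scripts less error-prone. This is only a sketch: `clotho_subset` and the role names are hypothetical helpers, not part of the package.

```python
# Conventional role -> Clotho subset name, as stated above.
CLOTHO_SUBSET_FOR_ROLE = {"train": "dev", "val": "val", "test": "eval"}


def clotho_subset(role: str) -> str:
    """Return the Clotho subset name to use for a given role."""
    try:
        return CLOTHO_SUBSET_FOR_ROLE[role]
    except KeyError:
        expected = sorted(CLOTHO_SUBSET_FOR_ROLE)
        raise ValueError(f"Unknown role {role!r}, expected one of {expected}") from None


print(clotho_subset("train"))  # dev
```

The returned name can then be passed as the subset argument, for example Clotho(root=".", subset=clotho_subset("train")).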
Here are the statistics of the train subsets of the AudioCaps, Clotho and MACS datasets:
| | AudioCaps/train | Clotho/dev | MACS/full |
|---|---|---|---|
Nb audios | 49,838 | 3,840 | 3,930 |
Total audio duration (h) | 136.6¹ | 24.0 | 10.9 |
Audio duration range (s) | 0.5-10 | 15-30 | 10 |
Nb captions per audio | 1 | 5 | 2-5 |
Nb captions | 49,838 | 19,195 | 17,275 |
Total nb words² | 402,482 | 217,362 | 160,006 |
Sentence size² | 2-52 | 8-20 | 5-40 |
¹ This duration is estimated from the 126.7 h total duration of 46,230 of the 49,838 files.
² The sentences are cleaned (lowercased, punctuation removed) and tokenized with the spaCy tokenizer to count the words.
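
The two footnotes can be sketched in a few lines. Both helpers are hypothetical and rest on assumptions: the duration estimate is assumed to be a proportional scaling of the measured duration to the full file count (which reproduces the 136.6 h figure in the table), and a whitespace split stands in for the spaCy tokenizer, so word counts may differ slightly from the table.

```python
import string


def extrapolate_duration_h(measured_h: float, n_measured: int, n_total: int) -> float:
    """Scale a duration measured on part of the files to the full file count."""
    return measured_h * n_total / n_measured


def count_words(caption: str) -> int:
    """Lowercase, strip ASCII punctuation, then count whitespace-separated words."""
    cleaned = caption.lower().translate(str.maketrans("", "", string.punctuation))
    # Assumption: whitespace splitting replaces the spaCy tokenizer here.
    return len(cleaned.split())


print(round(extrapolate_duration_h(126.7, 46230, 49838), 1))  # 136.6
print(count_words("A dog barks, then a car passes by."))  # 8
```
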
This package has been developed for Ubuntu 20.04 and is expected to work on most Linux distributions.
Python requirements are automatically installed when using pip on this repository.
torch >= 1.10.1
torchaudio >= 0.10.1
py7zr >= 0.17.2
pyyaml >= 6.0
tqdm >= 4.64.0
huggingface-hub >= 0.15.1
numpy >= 1.21.2
The external requirements needed to download AudioCaps are ffmpeg and youtube-dl (yt-dlp should work too).
These two programs can be installed on Ubuntu using:
sudo apt install ffmpeg youtube-dl
You can also override their paths for AudioCaps:
from aac_datasets import AudioCaps
dataset = AudioCaps(
    download=True,
    ffmpeg_path="/my/path/to/ffmpeg",
    ytdl_path="/my/path/to/youtube_dl",
)
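Before downloading AudioCaps, it may help to check which of these external programs are visible on the PATH. The sketch below uses only the standard library; `find_executables` is a hypothetical helper, not part of the package.

```python
import shutil
from typing import Dict, Optional, Sequence


def find_executables(
    names: Sequence[str] = ("ffmpeg", "youtube-dl", "yt-dlp"),
) -> Dict[str, Optional[str]]:
    """Map each program name to its resolved path, or None when it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_executables().items():
    print(f"{name}: {path or 'not found'}")
```

A path found this way can be passed as ffmpeg_path or ytdl_path, as in the example above.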
To download a dataset, you can use the download argument of the dataset constructor:
dataset = Clotho(root=".", subset="dev", download=True)
Alternatively, to download datasets from a shell script, you can use the following command:
aac-datasets-download --root "." clotho --subsets "dev"
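The same command can also be launched from Python through the standard subprocess module. This sketch reuses only the argument forms shown in the command above and assumes nothing about other options; `build_download_command` and `download` are hypothetical helpers.

```python
import subprocess
from typing import List


def build_download_command(root: str, dataset: str, subset: str) -> List[str]:
    """Build the argument list for the aac-datasets-download command shown above."""
    return ["aac-datasets-download", "--root", root, dataset, "--subsets", subset]


def download(root: str, dataset: str, subset: str) -> None:
    """Run the download command and raise CalledProcessError if it fails."""
    subprocess.run(build_download_command(root, dataset, subset), check=True)


# Only the command is printed here; call download(...) to actually run it.
print(build_download_command(".", "clotho", "dev"))
```

Passing the arguments as a list avoids shell quoting issues when the root path contains spaces.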
[1] C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in NAACL-HLT, 2019. Available: https://aclanthology.org/N19-1011/
[2] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An Audio Captioning Dataset,” arXiv:1910.09387 [cs, eess], Oct. 2019, Available: http://arxiv.org/abs/1910.09387
[3] F. Font, A. Mesaros, D. P. W. Ellis, E. Fonseca, M. Fuentes, and B. Elizalde, Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021). Barcelona, Spain: Music Technology Group - Universitat Pompeu Fabra, Nov. 2021. Available: https://doi.org/10.5281/zenodo.5770113
[4] X. Mei et al., “WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research,” arXiv preprint arXiv:2303.17395, 2023. Available: https://arxiv.org/pdf/2303.17395.pdf
If you use this software, please consider citing it as below:
@software{Labbe_aac_datasets_2022,
    author = {Labbé, Etienne},
    license = {MIT},
    month = {09},
    title = {{aac-datasets}},
    url = {https://github.com/Labbeti/aac-datasets/},
    version = {0.4.0},
    year = {2023}
}
Maintainer:
- Etienne Labbé "Labbeti": labbeti.pub@gmail.com