The official implementation of "Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training" (accepted by ACM MM 2024 as an oral presentation, top 3.97% of all submissions).
Our implementation is based on the retrieval part of WavCaps. Like WavCaps, MGA-CLAP is lightweight and can be reproduced on a single RTX 3090 with 24 GB of memory. The environment can be set up by following WavCaps.
We provide a well-trained model checkpoint, which can be accessed through Google Drive (https://drive.google.com/file/d/1RWTuVMEPy-L0uK6WYIX2wwxHjD1YSQFz/view?usp=drive_link). Download it and put it in pretrained_models/models.
We provide an example in example.py that shows how to extract frame features and frame-caption correspondence; remember to modify the checkpoint path in settings/inference_example.yaml. The key code is listed as follows:
# get fine-grained word_level embeddings
_, word_embeds, attn_mask = model.encode_text(classes)
# aggregate word_level embeddings to sentence_level by shared codebook
text_embeds = model.msc(word_embeds, model.codebook, attn_mask)
# get fine-grained frame_level embeddings
_, frame_embeds = model.encode_audio(audio_time_series.unsqueeze(0))
# aggregate frame_level embeddings to clip_level by shared codebook
audio_embeds = model.msc(frame_embeds, model.codebook)
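Building on the snippet above, a frame-caption correspondence map can then be obtained by comparing the frame-level audio embeddings with the sentence-level text embeddings. The following is a minimal sketch using cosine similarity; the tensor shapes in the comments are assumptions, and this is illustrative rather than the exact code in example.py.
import torch.nn.functional as F
# normalize so that dot products become cosine similarities
frame_embeds_n = F.normalize(frame_embeds, dim=-1)   # assumed shape (1, num_frames, dim)
text_embeds_n = F.normalize(text_embeds, dim=-1)     # assumed shape (num_classes, dim)
# frame-caption correspondence: similarity of every frame to every caption
frame_caption_sim = frame_embeds_n.squeeze(0) @ text_embeds_n.t()   # (num_frames, num_classes)
# the most relevant caption for each frame, e.g. for visualization
best_caption_per_frame = frame_caption_sim.argmax(dim=-1)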
We show several evaluation examples of fine-grained and coarse-grained tasks:
- Sound event detection: see zero_shot_sed.py, which provides the inference code for the audioset_strong_eval dataset; remember to modify the data and checkpoint paths in settings/inference_sed.yaml.
- Text-to-audio grounding: see zero_shot_grounding.py, which provides the inference code for the audio_grounding dataset; remember to modify the data and checkpoint paths in settings/inference_grounding.yaml.
- Audio classification: see zero_shot_clas.py, which provides the inference code for the esc-50, urbansound8k and vggsound datasets; remember to modify the data and checkpoint paths in settings/inference_cls.yaml. A simplified sketch of this task is given after this list.
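For intuition, here is a minimal zero-shot classification sketch built from the example.py snippet above; it is not the exact pipeline in zero_shot_clas.py. The class prompts are placeholders, and model and audio_time_series are assumed to be loaded as in example.py.
import torch.nn.functional as F
# hypothetical class prompts; the real label lists come from esc-50 / urbansound8k / vggsound
classes = ["a dog barking", "rain falling", "a siren wailing"]
# sentence-level text embeddings via the shared codebook
_, word_embeds, attn_mask = model.encode_text(classes)
text_embeds = model.msc(word_embeds, model.codebook, attn_mask)      # assumed shape (num_classes, dim)
# clip-level audio embedding via the shared codebook
_, frame_embeds = model.encode_audio(audio_time_series.unsqueeze(0))
audio_embeds = model.msc(frame_embeds, model.codebook)               # assumed shape (1, dim)
# rank class prompts by cosine similarity and pick the best one
sim = F.normalize(audio_embeds, dim=-1) @ F.normalize(text_embeds, dim=-1).t()
predicted_class = classes[sim.argmax(dim=-1).item()]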
To train MGA-CLAP yourself:
- Prepare the WavCaps dataset following WavCaps.
- Run the code: the training settings are given in settings/pretrain.yaml. Simply run
python pretrain.py
Highlights of our novel designs:
- Modality-shared codebook: can be found in models/ase_model.py
- Locality-aware block: can be found in models/hts_at.py
- Hard-negative guided contrastive loss: can be found in tools/losses.py
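For intuition only, below is a generic sketch of a hard-negative-weighted contrastive (InfoNCE-style) loss, where negatives that score higher against the anchor receive larger weights. This is an assumption about the general idea, not the exact formulation in tools/losses.py; the function name, the beta parameter and the weighting scheme are illustrative.
import torch
import torch.nn.functional as F

def hard_negative_weighted_contrastive(audio_embeds, text_embeds, temperature=0.07, beta=1.0):
    # audio_embeds, text_embeds: (batch, dim), assumed L2-normalized
    logits = audio_embeds @ text_embeds.t() / temperature        # pairwise similarities (batch, batch)
    batch = logits.size(0)
    mask = torch.eye(batch, dtype=torch.bool, device=logits.device)
    # up-weight harder (more similar) negatives; positives keep weight 1
    with torch.no_grad():
        neg_weights = torch.softmax(beta * logits.masked_fill(mask, float('-inf')), dim=-1)
        neg_weights = neg_weights * (batch - 1)                  # keep the average negative weight at 1
    weights = torch.where(mask, torch.ones_like(logits), neg_weights)
    # weighted InfoNCE in both audio-to-text and text-to-audio directions
    exp_logits_a2t = weights * logits.exp()
    loss_a2t = -(logits.diag() - exp_logits_a2t.sum(dim=-1).log()).mean()
    exp_logits_t2a = weights.t() * logits.t().exp()
    loss_t2a = -(logits.diag() - exp_logits_t2a.sum(dim=-1).log()).mean()
    return 0.5 * (loss_a2t + loss_t2a)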
If you want to cite this paper:
Li Y, Guo Z, Wang X, et al. Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training. arXiv preprint arXiv:2408.07919, 2024.