The official implementation of "Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training" (accepted by ACM MM 2024 as an oral presentation, top 3.97% of all submissions).
Our implementation is based on the retrieval part of WavCaps. Like WavCaps, MGA-CLAP is lightweight and can be reproduced on a single RTX 3090 with 24 GB of memory. The environment can be set up by following WavCaps.
We provide a well-trained model checkpoint, which can be accessed through Google Drive (https://drive.google.com/file/d/1RWTuVMEPy-L0uK6WYIX2wwxHjD1YSQFz/view?usp=drive_link). Download it and put it in pretrained_models/models.
We provide an example in example.py that shows how to extract frame features and frame-caption correspondence; remember to modify the checkpoint path in settings/inference_example.yaml. The key code is listed as follows:
# get fine-grained word_level embeddings
_, word_embeds, attn_mask = model.encode_text(classes)
# aggregate word_level embeddings to sentence_level by shared codebook
text_embeds = model.msc(word_embeds, model.codebook, attn_mask)
# get fine-grained frame_level embeddings
_, frame_embeds = model.encode_audio(audio_time_series.unsqueeze(0))
# aggregate frame_level embeddings to clip_level by shared codebook
audio_embeds = model.msc(frame_embeds, model.codebook)
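Building on the snippet above, a frame-caption correspondence map can then be obtained by comparing the frame-level audio embeddings with the sentence-level text embeddings. The following is a minimal sketch using cosine similarity; the tensor shapes in the comments are assumptions, and this is illustrative rather than the exact code in example.py.
import torch.nn.functional as F
# normalize so that dot products become cosine similarities
frame_embeds_n = F.normalize(frame_embeds, dim=-1)   # assumed shape (1, num_frames, dim)
text_embeds_n = F.normalize(text_embeds, dim=-1)     # assumed shape (num_classes, dim)
# frame-caption correspondence: similarity of every frame to every caption
frame_caption_sim = frame_embeds_n.squeeze(0) @ text_embeds_n.t()   # (num_frames, num_classes)
# the most relevant caption for each frame, e.g. for visualization
best_caption_per_frame = frame_caption_sim.argmax(dim=-1)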
We show several evaluation examples of fine-grained and coarse-grained tasks:
- Sound event detection: see zero_shot_sed.py, which provides the inference code for the audioset_strong_eval dataset; remember to modify the data and checkpoint paths in settings/inference_sed.yaml.
- Text-to-audio grounding: see zero_shot_grounding.py, which provides the inference code for the audio_grounding dataset; remember to modify the data and checkpoint paths in settings/inference_grounding.yaml.
- Audio classification: see zero_shot_clas.py, which provides the inference code for the esc-50, urbansound8k and vggsound datasets; remember to modify the data and checkpoint paths in settings/inference_cls.yaml. A simplified sketch of this task is given after this list.
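For intuition, here is a minimal zero-shot classification sketch built from the example.py snippet above; it is not the exact pipeline in zero_shot_clas.py. The class prompts are placeholders, and model and audio_time_series are assumed to be loaded as in example.py.
import torch.nn.functional as F
# hypothetical class prompts; the real label lists come from esc-50 / urbansound8k / vggsound
classes = ["a dog barking", "rain falling", "a siren wailing"]
# sentence-level text embeddings via the shared codebook
_, word_embeds, attn_mask = model.encode_text(classes)
text_embeds = model.msc(word_embeds, model.codebook, attn_mask)      # assumed shape (num_classes, dim)
# clip-level audio embedding via the shared codebook
_, frame_embeds = model.encode_audio(audio_time_series.unsqueeze(0))
audio_embeds = model.msc(frame_embeds, model.codebook)               # assumed shape (1, dim)
# rank class prompts by cosine similarity and pick the best one
sim = F.normalize(audio_embeds, dim=-1) @ F.normalize(text_embeds, dim=-1).t()
predicted_class = classes[sim.argmax(dim=-1).item()]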
To train MGA-CLAP yourself:
- Prepare the WavCaps dataset following WavCaps.
- Run the code: the training settings are given in settings/pretrain.yaml. Simply run
python pretrain.py
Highlights of our novel designs:
- Modality-shared codebook: can be found in models/ase_model.py
- Locality-aware block: can be found in models/hts_at.py
- Hard-negative guided contrastive loss: can be found in tools/losses.py
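For intuition only, below is a generic sketch of a hard-negative-weighted contrastive (InfoNCE-style) loss, where negatives that score higher against the anchor receive larger weights. This is an assumption about the general idea, not the exact formulation in tools/losses.py; the function name, the beta parameter and the weighting scheme are illustrative.
import torch
import torch.nn.functional as F

def hard_negative_weighted_contrastive(audio_embeds, text_embeds, temperature=0.07, beta=1.0):
    # audio_embeds, text_embeds: (batch, dim), assumed L2-normalized
    logits = audio_embeds @ text_embeds.t() / temperature        # pairwise similarities (batch, batch)
    batch = logits.size(0)
    mask = torch.eye(batch, dtype=torch.bool, device=logits.device)
    # up-weight harder (more similar) negatives; positives keep weight 1
    with torch.no_grad():
        neg_weights = torch.softmax(beta * logits.masked_fill(mask, float('-inf')), dim=-1)
        neg_weights = neg_weights * (batch - 1)                  # keep the average negative weight at 1
    weights = torch.where(mask, torch.ones_like(logits), neg_weights)
    # weighted InfoNCE in both audio-to-text and text-to-audio directions
    exp_logits_a2t = weights * logits.exp()
    loss_a2t = -(logits.diag() - exp_logits_a2t.sum(dim=-1).log()).mean()
    exp_logits_t2a = weights.t() * logits.t().exp()
    loss_t2a = -(logits.diag() - exp_logits_t2a.sum(dim=-1).log()).mean()
    return 0.5 * (loss_a2t + loss_t2a)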
If you want to cite this paper:
Li Y, Guo Z, Wang X, et al. Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training. arXiv preprint arXiv:2408.07919, 2024.