MGA-CLAP

The official implementation of "Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training" (accepted by ACM MM 2024 as an oral presentation, top 3.97% of all submissions).

License: MIT

Environment

Our implementation builds on the retrieval part of WavCaps. Like WavCaps, MGA-CLAP is lightweight and can be reproduced on a single RTX 3090 with 24 GB of memory. The environment can be set up by following the WavCaps instructions.

Example

We provide a well-trained model checkpoint, which can be downloaded from Google Drive (https://drive.google.com/file/d/1RWTuVMEPy-L0uK6WYIX2wwxHjD1YSQFz/view?usp=drive_link). Download it and place it in pretrained_models/models.

example.py shows how to extract frame features and frame-caption correspondence; remember to modify the checkpoint path in settings/inference_example.yaml. The key code is listed as follows:

# get fine-grained word_level embeddings
_, word_embeds, attn_mask = model.encode_text(classes) 
# aggregate word_level embeddings to sentence_level by shared codebook
text_embeds = model.msc(word_embeds, model.codebook, attn_mask) 
# get fine-grained frame_level embeddings
_, frame_embeds = model.encode_audio(audio_time_series.unsqueeze(0)) 
# aggregate frame_level embeddings to clip_level by shared codebook
audio_embeds = model.msc(frame_embeds, model.codebook) 
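
Given these embeddings, clip-caption and frame-caption similarities can be computed directly. The snippet below is a minimal follow-up sketch, not taken from example.py; it assumes text_embeds has shape (N, D), audio_embeds (1, D), and frame_embeds (1, T, D), and normalizes them explicitly.

import torch.nn.functional as F

# normalize so that dot products become cosine similarities
text_embeds = F.normalize(text_embeds, dim=-1)
audio_embeds = F.normalize(audio_embeds, dim=-1)
frame_embeds = F.normalize(frame_embeds, dim=-1)

# coarse-grained clip-caption similarity, shape (1, N)
clip_sim = audio_embeds @ text_embeds.t()
# fine-grained frame-caption correspondence, shape (1, T, N)
frame_sim = frame_embeds @ text_embeds.t()
# best-matching caption for the whole clip and for every frame
print(clip_sim.argmax(dim=-1), frame_sim.squeeze(0).argmax(dim=-1))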

Evaluation

Evaluation uses the same well-trained checkpoint model.pt from Google Drive as above; download it and place it in pretrained_models/models.

We show several evaluation examples of fine-grained and coarse-grained tasks:

  • Sound event detection

    See zero_shot_sed.py; it provides the inference code for the audioset_strong_eval dataset. Remember to modify the data and checkpoint paths in settings/inference_sed.yaml.

  • Text-to-audio grounding

    See zero_shot_grounding.py; it provides the inference code for the audio_grounding dataset. Remember to modify the data and checkpoint paths in settings/inference_grounding.yaml.

  • Audio classification

    See zero_shot_clas.py; it provides the inference code for the ESC-50, UrbanSound8K and VGGSound datasets. Remember to modify the data and checkpoint paths in settings/inference_cls.yaml. A minimal zero-shot classification sketch is shown after this list.
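
For reference, a minimal zero-shot classification loop might look like the following sketch. It is not taken from zero_shot_clas.py: the class names, prompt template, audio file path, and 32 kHz sampling rate are illustrative assumptions built on the encode_text / encode_audio / msc calls from the Example section.

import torch
import torch.nn.functional as F
import librosa

# hypothetical class names and prompt template (illustrative only)
class_names = ["dog bark", "rain", "siren"]
prompts = [f"this is a sound of {c}" for c in class_names]

with torch.no_grad():
    # sentence-level text embeddings via the shared codebook (as in example.py)
    _, word_embeds, attn_mask = model.encode_text(prompts)
    text_embeds = F.normalize(model.msc(word_embeds, model.codebook, attn_mask), dim=-1)

    # clip-level audio embedding for a single waveform (32 kHz assumed)
    wav, _ = librosa.load("sample.wav", sr=32000)
    _, frame_embeds = model.encode_audio(torch.tensor(wav).unsqueeze(0))
    audio_embeds = F.normalize(model.msc(frame_embeds, model.codebook), dim=-1)

    # predict the class whose prompt is most similar to the clip
    pred = (audio_embeds @ text_embeds.t()).argmax(dim=-1).item()
    print(class_names[pred])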

Training

  • Prepare the WavCaps dataset following the instructions in the WavCaps repository.

  • Run the code

    The training settings are given in settings/pretrain.yaml. Simply start training with

    python pretrain.py
    
  • Highlights of our novel designs

    • Modality-shared codebook

      can be found in models/ase_model.py

    • Locality-aware block

      can be found in models/hts_at.py

    • Hard-negative guided contrastive loss

      can be found in tools/losses.py (a simplified sketch of the idea follows this list)
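
The function below is a simplified, hypothetical sketch of hard-negative weighting in an audio-text contrastive objective, written only to convey the idea; the weighting scheme and the temperature and beta values are illustrative, and the actual implementation in tools/losses.py may differ.

import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(audio_embeds, text_embeds, temperature=0.07, beta=0.5):
    """Illustrative InfoNCE-style loss whose negative terms are re-weighted so
    that harder (more similar) negatives contribute more."""
    a = F.normalize(audio_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = a @ t.t() / temperature                      # (B, B), diagonal = positives

    loss = 0.0
    for sim in (logits, logits.t()):                      # audio->text and text->audio
        pos = torch.diag(sim)
        # larger weight for negatives that are more similar to the anchor
        weights = torch.softmax(beta * sim, dim=1).detach()
        weights = weights * (1.0 - torch.eye(sim.size(0), device=sim.device))
        neg = (weights * sim.exp()).sum(dim=1)
        loss = loss + torch.log(1.0 + neg / pos.exp()).mean()
    return loss / 2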

Citation

If you find this work useful, please cite:

Li Y, Guo Z, Wang X, et al. Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training[J]. arXiv preprint arXiv:2408.07919, 2024.