MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models, CVPR 2024

Resources

🌐 Webpage | 🗂️ Datasets

💡Proposed Architecture

(Figure: overview of the proposed MeLFusion architecture)

🛠️ Environment Preparation


Create a conda environment with Python 3.11 and activate it:

conda create --name melfusion_env python=3.11
conda activate melfusion_env
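
Optionally, confirm the environment is active:

python --version   # should report Python 3.11.x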

Clone this repository, change into the corresponding folder, and execute the following commands:

pip install -r requirements.txt
cd diffusers
pip install -e .
cd audioldm
wget https://huggingface.co/haoheliu/AudioLDM-S-Full/resolve/main/audioldm-s-full
mv audioldm-s-full audioldm-s-full.ckpt

cd ../..
pip install -r requirements2.txt
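
As a quick, optional sanity check that the editable diffusers install is being picked up (the exact version string depends on the pinned fork in this repo):

python -c "import diffusers; print(diffusers.__version__)"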

Install the required system utilities and initialize Git LFS:

sudo apt-get install lsof
sudo apt install git-lfs
git lfs install

Go to the cache directory and download the AudioLDM checkpoint:
cd ~/.cache   
mkdir audioldm
cd audioldm
wget https://huggingface.co/haoheliu/AudioLDM-S-Full/resolve/main/audioldm-s-full
mv audioldm-s-full audioldm-s-full.ckpt
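
Optionally, verify that the checkpoint downloaded fully and loads (this check assumes PyTorch was installed via the requirements files above):

ls -lh ~/.cache/audioldm/audioldm-s-full.ckpt
python -c "import torch, os; sd = torch.load(os.path.expanduser('~/.cache/audioldm/audioldm-s-full.ckpt'), map_location='cpu'); print(type(sd))"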
Install tmux (useful for keeping long training or inference runs alive):

sudo apt-get install tmux

🔥 To run training:

bash train_mmgen.sh
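
Training can take a long time, so a common pattern is to launch it inside the tmux session installed earlier (the session name below is arbitrary):

tmux new -s melfusion_train
bash train_mmgen.sh

Detach with Ctrl-b d and reattach later with tmux attach -t melfusion_train.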

💊 To run inference:

bash inference_mmgen.sh
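
Check inference_mmgen.sh for the exact output path of the generated audio; assuming it writes .wav files somewhere under the repository, a quick way to locate fresh outputs is:

find . -name "*.wav" -newermt "-10 minutes"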

📉 Main Results:

(Figure: main results)

🙏 Acknowledgements

The codebase for this work is built on the Tango and AudioLDM repositories. We thank the respective authors for their contributions.

🎓 Citing MeLFusion

@InProceedings{Chowdhury_2024_CVPR,
    author    = {Chowdhury, Sanjoy and Nag, Sayan and Joseph, K J and Srinivasan, Balaji Vasan and Manocha, Dinesh},
    title     = {MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {26826-26835}
}