
Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model


This repository contains the code and dataset accompanying the paper "Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model" by Dr. Jaeyong Kang, Prof. Soujanya Poria, and Prof. Dorien Herremans.


We propose Video2Music, a multimodal music generation framework that uses video features as conditioning input to a Transformer architecture in order to generate matching music. The system aims to give video creators an efficient way to produce tailor-made background music.

Directory Structure

  • saved_models/: saved model files
  • utilities/
    • run_model_vevo.py: code for running model (AMT)
    • run_model_regression.py: code for running model (bi-GRU)
  • model/
    • video_music_transformer.py: Affective Multimodal Transformer (AMT) model
    • video_regression.py: Bi-GRU regression model used for predicting note density/loudness
    • positional_encoding.py: positional encoding code
    • rpr.py: code for RPR (Relative Positional Representation)
  • dataset/
    • vevo_dataset.py: Dataset loader
  • script/: code for extracting video/music features (semantic, motion, emotion, scene offset, loudness, and note density)
  • train.py: training script (AMT)
  • train_regression.py: training script (bi-GRU)
  • evaluate.py: evaluation script
  • generate.py: inference script
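
As a quick sanity check after cloning, the layout above can be verified with a short script. This is a minimal sketch and not part of the repository: the helper name missing_paths is hypothetical, and the path list is simply copied from the directory listing above.

```python
from pathlib import Path

# Paths taken from the directory listing above; directories end with "/".
EXPECTED_PATHS = [
    "saved_models/",
    "utilities/run_model_vevo.py",
    "utilities/run_model_regression.py",
    "model/video_music_transformer.py",
    "model/video_regression.py",
    "model/positional_encoding.py",
    "model/rpr.py",
    "dataset/vevo_dataset.py",
    "script/",
    "train.py",
    "train_regression.py",
    "evaluate.py",
    "generate.py",
]


def missing_paths(repo_root="."):
    """Return the expected paths that are absent under repo_root (hypothetical helper)."""
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_PATHS:
        target = root / rel
        # Entries ending in "/" must be directories; the rest must be files.
        ok = target.is_dir() if rel.endswith("/") else target.is_file()
        if not ok:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_paths():
        print(f"missing: {rel}")
```

Run from the repository root, the script prints nothing when every listed file and directory is present.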


  • Clone this repo

  • Obtain the dataset:

    • MuVi-Sync (features) (Link)
    • MuVi-Sync (original video) (Link)
  • Place all directories in the dataset whose names start with vevo under dataset/

  • Download the processed training data AMT.zip from HERE, extract it, and put the two extracted files directly under saved_models/AMT/

  • Install dependencies: pip install -r requirements.txt

    • Our code was built with PyTorch 1.12.1 and Python 3.7.15 (requirements.txt pins torch==1.13.1). You may need to choose a torch version that matches your CUDA version.
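
The manual steps above (placing the vevo directories and unpacking AMT.zip) can also be scripted. The following is a sketch under stated assumptions: the helper names extract_checkpoint and find_vevo_dirs are hypothetical, AMT.zip is assumed to have been downloaded already, and the archive's internal layout is not documented here, so the extraction flattens any folder nesting to place the files directly under saved_models/AMT/.

```python
import shutil
import zipfile
from pathlib import Path


def extract_checkpoint(zip_path, repo_root="."):
    """Unpack an archive so its files sit directly under saved_models/AMT/.

    Hypothetical helper: any directory nesting inside the archive is
    flattened, because the archive's internal layout is an assumption.
    """
    target = Path(repo_root) / "saved_models" / "AMT"
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Keep only the file name so nothing lands in a subfolder.
            destination = target / Path(info.filename).name
            with archive.open(info) as src, open(destination, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(destination)
    return extracted


def find_vevo_dirs(dataset_dir="dataset"):
    """List the directories under dataset_dir whose names start with 'vevo'."""
    root = Path(dataset_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and p.name.startswith("vevo")
    )
```

After calling extract_checkpoint("AMT.zip") from the repository root, the returned list should contain the two files mentioned above, and find_vevo_dirs() should list the feature directories copied from MuVi-Sync.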


To train the AMT model, run:

python train.py


To run inference:

python generate.py
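
To run training and inference back to back, the two commands can be chained from Python. This is only a sketch: the helpers run_script and train_then_generate are hypothetical, both scripts are invoked with no arguments exactly as shown above, and no command-line options are assumed.

```python
import subprocess
import sys
from pathlib import Path


def run_script(script_name, repo_root="."):
    """Run a repository script with the current interpreter (hypothetical helper).

    Raises subprocess.CalledProcessError if the script exits with a non-zero code.
    """
    root = Path(repo_root)
    script = root / script_name
    if not script.is_file():
        raise FileNotFoundError(f"{script} not found")
    # cwd is the repository root so relative paths such as dataset/ resolve.
    subprocess.run([sys.executable, script_name], cwd=root, check=True)


def train_then_generate(repo_root="."):
    """Train first, then generate; generation is skipped if training fails."""
    run_script("train.py", repo_root)
    run_script("generate.py", repo_root)
```

Because check=True raises on a non-zero exit code, a failed training run stops the chain before generate.py is started.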


If you find this resource useful, please cite the original work:

  title={Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model},
  author={Kang, Jaeyong and Poria, Soujanya and Herremans, Dorien},
  journal={arXiv preprint arXiv:2311.00968},

Kang, J., Poria, S. & Herremans, D. (2023). Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model. arXiv preprint arXiv:2311.00968.


Our code is based on Music Transformer.