
MuChoMusic is a benchmark for evaluating music understanding in multimodal audio-language models.


MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models


Benno Weck*1, Ilaria Manco*2,3, Emmanouil Benetos2, Elio Quinton3, George Fazekas2, Dmitry Bogdanov1

1 UPF, 2 QMUL, 3 UMG

* equal contribution

This repository contains code and data for the paper MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models (ISMIR 2024).




The dataset is available to download from Zenodo:

wget -P data https://zenodo.org/record/12709974/files/muchomusic.csv

or via HuggingFace Datasets. You can access it using the 🤗 Datasets library:

from datasets import load_dataset
MuchoMusic = load_dataset("mulab-mir/muchomusic")
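
If you downloaded the CSV from Zenodo instead, a quick way to inspect it is with the standard library. The sketch below is only an illustration: the helper name `read_benchmark_csv` is hypothetical, and no column names are assumed (they are read from the file's header row).

```python
import csv
from pathlib import Path


def read_benchmark_csv(path: str = "data/muchomusic.csv") -> tuple[list[str], list[dict]]:
    """Return (column names, rows) from the downloaded CSV.

    Hypothetical helper, not part of the repository. Column names are
    taken from the header row rather than hard-coded.
    """
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)  # each row is a dict keyed by column name
        columns = list(reader.fieldnames or [])
    return columns, rows
```

Calling `read_benchmark_csv()` after the `wget` command above returns the header and all rows, which is enough to check the download before running any evaluation.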

Code setup

To use this code, we recommend creating a new Python 3 virtual environment:

python -m venv venv 
source venv/bin/activate

Then, clone the repository and install the dependencies:

git clone https://github.com/mulab-mir/muchomusic.git
cd muchomusic
pip install -r requirements.txt

This codebase has been tested with Python 3.11.5.

Code Structure

├── data
│   └── muchomusic.csv
├── dataset_creation                # code to generate and validate the dataset
├── muchomusic_eval                 # evaluation code
│   ├── configs                     # folder to store the config files for evaluation experiments
│   └── ...
├── evaluate.py                     # run file to run the evaluation
└── prepare_prompts.py

Prepare the model outputs for the benchmark

Inputs to the benchmark should be given as a JSON object with the following format:

{
    "id": 415600,
    "prompt": "Question: What rhythm pattern do the digital drums follow? Options: (A) Four on the floor. (B) Off-beat syncopation. (C) Scat singing. (D) E-guitar playing a  simple melody. The correct answer is: ",
    "answers": [
        "Pop music",
        "Latin rock",
        ...
    ],
    "answer_orders": [
        ...
    ],
    "dataset": "sdd",
    "genre": "Reggae",
    "reasoning": [
        "genre and style"
    ],
    "knowledge": [],
    "audio_path": "data/sdd/audio/00/415600.2min.mp3",
    "model_output": "A"
}

To generate this, first run:

python prepare_prompts.py --output_path <path_to_json_file>

Then obtain the model predictions from each (audio, text) pair formed by prompt and the corresponding audio at audio_path, and populate model_output accordingly.
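
The population step could be sketched as follows. This is a minimal illustration, not the repository's own code. It assumes that the file written by prepare_prompts.py is a JSON list of objects in the format above, and `predict` is a hypothetical placeholder for your model's inference call, taking an audio path and a prompt and returning the answer text.

```python
import json
from typing import Callable


def populate_model_outputs(
    json_path: str,
    predict: Callable[[str, str], str],
) -> list[dict]:
    """Fill `model_output` for every entry and write the file back in place.

    Assumption (not confirmed by the README): `json_path` holds a JSON list
    of objects with `prompt` and `audio_path` fields.
    `predict` is a hypothetical callable: (audio_path, prompt) -> answer text.
    """
    with open(json_path, encoding="utf-8") as f:
        entries = json.load(f)

    for entry in entries:
        # One (audio, text) pair per entry, as described above.
        entry["model_output"] = predict(entry["audio_path"], entry["prompt"])

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=4, ensure_ascii=False)
    return entries
```

Passing the inference function as an argument keeps the file handling independent of any particular model, so the same loop can be reused across the models you want to compare.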

Run the evaluation

python evaluate.py --output_dir <path_to_results_dir>

After running the code, the results will be stored in <path_to_results_dir>.


If you use the code in this repo, please consider citing our work:

   title={MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models},
   author={Weck, Benno and Manco, Ilaria and Benetos, Emmanouil and Quinton, Elio and Fazekas, György and Bogdanov, Dmitry},
   booktitle = {Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR)},


This repository is released under the MIT License. Please see the LICENSE file for more details. The dataset is released under the CC BY-SA 4.0 license.


If you have any questions, please get in touch: benno.weck01@estudiant.upf.edu, i.manco@qmul.ac.uk.

If you find a problem when using the code, you can also open an issue.