PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

Primary language: Python. License: MIT.

Recent copyright infringement lawsuits against leading music generation companies have sent shockwaves throughout the AI-Music community, highlighting the need for copyright-free training data. Meanwhile, MIDI, the most prevalent format for symbolic music processing, is well suited to modeling sequences of notes but omits much of the additional musical information present in sheet music, which the MusicXML format retains. To address these gaps, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores. We also introduce MusicRender, an extension of the Python library MusPy's universal Music object, designed specifically to handle MusicXML. The dataset and further details are available for download on Zenodo.


Installation

To use the functionality introduced here, clone the latest version of this repository. Then install the dependencies into a Conda environment (named my_env in this example) with conda env update -n my_env --file environment.yml, and activate that environment.

TL;DR

git clone https://github.com/pnlong/PDMX.git
conda env update -n my_env --file PDMX/environment.yml
conda activate my_env

Important Methods

We provide several tools for working with both the PDMX dataset and MusicXML-like files.

MusicRender

We introduce MusicRender, an extension of MusPy's universal Music object, that can hold musical performance directives through its annotations field.

from pdmx import MusicRender

Let's say music is a MusicRender object. We can save music to a JSON or YAML file at the location path:

music.save(path = path)

Alternatively, we can use write() with a path that ends in .json or .yaml. The benefit of write() is that it also supports other output formats: the output type is inferred from the file extension of path (for example, .wav produces audio and .midi produces symbolic output).

music.write(path = path)
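To make the idea of extension-based dispatch concrete, here is a minimal standalone sketch. It is not the implementation used by write(): the helper infer_output_kind and the OUTPUT_KINDS table are hypothetical names introduced for illustration, and the table only covers the four extensions mentioned above.

```python
from pathlib import Path

# Hypothetical mapping for illustration only. It mirrors the extensions named
# in the text above, not necessarily the full set that write() supports.
OUTPUT_KINDS = {
    ".json": "serialized",
    ".yaml": "serialized",
    ".wav": "audio",
    ".midi": "symbolic",
}


def infer_output_kind(path):
    """Return the output kind implied by the file extension of `path`."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    if suffix not in OUTPUT_KINDS:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    return OUTPUT_KINDS[suffix]


print(infer_output_kind("score.wav"))   # audio
print(infer_output_kind("score.MIDI"))  # symbolic
```

Dispatching on the extension keeps the call site uniform: the same write(path = path) call produces a different kind of file depending only on the name it is given.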

When writing to audio or symbolic formats, performance directives (e.g. dynamics, tempo markings) are realized to their fullest extent. This functionality should not be confused with the music.realize_expressive_features() method, which realizes the directives inside a MusicRender object. Do not call this method explicitly before writing: write() already calls it implicitly, so any directives would be applied twice.
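The following toy model illustrates why applying directives twice is a problem. It does not use MusicRender or reproduce its realization logic: apply_dynamic and the 1.5 scaling factor are hypothetical stand-ins for "realizing" a forte marking by scaling note velocities.

```python
def apply_dynamic(velocities, scale):
    """Return new velocities scaled by `scale`, clipped to the MIDI range 1..127."""
    return [max(1, min(127, round(v * scale))) for v in velocities]


written = [64, 64, 64]  # velocities as notated, before any directive is realized
forte_scale = 1.5       # hypothetical scaling for a forte marking

once = apply_dynamic(written, forte_scale)  # the intended result
twice = apply_dynamic(once, forte_scale)    # the same directive applied again

print(once)   # [96, 96, 96]
print(twice)  # [127, 127, 127]
```

In this sketch the second application pushes every note to the maximum velocity, which is the kind of distortion an explicit realization followed by write() could cause.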

load()

We store PDMX as MusicRender objects serialized to JSON (see the write() and save() methods above). We can load these objects back into Python with the load() function, which returns a MusicRender object given the path to a JSON or YAML file.

from pdmx import load
music = load(path = path)
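Because the files are plain JSON, they can also be inspected with the standard library alone, for instance to look at the top-level structure before loading a full object. The sketch below is only an illustration: top_level_keys is a hypothetical helper, and the stand-in document uses placeholder field names rather than the actual PDMX schema.

```python
import json
import tempfile
from pathlib import Path


def top_level_keys(path):
    """Return the sorted top-level keys of a JSON file containing an object."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return sorted(data)


# Stand-in document. The field names are placeholders for illustration,
# not the actual PDMX schema.
stand_in = {"metadata": {"title": "Example"}, "tracks": [], "annotations": []}

with tempfile.TemporaryDirectory() as tmp:
    example_path = Path(tmp) / "example.json"
    example_path.write_text(json.dumps(stand_in), encoding="utf-8")
    keys = top_level_keys(example_path)

print(keys)  # ['annotations', 'metadata', 'tracks']
```

For anything beyond a quick look, load() remains the appropriate entry point, since it reconstructs the MusicRender object rather than a raw dictionary.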

read_musescore()

PDMX was created by scraping the public domain content of MuseScore, an online score-sharing platform on which users can upload their own sheet music arrangements in a MusicXML-like format. MusPy alone lacked the ability to fully parse these files. Our read_musescore() function fills this gap: it returns a MusicRender object given the path to a MuseScore file.

from pdmx import read_musescore
music = read_musescore(path = path)
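When working with a folder of scores, it can help to collect the MuseScore files first and then pass each path to read_musescore(). The sketch below covers only the collection step using the standard library. find_musescore_files is a hypothetical helper, and it assumes the files use MuseScore's usual .mscz or .mscx extensions.

```python
import tempfile
from pathlib import Path

# Assumption: scores use MuseScore's usual extensions (.mscz compressed, .mscx uncompressed).
MUSESCORE_SUFFIXES = {".mscz", ".mscx"}


def find_musescore_files(root):
    """Return a sorted list of MuseScore files found anywhere under `root`."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in MUSESCORE_SUFFIXES
    )


# Demonstration on a throwaway directory containing empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    for name in ("a.mscz", "nested/b.mscx", "notes.txt"):
        (root / name).touch()
    found = [p.relative_to(root).as_posix() for p in find_musescore_files(root)]

print(found)  # ['a.mscz', 'nested/b.mscx']
```

Each returned path could then be handed to read_musescore(path = path) as shown above.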

Citing & Authors

If you find this repository helpful, feel free to cite our publication PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing:

@article{long2024pdmx,
    title={{PDMX}: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing},
    author={Long, Phillip and Novack, Zachary and Berg-Kirkpatrick, Taylor and McAuley, Julian},
    journal={arXiv:2409.10831},
    year={2024},
}