MIDI Dataset
The goal of this project is to match and align a very large collection of MIDI files to a very large collection of audio files so that the MIDI data can be used to infer ground truth information about the audio. This repository also contains code for reproducing most of the results in [1], which describes the goals, ideas, and research behind this project in much greater detail.
Notes
- If you're looking for a high-level overview of the techniques used in this project and the results, take a look at chapter 1 of my thesis [1].
- This repository contains code for performing the matching; if you're looking for the "Lakh MIDI Dataset" itself (the result of using this code to match a collection of 178,561 MIDI files to the Million Song Dataset), you can find that here.
- If you just want a tutorial on potential uses of the Lakh MIDI dataset, take a look at the `Tutorial.ipynb` notebook.
- Over time, this project has undergone some restructuring; if you're looking for the version of this repository used in the experiments in [2], check this tag.
Prerequisites
Before utilizing the code in this repository, you need to gather some data and software.
Data
Create a folder called `data` in the root of this repository. In it, you need the following subdirectories:
- `clean_midi`, which should contain the "clean MIDI subset", as described in section 5.2.1 of [1]. These MIDI files should live in `data/clean_midi/mid`. You can obtain this collection here.
- `unique_midi`, which should contain the 176,581 files of the Lakh MIDI dataset (aka `LMD-full`). These MIDI files should live in `data/unique_midi/mid`. You can obtain this collection here.
- `uspop2002`, `cal10k`, `cal500`, and `msd`, which should each contain the audio files from the respective dataset (`msd` being the 7digital preview clips corresponding to the Million Song Dataset). The MP3 files should live in, e.g., `data/uspop2002/mp3`. Unfortunately, obtaining these MP3 files is non-trivial. If you need help tracking them down, please contact me directly.
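As a quick sanity check of the layout described above, a small script along these lines can report which folders are still missing. It is only a sketch and not part of the repository: the function names are made up for illustration, and it assumes that every audio dataset uses an `mp3` subfolder in the same way as the `data/uspop2002/mp3` example.

```python
import os

# Dataset folder names come from the list above. Using "mp3" for every audio
# dataset is an assumption generalized from the data/uspop2002/mp3 example.
MIDI_DATASETS = ["clean_midi", "unique_midi"]
AUDIO_DATASETS = ["uspop2002", "cal10k", "cal500", "msd"]


def expected_directories(data_root="data"):
    """Return the directories that the layout above calls for."""
    midi = [os.path.join(data_root, name, "mid") for name in MIDI_DATASETS]
    audio = [os.path.join(data_root, name, "mp3") for name in AUDIO_DATASETS]
    return midi + audio


def missing_directories(data_root="data"):
    """Return the expected directories that do not exist yet."""
    return [path for path in expected_directories(data_root)
            if not os.path.isdir(path)]


if __name__ == "__main__":
    # Run from the root of the repository.
    for path in missing_directories():
        print("missing: " + path)
```

An empty result only means the folders exist; it says nothing about whether they contain the right files.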
File lists
All of the datasets in the `data` subdirectory (except for `unique_midi`) should have a corresponding file list in the `file_lists` subdirectory. The only one which is not included in this repository is `msd.txt`; you can obtain it from the MSD directly (it's distributed with the MSD as `unique_tracks.txt`), or you can download it here and rename it to `msd.txt`.
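For orientation, the MSD's `unique_tracks.txt` stores one track per line as four fields (track ID, song ID, artist name, song title) separated by the literal string `<SEP>`. The sketch below reads such a file into dictionaries. The function names are hypothetical, the repository's own scripts may parse the file differently, and the other file lists are not assumed to share this format.

```python
import io


def parse_track_line(line):
    """Split one '<SEP>'-delimited line into its four fields."""
    # maxsplit=3 keeps any further '<SEP>' occurrences inside the title.
    track_id, song_id, artist, title = line.rstrip("\r\n").split("<SEP>", 3)
    return {"track_id": track_id, "song_id": song_id,
            "artist": artist, "title": title}


def load_file_list(path):
    """Read a file list such as file_lists/msd.txt, skipping blank lines."""
    with io.open(path, encoding="utf-8") as f:
        return [parse_track_line(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Placeholder line for illustration only; not a real MSD entry.
    print(parse_track_line("TRACKID<SEP>SONGID<SEP>Some Artist<SEP>Some Title"))
```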
Software
All of the code in this repository is written for Python 2.7; it will likely need modification to work with Python 3.x. Here is a potentially incomplete list of the Python libraries used in this project:
numpy
scipy
librosa
pretty_midi
whoosh
joblib
deepdish
dhs
pse
msgpack
msgpack_numpy
lasagne
theano
sklearn
djitw
simple_spearmint
spearmint
Hardware
All of this code was designed to be run on a server with 64 GB of RAM, 12 CPU cores, an NVIDIA GTX 980 Ti GPU, and plenty of hard drive space. If your own setup has fewer resources, you may need to modify some of the scripts in various places so that they use an appropriate amount of RAM, number of parallel processes, etc. In any case, please note that running all of the experiments and steps from beginning to end will take at least a few weeks of compute time.
Process
The general structure of this repository is as follows: collections of shared utilities (`experiment_utils.py`, `feature_extraction.py`, `whoosh_search.py`) live at the base level; one-time-use scripts for assembling data and performing the actual MIDI-to-audio matching live in the `scripts` directory; and experiments for evaluating the effectiveness of different matching techniques live in `experiments`. Any data or results generated by running these files are written out to a `results` directory. To re-run all of the experiments, matching, etc., proceed as described below.
- Run `create_whoosh_indices.py`. This uses the file lists to create Whoosh indices, which allow for fuzzy text matching of metadata. We use this fuzzy text matching to create training data for the different matching algorithms. The indices are written out to, e.g., `data/msd/index/`.
- Run `text_match_datasets.py`. This uses the Whoosh indices to match MIDI files from `clean_midi` (which ostensibly have reliable metadata) to entries in the different audio datasets. It also takes care to group audio files which are recordings of the same song. The results are written to `results/text_matches.js`.
- Run `create_msd_cqts.py`. This pre-computes constant-Q spectrograms for every entry in the Million Song Dataset, which saves time later on because they are needed in several later steps. They are written to `data/msd/h5`.
- Run `align_text_matches.py`. This uses dynamic time warping (specifically the approach proposed in [3]) to align each MIDI-audio pair found by metadata matching. The results are written to `results/clean_midi_aligned`, and include both the aligned MIDI files in `results/clean_midi_aligned/mid` and "diagnostics files" in `results/clean_midi_aligned/h5`. The diagnostics files contain information about whether each match is truly a match (an incorrect match can be caused, e.g., by incorrect metadata or a bad transcription).
- Run `split_training_data.py`. This splits the matches into train, validation, development, and test collections, which are used for evaluating each of the different matching approaches implemented in `experiments`.
- Run `create_training_data.py`. This inspects the results of `align_text_matches.py` to find good matches and generates training data for the different matching approaches in a convenient format. It essentially produces saved constant-Q spectrograms of audio files, aligned MIDI files, unaligned MIDI files, and aligned MIDI piano rolls, in various folders in `results`.
- Run the experiments! Each subdirectory in the `experiments` directory corresponds to a different MIDI-audio matching technique. Each of these experiments contains at least a script called `match_msd.py`, which uses the matching technique to match each MIDI file in either the development or test set to the MSD and writes out the results. Most of the experiments have a script called `precompute.py`, which precomputes any necessary features or representations of entries in the development and test sets. Finally, those experiments which are based on machine learning techniques also have a script called `parameter_search.py`, which trains any models necessary for performing the matching. In short, to run each of these experiments, run `parameter_search.py` if it exists, run `precompute.py`, and finally run `match_msd.py`. The results can be used to measure the effectiveness of each approach. There isn't a script which performs this analysis automatically, but there is a great deal of analysis in my thesis [1].
- To actually match the `unique_midi` collection to the Million Song Dataset, use the `match.py` script. For flexibility, this script takes a few command-line arguments: first, a glob matching the MIDI files you want to match, and second, a path to where the results should be written. To match the entire `unique_midi` dataset to the MSD, call it like so: `python match.py ../data/unique_midi/mid/*/\*.mid output_path`. This will produce (in `output_path`) one file for each MIDI file processed, which lists potential matches in the MSD and the corresponding confidence scores.
- To assemble a collection of matched-and-aligned MIDI files, use the script `assemble_aligned_matches.py`. This will find all MIDI-audio matches produced by `match.py` which have a sufficiently high confidence score, re-align them, and write out the aligned MIDI file, along with the unaligned MIDI, MP3 file, and MSD H5, for convenience. In essence, this is how, at long last, each component of the Lakh MIDI dataset is produced.
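The alignment steps above rely on dynamic time warping. For readers unfamiliar with it, here is a minimal textbook version in pure Python. It is not the tuned approach of [3] and not the code this repository uses; it only illustrates the basic idea of finding the cheapest monotonic path through a matrix of pairwise frame distances. The toy inputs are plain numbers standing in for feature frames.

```python
def dtw(distances):
    """Align two sequences given their pairwise distance matrix.

    distances[i][j] is the distance between frame i of the first sequence
    and frame j of the second. Returns (total_cost, path), where path is a
    list of (i, j) pairs from (0, 0) to the last frame of both sequences.
    """
    n, m = len(distances), len(distances[0])
    inf = float("inf")
    # cost[i][j] is the cheapest cumulative cost of any path ending at (i, j).
    cost = [[inf] * m for _ in range(n)]
    cost[0][0] = distances[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_previous = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = distances[i][j] + best_previous
    # Walk back from the end, always stepping to the cheapest predecessor.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return cost[n - 1][m - 1], path


if __name__ == "__main__":
    # Toy stand-ins for MIDI and audio feature frames.
    midi_frames = [1, 2, 3]
    audio_frames = [1, 2, 2, 3]
    distances = [[abs(a - b) for b in audio_frames] for a in midi_frames]
    total_cost, path = dtw(distances)
    print(total_cost, path)  # 0 [(0, 0), (1, 1), (1, 2), (2, 3)]
```

In the toy example the second sequence repeats one frame, and the path absorbs the repetition by pairing frame 1 of the first sequence with frames 1 and 2 of the second.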
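The run order given above for each experiment (`parameter_search.py` if it exists, then `precompute.py`, then `match_msd.py`) can be captured in a small helper. This is a hypothetical convenience sketch, not something shipped with the repository. It assumes each script is launched from inside its own experiment folder and that the `python` on your path is the Python 2.7 interpreter the code requires.

```python
import os
import subprocess

# Script names and their order are taken from the description above.
EXPERIMENT_SCRIPTS = ["parameter_search.py", "precompute.py", "match_msd.py"]


def scripts_to_run(experiment_dir):
    """Return the scripts that exist in experiment_dir, in run order."""
    return [name for name in EXPERIMENT_SCRIPTS
            if os.path.isfile(os.path.join(experiment_dir, name))]


def run_experiment(experiment_dir, python="python"):
    """Run each available script in order, stopping at the first failure."""
    for name in scripts_to_run(experiment_dir):
        # Assumption: scripts expect their own folder as working directory.
        subprocess.check_call([python, name], cwd=experiment_dir)
```

Calling `scripts_to_run` first is a cheap way to see what would be executed before committing to a long-running job.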
References
- [1] Colin Raffel. "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching". PhD Thesis, 2016.
- [2] Colin Raffel and Daniel P. W. Ellis. "Large-Scale Content-Based Matching of MIDI and Audio Files". Proceedings of the 16th International Society for Music Information Retrieval Conference, 2015.
- [3] Colin Raffel and Daniel P. W. Ellis. "Optimizing DTW-Based Audio-to-MIDI Alignment and Matching". Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
- [4] Colin Raffel and Daniel P. W. Ellis. "Pruning Subsequence Search with Attention-Based Embedding". Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.