This is the official implementation of the paper "MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training".
The MERT training is implemented with fairseq. You need to clone the fairseq repo inside our repo at ./src/fairseq and place the MERT implementation code there as a fairseq example project.
The training of MERT requires:
- fairseq & pytorch for the training (must)
- nnAudio for on-the-fly CQT inference (must)
- apex for half-precision training (optional)
- nccl for multiple device training (optional)
- fairscale for FSDP and CPU offloading (optional)
You can use the script ./scripts/environment_setup.sh
to set up the Python environment from scratch; it can easily be adapted into a Dockerfile.
All the relevant folders will be placed under the customized MERT repo folder path $MAP_PROJ_DIR.
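The expected layout can be sketched as follows (the `$MAP_PROJ_DIR` variable comes from this README; the exact folder value below is only an illustrative assumption):

```shell
# Point MAP_PROJ_DIR at your customized MERT repo folder path (example value).
export MAP_PROJ_DIR="$PWD/MERT"

# fairseq is expected to be cloned inside the repo at src/fairseq.
mkdir -p "$MAP_PROJ_DIR/src"
echo "clone fairseq into: $MAP_PROJ_DIR/src/fairseq"
```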
For details of the data preparation and format, refer to HuBERT.
Generally, there are 2 things you need to prepare:
- DATA_DIR=${MAP_PROJ_DIR}/data/audio_tsv: a folder that contains a train.tsv and a valid.tsv file, each specifying the root path to the audio files on the first line and their relative paths on the remaining lines.
- LABEL_ROOT_DIR=${MAP_PROJ_DIR}/data/labels: a folder filled with all the discrete tokens that need to be prepared before training. They can be K-means or RVQ-VAE tokens.
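A HuBERT-style tsv manifest as described above can be generated with a short stdlib-only script (a minimal sketch; the `.wav`-only filter and the helper name `write_manifest` are assumptions for illustration):

```python
import os
import wave

def write_manifest(audio_root, tsv_path):
    """Write a HuBERT-style manifest: first line is the absolute audio root,
    each following line is `relative/path.wav<TAB>num_samples`."""
    with open(tsv_path, "w") as f:
        f.write(os.path.abspath(audio_root) + "\n")
        for dirpath, _, files in os.walk(audio_root):
            for name in sorted(files):
                if not name.endswith(".wav"):  # assumption: wav-only corpus
                    continue
                full = os.path.join(dirpath, name)
                with wave.open(full, "rb") as w:
                    n_samples = w.getnframes()
                rel = os.path.relpath(full, audio_root)
                f.write(f"{rel}\t{n_samples}\n")
```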
The two options for acoustic teacher pseudo-labels in MERT training can be constructed by:
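While the repo's own label-preparation scripts are not reproduced here, the K-means option can be sketched with plain NumPy (a toy sketch; the feature dimensionality, cluster count, and function name below are illustrative assumptions, not MERT's actual settings):

```python
import numpy as np

def kmeans_labels(features, n_clusters=8, n_iters=20, seed=0):
    """Assign each acoustic feature frame a discrete cluster id (pseudo-label)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen frames.
    centroids = features[rng.choice(len(features), n_clusters, replace=False)]
    for _ in range(n_iters):
        # Nearest-centroid assignment for every frame.
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        for k in range(n_clusters):
            if (labels == k).any():
                centroids[k] = features[labels == k].mean(axis=0)
    return labels

# Dummy acoustic features: 100 frames of 39-dim vectors (stand-in for real features).
feats = np.random.default_rng(1).normal(size=(100, 39))
labels = kmeans_labels(feats)
```

In the real pipeline these frame-level cluster ids (or RVQ-VAE codebook indices) are dumped to LABEL_ROOT_DIR before training.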
Note that we follow the fairseq development protocol and put our code in as an example project.
When running the fairseq program, you can point to the MERT customized code with common.user_dir=${MAP_PROJ_DIR}/mert_fairseq.
After the environment is set up, you can use the following scripts:

```shell
# for MERT 95M
bash scripts/run_training.sh 0 dummy MERT_RVQ-VAE_CQT_95M
# for MERT 330M
bash scripts/run_training.sh 0 dummy MERT_RVQ-VAE_CQT_330M
```
We use the Hugging Face models for inference and evaluation. Taking the RVQ-VAE 95M MERT as an example, the following script shows how to load the model and extract representations with MERT:

```shell
python MERT/scripts/MERT_demo_inference.py
```
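The demo script handles the actual loading (via Hugging Face Transformers, e.g. `AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)`). A common downstream step is to time-pool the per-layer hidden states into fixed-size representations; a sketch on dummy arrays matching the 95M model's output shape (13 hidden-state layers × T frames × 768 dims — the frame count and uniform layer weights below are illustrative assumptions):

```python
import numpy as np

# Dummy stand-in for outputs.hidden_states of the 95M model:
# 13 layers (embedding output + 12 transformer blocks), T frames, 768 dims.
hidden_states = np.random.default_rng(0).normal(size=(13, 749, 768))

# Time-pool each layer into one 768-dim vector per layer.
time_pooled = hidden_states.mean(axis=1)            # shape: (13, 768)

# Combine the layers, e.g. with uniform (or task-learned) weights.
layer_weights = np.full(13, 1 / 13)
utterance_rep = (layer_weights[:, None] * time_pooled).sum(axis=0)  # (768,)
```

Different downstream tasks often benefit from different layers, which is why all hidden states are kept rather than only the last one.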
Our Hugging Face Transformers checkpoints for convenient inference are uploaded to the m-a-p project page.
- MERT-v0: The base (95M) model trained with K-means acoustic teacher and musical teacher.
- MERT-v0-public: The base (95M) model trained with K-means acoustic teacher and musical teacher using the public music4all training data.
- MERT-v1-95M: The base (95M) model trained with RVQ-VAE acoustic teacher and musical teacher.
- MERT-v1-330M: The large (330M) model trained with RVQ-VAE acoustic teacher and musical teacher.
We also provide the corresponding fairseq checkpoints for continual training or further modification. Coming soon.
```bibtex
@misc{li2023mert,
      title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training},
      author={Yizhi Li and Ruibin Yuan and Ge Zhang and Yinghao Ma and Xingran Chen and Hanzhi Yin and Chenghua Lin and Anton Ragni and Emmanouil Benetos and Norbert Gyenge and Roger Dannenberg and Ruibo Liu and Wenhu Chen and Gus Xia and Yemin Shi and Wenhao Huang and Yike Guo and Jie Fu},
      year={2023},
      eprint={2306.00107},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```