PBSCSR

Piano Bootleg Score Composer Style Recognition - Dataset & Baselines

This repository supports the task of recognizing the composer of a piece from its bootleg score, a representation that contains only the notehead positions on the staff. It contains labeled data, unlabeled data for pretraining, and the code needed to reproduce the baseline results.
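
For orientation, here is a minimal sketch of inspecting the labeled data. The pickle filename, the split keys, and the fragment layout below are assumptions for illustration only; the baseline notebooks document the actual format.

```python
# Minimal sketch of loading the labeled data. The filename ("9_way_dataset.pkl")
# and the dictionary layout are hypothetical; see the baseline notebooks for
# the real format.
import pickle

with open("9_way_dataset.pkl", "rb") as f:   # hypothetical filename
    data = pickle.load(f)

X_train, y_train = data["train"]             # hypothetical split key
print(len(X_train), "training fragments")
print(X_train[0].shape)                      # staff positions x columns (assumed layout)
print(y_train[0])                            # composer label
```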

See below for instructions on data replication.

GPT2

  1. Activate baselines virtual environment: conda activate baselines
  2. Run the preprocessing notebook located at /home/username/ttmp/PBSCSR/baselines/LM_pretraining_data_preprocessing.ipynb
  3. Run the pretraining notebook located at /home/username/ttmp/PBSCSR/baselines/01_gpt2_pretraining.ipynb to create the bash script for pretraining
  4. Run the pretraining bash script located at /home/username/ttmp/PBSCSR_data/pretrained_model/pretrain_lm.sh
  5. Run the GPT2 linear probing and fine-tuning notebook, gpt2_LP_and_FT.ipynb (a sketch of the linear-probing step appears after this list)
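
A minimal sketch of the linear-probing half of the last step: freeze a pretrained GPT-2, pool its hidden states into one feature vector per fragment, and fit a linear classifier. The checkpoint name, dummy inputs, and labels below are placeholders, not the notebook's actual tokenization or data.

```python
# Linear-probing sketch (placeholders throughout): extract frozen GPT-2
# features and fit a logistic regression on top of them.
import torch
from transformers import GPT2Model
from sklearn.linear_model import LogisticRegression

model = GPT2Model.from_pretrained("gpt2")  # replace with your pretrained checkpoint path
model.eval()

def embed(input_ids):
    # Mean-pool the final hidden states into a fixed-size feature vector.
    with torch.no_grad():
        hidden = model(input_ids=input_ids).last_hidden_state
    return hidden.mean(dim=1).numpy()

# Dummy tokenized fragments and labels, standing in for the real dataset.
train_ids = torch.randint(0, model.config.vocab_size, (16, 64))
test_ids = torch.randint(0, model.config.vocab_size, (4, 64))
y_train = torch.randint(0, 9, (16,)).numpy()
y_test = torch.randint(0, 9, (4,)).numpy()

clf = LogisticRegression(max_iter=1000).fit(embed(train_ids), y_train)
print("test accuracy:", clf.score(embed(test_ids), y_test))
```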

CNN

  1. Activate baselines virtual environment: conda activate baselines
  2. Git clone the PBSCSR repo into ttmp
  3. The CNN Jupyter notebook is located at /home/username/ttmp/PBSCSR/baselines/CNN/simple_CNN.ipynb (a rough architecture sketch appears after this list)
  4. Specific data replication instructions for 9_way_dataset and 100_way_dataset are in simple_CNN.ipynb
  5. The CNN baseline does not require language model pretraining (no need to run LM_pretraining_data_preprocessing.ipynb)
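
For reference, here is a rough sketch of what a small CNN over bootleg-score fragments can look like. The input shape (1 channel, 62 staff positions, 64 columns) and the layer sizes are illustrative assumptions, not the exact architecture in simple_CNN.ipynb.

```python
# Illustrative CNN over binary bootleg-score fragments; shapes are assumptions.
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, n_classes=9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 15 * 16, 128), nn.ReLU(),  # 62x64 input -> 15x16 after pooling
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SimpleCNN(n_classes=9)
dummy = torch.zeros(8, 1, 62, 64)   # batch of 8 dummy fragments
print(model(dummy).shape)           # torch.Size([8, 9])
```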

RoBERTa

  1. Activate baselines virtual environment: conda activate baselines

  2. Git clone the PBSCSR repo into ttmp (skip this step if you have already done it)

  3. (Optional) Run data_creation.ipynb. This notebook clones imslp_bootleg_dir-v1 and filters out the filler pages, generating imslp_bootleg_dir-v1.1. This gives you a copy of both versions of the IMSLP bootleg score data.

  4. The PBSCSR repo already includes imslp_bootleg_dir-v1.1, so you can simply point to this directory.

  5. Run LM_pretraining_data_preprocessing.ipynb, which is located at /PBSCSR/baselines/LM_pretraining_data_preprocessing.ipynb. It does not matter which version of the IMSLP bootleg data you point to in this step, because the notebook applies the same filler filter as in step 3 to generate imslp_bootleg_dir-v1.1.

  6. Run 01_roberta_pretraining.ipynb and stop after finishing the Language Model Pretraining section.

  7. Before continuing to the Language Model Pretraining Curves section, run the bash script train_lm.sh in the output directory you specified when running Language Model Pretraining.

  8. Run the bash script in a persistent shell session (such as tmux or screen) with the baselines environment activated. This process may take about 4-5 hours.

  9. After train_lm.sh finishes, you can run the Language Model Pretraining Curves section.

  10. You should see that the training curve (black) is similar to the validation curve (green); a sketch for plotting these curves appears after this list.

  11. Run roberta/roberta_LP_and_FT.ipynb
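
Steps 9-10 inspect the pretraining curves. Here is a sketch of how such curves can be plotted, under the assumption that train_lm.sh runs the Hugging Face Trainer, which writes a trainer_state.json containing a log_history of train and eval losses; the output path is a placeholder.

```python
# Plot train (black) vs. validation (green) loss curves from the Hugging Face
# Trainer's state file. Path is a placeholder for your output directory.
import json
import matplotlib.pyplot as plt

with open("/path/to/output_dir/trainer_state.json") as f:  # hypothetical path
    history = json.load(f)["log_history"]

train = [(h["step"], h["loss"]) for h in history if "loss" in h]
evals = [(h["step"], h["eval_loss"]) for h in history if "eval_loss" in h]

plt.plot(*zip(*train), color="black", label="train")
plt.plot(*zip(*evals), color="green", label="validation")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()
```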

Few-Shot

  1. Activate baselines virtual environment: conda activate baselines
  2. Run the embeddings notebook located at /home/username/ttmp/PBSCSR/baselines/fewshot_embeddings.ipynb
  3. Run the experiment notebook located at /home/username/ttmp/PBSCSR/baselines/fewshot_experiment.ipynb (a sketch of one common few-shot evaluation recipe appears after this list)
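
As a reference for the experiment notebook, here is a minimal sketch of one common few-shot recipe, nearest-prototype classification over precomputed embeddings. The random arrays stand in for the output of fewshot_embeddings.ipynb, and the episode sizes are illustrative; this is not necessarily the exact method the notebook uses.

```python
# Nearest-prototype few-shot classification sketch over dummy embeddings.
import numpy as np

rng = np.random.default_rng(0)
n_way, k_shot, dim = 9, 5, 256                    # illustrative episode sizes
support = rng.normal(size=(n_way, k_shot, dim))   # k labeled embeddings per class
queries = rng.normal(size=(20, dim))              # unlabeled fragments to classify

prototypes = support.mean(axis=1)                 # one mean embedding per class
dists = np.linalg.norm(queries[:, None, :] - prototypes[None, :, :], axis=-1)
pred = dists.argmin(axis=1)                       # nearest prototype wins
print(pred)
```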