BeatPD-CLSP-JHU

BEATPD

This GitHub repository contains the code to reproduce the results obtained by the team JHU-CLSP during the BeatPD challenge. Data description and copyright:

These data were generated by participants of The Michael J. Fox Foundation for Parkinson's Research Mobile or Wearable Studies. They were obtained as part of the Biomarker & Endpoint Assessment to Track Parkinson's Disease DREAM Challenge (through Synapse ID syn20825169) made possible through partnership of The Michael J. Fox Foundation for Parkinson's Research, Sage Bionetworks, and BRAIN Commons.

The challenge had 4 submission rounds before the final submission. Hereafter, they are addressed as 1st submission, 2nd submission, 3rd submission, 4th submission, final submission.

For the final submission, we submitted:

  • ON/OFF:
    • CIS-PD: same as 3rd submission + foldaverage
    • REAL-PD: same as 3rd submission + foldaverage
  • Tremor:
    • CIS-PD: same as 3rd submission
    • REAL-PD: same as 3rd submission
  • Dyskinesia:
    • CIS-PD: same as 4th submission
    • REAL-PD: same as 3rd submission

This README walks you through re-creating our final submission. For a detailed write-up, and to re-create all submissions, please follow our wiki documentation.


Approaches

We followed 3 approaches over the course of our submissions:

Summary of the three approaches followed

Please note that because no development dataset was available, we performed 5-fold cross-validation for all approaches and analyzed the results of each dataset (CIS-PD and REAL-PD) separately.
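A minimal sketch of how measurements could be divided into five folds, with each fold held out once. The helper name `k_fold_indices` and the fixed seed are assumptions made for illustration; the fold assignment actually used in the repository may differ.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; not the repository's fold logic.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = sorted(folds[held_out])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != held_out for idx in fold
        )
        yield train_idx, test_idx


# Toy example: 12 measurements split into 5 folds.
for fold_number, (train_idx, test_idx) in enumerate(k_fold_indices(12, k=5)):
    print("fold", fold_number, "train:", len(train_idx), "test:", len(test_idx))
```

Every item lands in exactly one test fold, so predictions made on the held-out folds together cover the whole training set.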

  • Approach I : TSFRESH + XGBOOST

tsfresh extracts statistical features from the signal. xgboost then operates on the resulting tabular features, using decision trees to select the important ones and combine them into strong predictions.

  • Approach II : AE + i-vector + Support Vector Regression (SVR)

The problem with using Deep Neural Network-based techniques directly on signals from wearable devices is that there is only one label for a 20-minute file. So, the first step is to reduce the raw signal to features. We used a DNN-based auto-encoder (AE) to extract features. We then used a representation learning method called i-vector to convert the features into a vector of fixed size, regardless of the length of the signal. In this way, the combination of a trained AE and an i-vector extractor yields a single fixed-size vector per signal. With i-vectors as features, we used Support Vector Regression (SVR) with a linear kernel to predict the labels.
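To illustrate why a fixed-size representation matters, the sketch below collapses frame-level feature sequences of different lengths into one vector per recording. It uses plain mean pooling as a deliberately simple stand-in: it is not an i-vector extractor, and `pool_frames` is a hypothetical name.

```python
def pool_frames(frames):
    """Average a list of equal-width frame vectors into one fixed-size vector.

    Stand-in for the i-vector step: mean pooling only, for illustration.
    """
    if not frames:
        raise ValueError("need at least one frame")
    width = len(frames[0])
    if any(len(frame) != width for frame in frames):
        raise ValueError("all frames must have the same width")
    n_frames = len(frames)
    return [sum(frame[d] for frame in frames) / n_frames for d in range(width)]


# Two recordings with different numbers of frames, each frame of width 3.
short_recording = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
long_recording = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]

print(pool_frames(short_recording))  # [2.0, 3.0, 4.0]
print(pool_frames(long_recording))   # [1.5, 1.5, 1.5]
```

Both recordings end up as vectors of the same width, which is what lets a single regressor consume one vector and one label per file.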

  • Approach III : Fusion

A fusion of the predictions from Approach I and Approach II was done using either:

  • Gradient boosting regression. The regressor was trained with the predicted labels from the testing folds from cross-validations.
  • Average of predictions



Step-By-Step guide for setting up environment and data

This step-by-step guide will cover the following steps:

Clone our repository from git :

In your terminal, run the following command to clone our repository from git

$ git clone https://github.com/Mymoza/BeatPD-CLSP-JHU.git

Set up the environment :

We use Python for the majority of our scripts, and Jupyter Notebook to provide an interactive environment. To run our scripts, please create an environment from the requirements.txt file by following these steps:

$ conda create --name BeatPD python=3.5 --file requirements.txt
$ conda activate BeatPD 
  • Note: Make sure that the Jupyter notebook is running on BeatPD kernel.

If the conda environment isn't showing in Jupyter kernels (Kernel > Change Kernel > BeatPD), run:

$ ipython kernel install --user --name=BeatPD

You will then be able to select BeatPD as your kernel.

Install kaldi :

You need to install Kaldi. For installation, you can use either the official install instructions or the easy install instructions if you find the official one difficult to follow.




Data Pre-Processing

The first step is to prepare the data given by the challenge. All the pre-processing steps are done in the Jupyter Notebook prepare_data.ipynb.

  1. Download the training_data, the ancillary_data and the testing_data from the challenge website
  2. mkdir BeatPD_data : Create a folder to contain all the data for the challenge. Put all the .tar.bz2 files you just downloaded in this newly created folder, as well as cis-pd.CIS-PD_Test_Data_IDs.csv and real-pd.REAL-PD_Test_Data_IDs.csv.
BeatPD_data $ ls
cis-pd.ancillary_data.tar.bz2    real-pd.ancillary_data_updated.tar.bz2
cis-pd.CIS-PD_Test_Data_IDs.csv  real-pd.data_labels.tar.bz2
cis-pd.data_labels.tar.bz2       real-pd.REAL-PD_Test_Data_IDs.csv
cis-pd.testing_data.tar.bz2      real-pd.testing_data_updated.tar.bz2
cis-pd.training_data.tar.bz2     real-pd.training_data_updated.tar.bz2
  3. Open the notebook prepare_data.ipynb
  4. Change the data_dir variable to the absolute path of the BeatPD_data folder that contains the data given by the challenge.
  5. Execute the cells under Extract initial data. When they finish, you should have the following directories:
<path-to-BeatPD_data>  $ ls
cis-pd.ancillary_data  cis-pd.testing_data   real-pd.ancillary_data  real-pd.testing_data
cis-pd.data_labels     cis-pd.training_data  real-pd.data_labels     real-pd.training_data
  6. Execute the rest of the cells in the Notebook. It will create several folders needed to reproduce the experiments. The data directory structure is documented in the wiki.
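For readers who want to see what the extraction under Extract initial data amounts to, here is a minimal sketch that unpacks every .tar.bz2 archive in a folder using the standard library. `extract_archives` is a hypothetical helper, not the notebook's code, and the directory names you end up with depend on what each archive contains.

```python
import tarfile
from pathlib import Path


def extract_archives(data_dir):
    """Extract every .tar.bz2 archive found directly inside data_dir.

    Illustrative helper, not the notebook's code. Returns the names of the
    archives that were extracted, in sorted order.
    """
    data_dir = Path(data_dir)
    extracted = []
    for archive in sorted(data_dir.glob("*.tar.bz2")):
        # "r:bz2" opens a bzip2-compressed tar archive for reading.
        with tarfile.open(str(archive), "r:bz2") as tar:
            tar.extractall(path=str(data_dir))
        extracted.append(archive.name)
    return extracted
```

It could be called as `extract_archives("<path-to-BeatPD_data>")` with the same absolute path you assign to data_dir in the notebook. Only run it on archives you trust, such as the ones downloaded from the challenge website.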



Code for all approaches

Approach I : tsfresh + xgboost

For this scheme, all the files are in <path-github-repo>/tsfresh/submit/.

|-- run.sh : CIS-PD - Submission 3 - run the tsfresh + xgboost scheme without per patient tuning 
|-- run_perpatient.sh : CIS-PD - Submission 4 - run the tsfresh + xgboost scheme with per patient tuning
|-- run_realpd.sh : REAL-PD - Submission 4 - run the tsfresh + xgboost scheme without per patient tuning  
|
|-- data: Challenge data
     |-- label.csv  
|-- exp: Feature extraction jobs that were divided in 32 subsets
|-- features: Folder containing the extracted features
     |-- cis-pd.training.csv
     |-- cis-pd.testing.csv 
|
|-- mdl:
     |-- cis-pd.conf : best config for the three subchallenges 
     |-- cis-pd.****.conf : best config tuned per patient for the three subchallenges 
|
|-- src: Folder containing the files to generate features and predictions 
     |
     |--- generator.py: Feature extraction for CIS 
     |
     |--- gridsearch.py: Find best hyperparams and save them to a file
                         (same params for all subjects)
     |
     |--- gridsearch_perpatient.py: Find best hyperparams for each subject
                                    and save them to a file
     |
     |--- predict.py: Predicts and creates submission files
     |--- predict_perpatient.py: Predict with perpatient tuning 
|
|-- submission: Folder containing the CSV files with predictions to submit
|-- utils: soft link to kaldi/egs/wsj/s5/utils/

Prepare the environment and create a symbolic link:

  1. Create a softlink from tsfresh/submit/utils/ to kaldi/egs/wsj/s5/utils/.
  2. cd tsfresh/submit/
  3. conda create --name BeatPD_xgboost --file tsfresh_xgboost_environment.yml
  4. conda activate BeatPD_xgboost
  5. In the data/ folder, add BEAT-PD_SC1_OnOff_Submission_Template.csv, BEAT-PD_SC2_Dyskinesia_Submission_Template.csv and BEAT-PD_SC3_Tremor_Submission_Template.csv downloaded from the challenge

As you can see in our write-up, for the final submission, the following sections need to be generated to create predictions files for tsfresh.

The following sections explain how to reproduce our final submission.

ON/OFF - Submission 5 (final submission) for CIS-PD and REAL-PD

Instead of training one model on the whole training set, we used our 5 folds to train five different models and averaged the predictions from those five models. The benefit of this approach is that, for each model, we can use the test fold for early stopping to avoid overfitting. Combining five systems may also improve the overall performance.
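The averaging step can be sketched as follows, assuming each fold model outputs one prediction per test measurement. `average_fold_predictions` and the toy ids are hypothetical; the actual implementation is driven by run_foldaverage.sh.

```python
def average_fold_predictions(fold_predictions):
    """Average per-measurement predictions across models trained on different folds.

    fold_predictions: list of dicts mapping measurement id -> prediction,
    one dict per fold model. All dicts must cover the same measurement ids.
    """
    if not fold_predictions:
        raise ValueError("need predictions from at least one model")
    ids = set(fold_predictions[0])
    if any(set(preds) != ids for preds in fold_predictions):
        raise ValueError("every model must predict the same measurement ids")
    n_models = len(fold_predictions)
    return {
        mid: sum(preds[mid] for preds in fold_predictions) / n_models
        for mid in sorted(ids)
    }


# Toy predictions from five fold models for two test measurements.
fold_predictions = [
    {"m1": 1.0, "m2": 2.0},
    {"m1": 2.0, "m2": 2.0},
    {"m1": 3.0, "m2": 2.0},
    {"m1": 2.0, "m2": 3.0},
    {"m1": 2.0, "m2": 1.0},
]
print(average_fold_predictions(fold_predictions))  # {'m1': 2.0, 'm2': 2.0}
```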

  1. In run_foldaverage.sh, edit the absolute path to the CIS-PD_Test_Data_IDs_Labels.csv and REAL-PD_Test_Data_IDs_Labels.csv that are currently hardcoded.
  2. Run run_foldaverage.sh, which will run the necessary code for both databases. It will create the following files:
    1. submission/cis-pd.on_off_new.csv : file containing predictions on the test subset for CIS-PD.
    2. submission/<watchgyr - watchacc - phoneacc>_on_off.csv : For REAL-PD on test subset

Tremor - Submission 3 for CIS-PD & REAL-PD

  1. In run.sh, in the section to generate submission files, edit the absolute path to the CIS-PD_Test_Data_IDs_Labels.csv that is currently hardcoded.
  2. Run ./run.sh. You might need to make some changes to this file, as it is written to be run on a grid engine.
    • It will split the CIS-PD training and testing csv files into 32 subsets and submit 32 jobs to do feature extraction. Then, it will merge all of them to store the features in the features/ directory. This step only needs to be run once.
    • Then it will perform a GridSearch, saving the best config
    • Finally, it will create predictions files to be submitted in the submission/ folder.
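The split-then-merge pattern used for feature extraction can be sketched as below. This is only an illustration of dividing work into 32 near-equal parts and concatenating the results; the function names are hypothetical, and the real job submission in run.sh targets a grid engine.

```python
def split_into_subsets(rows, n_subsets=32):
    """Split rows into n_subsets contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(rows), n_subsets)
    subsets = []
    start = 0
    for i in range(n_subsets):
        # The first `extra` chunks take one additional row each.
        size = base + (1 if i < extra else 0)
        subsets.append(rows[start:start + size])
        start += size
    return subsets


def merge_subsets(subsets):
    """Concatenate per-subset results back into one list, preserving order."""
    return [row for subset in subsets for row in subset]


rows = ["measurement_{}".format(i) for i in range(100)]
subsets = split_into_subsets(rows, n_subsets=32)
print(len(subsets), [len(s) for s in subsets[:5]])  # 32 [4, 4, 4, 4, 3]
```

Because the chunks are contiguous, merging them in order restores the original row order.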

We used the same hyperparameters for all three tasks, expecting them to generalize. We therefore ran three hyperparameter searches (on/off, tremor, dysk) and compared their performance to pick the best one.

For CIS-PD, the best performance was obtained with tremor. For REAL-PD, it was watch_gyr tremor.

Dyskinesia - CIS-PD & REAL-PD

Submission 4 - CIS-PD

The following performs per-patient tuning, which we submitted in the 4th intermediate round. It applies to the CIS-PD database.

  1. In run_perpatient.sh, in the section to generate submission files, edit the absolute path to the CIS-PD_Test_Data_IDs_Labels.csv that is currently hardcoded.
  2. ./run_perpatient.sh
    • It will run gridsearch_perpatient.py on every task and create per-subject config files such as mdl/cis-pd.on_off.1004.conf
    • Then, it will create prediction files to be submitted in the submission folder, such as submission/cis-pd.on_off.perpatient.csv.
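Conceptually, per-patient tuning keeps the best-scoring hyperparameter setting separately for each subject. The sketch below shows that selection step only, under the assumption that lower error is better; `best_config_per_subject`, the parameter values, and the error values are made up for illustration and do not come from gridsearch_perpatient.py.

```python
def best_config_per_subject(results):
    """Pick the lowest-error hyperparameter setting for each subject.

    results: iterable of (subject_id, params, error) tuples, where params is
    a dict of hyperparameters and error is a cross-validation error
    (lower is better).
    """
    best = {}
    for subject_id, params, error in results:
        if subject_id not in best or error < best[subject_id][1]:
            best[subject_id] = (params, error)
    return best


# Toy grid-search output; parameter and error values are made up.
results = [
    (1004, {"max_depth": 3, "learning_rate": 0.1}, 1.30),
    (1004, {"max_depth": 5, "learning_rate": 0.1}, 1.15),
    (1007, {"max_depth": 3, "learning_rate": 0.1}, 0.09),
    (1007, {"max_depth": 5, "learning_rate": 0.1}, 0.12),
]
for subject_id, (params, error) in sorted(best_config_per_subject(results).items()):
    print(subject_id, params, error)
```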

Submission 3 - REAL-PD

  1. In run_realpd.sh, edit the absolute path hardcoded to the REAL-PD labels and write your own path to the labels you downloaded from the website of the challenge.
  2. Run ./run_realpd.sh
    • This will create features in exp/, then merge them, like this: features/watchgyr_total.scp.csv
    • Then it will perform GridSearch. Because the same hyperparameters are used for all three tasks, we expected them to generalize, so we ran three hyperparameter searches (on/off, tremor, dysk) and compared their performance. For REAL-PD, the best was watchgyr with tremor. That is why all the other GridSearch combinations are commented out in the code; only the one used for the 4th submission will be run. The best hyperparameters found will be stored in mdl/real-pd.conf
    • Then we predict the results using src/predict_realpd.py. The predictions will be stored in submission/watchgyr_tremor.csv.

Approach II : AE + i-vectors + SVR

For dyskinesia, in the final submission, we performed a fusion with the average of the predictions between Approach 1 and Approach 2. The following section will help you create the files needed to perform the fusion.

AutoEncoder (AE) features

Train the AutoEncoder

  1. At the moment, all the code needed for the AE lives on a branch. So the first step is to checkout that branch with git checkout marie_ml_dl_real.
  2. conda env create --file environment_ae.yml : This will create the keras_tf2 environment you need to run AE experiments.
  3. Train an AE model and save its features:
    • For CIS-PD: At line 51 of the train_AE.py file, change the save_dir path to the directory where you want to store the AE models, which will be referred to as <your-path-to-AE-Features>.
    • For REAL-PD: At line 53 of the train_AE_real.py file, change the save_dir path to the directory where you want to store the AE models.
  4. Launch the training for the configurations you want. Some examples are available in this wiki page about Creating AutoEncoder Features. To reproduce the results of submission 4, you will need the following command, which uses features of length 30 and a frame length of 400, with the inactivity removed:
python train_AE.py --saveAEFeats -dlP '{"remove_inactivity": "True", "my_data_path": "<path-to-BeatPD-data>/cis-pd.training_data/", "my_mask_path": "<your-path-to-AE-features>/cis-pd.training_data.high_pass_mask/"}' --saveFeatDir "<your-path-to-AE-features>/AE_30ft_orig_inactivity_removed/"
  5. This should create the file <your-path-to-AE-features>/<Weights>/mlp_encoder_uad_False_ld_30.h5, and the features will be saved in the directory provided with the --saveFeatDir flag.

  6. Also generate features on the testing subset of the challenge with the following command:

python test_AE.py -dlP '{"my_data_path": "<path-to-BeatPD-data>/cis-pd.testing_data/", "my_mask_path": "<your-path-to-AE-features>/cis-pd.testing_data.high_pass_mask/", "remove_inactivity": "True"}' --saveAEFeats --saveFeatDir "<your-path-to-AE-features>/cis_testing_AE_30ft_orig_inactivity_removed"

Create i-vectors

After creating Autoencoder features, we can create i-vectors. The following steps will vary a lot depending on what i-vector you want to create. You will need to create dysk_noinact_auto30 to reproduce our final submission.

  1. cd <your-path-to-kaldi>/kaldi/egs/ : Change your directory to where you installed Kaldi.
  2. mkdir beatPDivec; cd beatPDivec : Create a directory to hold the i-vectors.
  3. cp <your-path-github-repo>/sid_novad/* ../sre08/v1/sid/. : Copy the novad.sh files from the repository to your Kaldi's directory
  4. mkdir <i-vector-name> : Create a folder with a meaningful name about the i-vectors we want to create. The nomenclature we used to name the i-vectors we created was also documented in the wiki. To reproduce the final submission, create dysk_noinact_auto30.
  5. cd <i-vector-name> : Change your directory to the i-vector folder you just created
  6. mkdir data
  7. cp -rf <your-path-github-repo>/beatPDivec/default_data/v2_auto/. .
  8. cp -rf <your-path-github-repo>/beatPDivec/default_data/autoencData/data/<onoff - tremor - dyskinesia>/. data/. : Copy the data for the task. For the final submission, use dyskinesia.
  9. ln -s ../../sre08/v1/sid sid; ln -s ../../sre08/v1/steps steps; ln -s ../../sre08/v1/utils utils : Create symbolic links to Kaldi's sid, steps, and utils directories
  10. vim runFor.sh: Edit the following variables:
    • subChallenge: use either onoff, tremor, or dysk.
    • sDirFeats: use the absolute path to the AE features you want to use. For the final submission, use sDirFeats=<path-to-AE-features>/AE_30ft_orig_inactivity_removed
  11. ./runFor.sh

Get Predictions CSV

Per Patient SVR

Option 1 - For the test subset of the challenge

  1. cd to the i-vector location. For example, cd <your-path-to-kaldi>/kaldi/egs/beatPDivec/dysk_noinact_auto30/ is the location of the i-vector used for the 4th submission.
  2. In the file <your-path-to-github-repo>/beatPDivec/default_data/v2_auto/local/pca_svr_bpd2.sh, make sure that the flag --bPatientPredictionsPkl is added to create pkl files for each subject_id, like this:
$cmd $sOut/pca_${iComponents}_svr_${sKernel}_${fCValueStr}_${fEpsilon}Testx.log \
     pca_knn_bpd2.py --input-trai $sFileTrai \
     --input-test $sFileTest \
     --output-file $sOut \
     --iComponents $iComponents \
     --sKernel $sKernel \
     --fCValue $fCValue \
     --fEpsilon $fEpsilon \
     --bPatientPredictionsPkl
conda deactivate
  3. Run runFinalsubm3_2.sh, which calls run_Final_auto.sh and creates the folder resiVecPerPatientSVR_Fold_all for the test subset. Before running it, edit the following variables:
    • sDirFeatsTest to point to the folder where you have extracted testing features with the AE, <your-path-to-AE-features>/cis_testing_AE_30ft_orig_inactivity_removed
    • sDirFeatsTrai to point to the folder where there is the training data <your-path-to-AE-features>/AE_30ft_orig_inactivity_removed
    • ivecDim : The i-vector size you are interested in, for the final submission, use ivecDim=650.
  4. Go to CreateCSV_test.ipynb. We will use the function generateCSVtest_per_patient to create a CSV containing test predictions for all subject_ids.

  5. Provide the variables best_config, dest_dir, and src_dir. To reproduce the final submission, simply keep best_config as it is and replace the paths with yours. The following code shows exactly what you should use:

best_config = {1004: ['/objs_450_kernel_linear_c_0.002_eps_0.1.pkl', 1.1469489658686098],
 1007: ['/objs_100_kernel_linear_c_0.002_eps_0.1.pkl', 0.09115239389591206],
 1019: ['/objs_400_kernel_linear_c_0.2_eps_0.1.pkl', 0.686931370820251],
 1023: ['/objs_300_kernel_linear_c_0.2_eps_0.1.pkl', 0.8462093717280431],
 1034: ['/objs_100_kernel_linear_c_20.0_eps_0.1.pkl', 0.7961188257851409],
 1038: ['/objs_450_kernel_linear_c_0.002_eps_0.1.pkl', 0.3530848340426855],
 1039: ['/objs_450_kernel_linear_c_0.2_eps_0.1.pkl', 0.3826339325882311],
 1043: ['/objs_300_kernel_linear_c_0.2_eps_0.1.pkl', 0.5525085362997469],
 1044: ['/objs_50_kernel_linear_c_0.002_eps_0.1.pkl', 0.09694768640213237],
 1048: ['/objs_650_kernel_linear_c_0.2_eps_0.1.pkl', 0.4505302952804157],
 1049: ['/objs_250_kernel_linear_c_0.2_eps_0.1.pkl', 0.4001809543831368]}

dest_dir='<your-path-to-kaldi>/kaldi/egs/beatPDivec/dysk_noinact_auto30/exp/ivec_650/resiVecPerPatientSVR_Fold_all/'
src_dir='<your-path-to-kaldi>/kaldi/egs/beatPDivec/dysk_noinact_auto30/exp/ivec_650/resiVecPerPatientSVR_Fold_all/'

generateCSVtest_per_patient(src_dir, dest_dir, best_config)

If you want to experiment with other best_config values, the dictionary for best_config is obtained in this file: cat <your-path-to-kaldi>/kaldi/egs/beatPDivec/dysk_noinact_auto30/exp/ivec_650/globalAccuPerPatientSVR_Test.log

  6. Run that cell, and it will create a csv file in the location given by dest_dir.
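The .pkl names stored in best_config follow a visible pattern, so they can be decoded when comparing configurations. The sketch below assumes the fields are, in order, the number of PCA components, the SVR kernel, the C value, and epsilon; this reading is inferred from the log-file name used in pca_svr_bpd2.sh and should be treated as an assumption. `parse_config_name` is a hypothetical helper.

```python
import re

# Matches names such as "/objs_450_kernel_linear_c_0.002_eps_0.1.pkl".
_PKL_PATTERN = re.compile(
    r"objs_(?P<components>\d+)_kernel_(?P<kernel>[a-z]+)"
    r"_c_(?P<c>[0-9.]+)_eps_(?P<epsilon>[0-9.]+)\.pkl$"
)


def parse_config_name(pkl_name):
    """Decode a best_config file name into its (assumed) hyperparameters."""
    match = _PKL_PATTERN.search(pkl_name)
    if match is None:
        raise ValueError("unrecognized config name: {!r}".format(pkl_name))
    return {
        "components": int(match.group("components")),
        "kernel": match.group("kernel"),
        "c": float(match.group("c")),
        "epsilon": float(match.group("epsilon")),
    }


print(parse_config_name("/objs_450_kernel_linear_c_0.002_eps_0.1.pkl"))
```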

Approach III : Fusion

For the second and fourth submissions, we fused the predictions of an SVR and xgboost.

The code to perform the fusion for the fourth submission is in the notebook called Fusion.ipynb.

Fusion for test subset of the challenge

  1. Open the Fusion.ipynb notebook
  2. Edit the variables:
    • sPathKaldi : Path to your folder containing the kaldi installation <your-path-to-kaldi>. Do not include /kaldi/.
    • sDirOut : Path to where you want to create the final CSV files with predictions
    • sPathData : Path to the folder containing the data we downloaded and extracted from the website of the challenge

The following sections will explain how to do the fusion of REAL-PD subtypes, how to do the fusion for the CIS-PD database between the two approaches, and finally, how to merge the CIS-PD and REAL-PD predictions.

Fusion of REAL-PD subtypes predictions

For the REAL-PD database, multiple sensors are provided : phoneacc, watchacc, and watchgyr. We can only submit one value for each measurement, so we used the following method:

  • If a measurement has 3 predictions (no missing files), we average the two closest values and discard the third.
  • If there are only two predictions for a measurement, we simply average them.
  • If there is only one prediction out of the three subtypes, we use that value.
  1. The fusion of REAL-PD subtypes predictions is made with the real_average_fusion function.
  2. Execute the cells under the header Fusion for REAL-PD sensors. There will be a cell with the function declared, then 3 others for ON/OFF, Tremor, and dyskinesia.
  3. It will create three files that can be sent to the challenge as our final submission:
    • submissionRealPDon_off.csv
    • submissionRealPDtremor.csv
    • submissionRealPDdyskinesia.csv
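The three rules for combining sensor predictions can be sketched as a small function. This is not the notebook's real_average_fusion; `fuse_sensor_predictions` is a hypothetical name, and the tie-break (preferring the lower pair when both gaps are equal) is an assumption, since the rules do not cover that case.

```python
def fuse_sensor_predictions(predictions):
    """Combine up to three per-sensor predictions for one measurement.

    predictions: list of floats, with missing sensors simply left out.
    """
    values = sorted(predictions)
    if not values or len(values) > 3:
        raise ValueError("expected between one and three predictions")
    if len(values) == 1:
        return values[0]
    if len(values) == 2:
        return (values[0] + values[1]) / 2
    low, mid, high = values
    # After sorting, the closest pair is either (low, mid) or (mid, high).
    # Assumption: on a tie, the lower pair is used.
    if mid - low <= high - mid:
        return (low + mid) / 2
    return (mid + high) / 2


print(fuse_sensor_predictions([1.0, 1.5, 4.0]))  # 1.25 (4.0 is discarded)
print(fuse_sensor_predictions([0.5, 2.0]))       # 1.25
print(fuse_sensor_predictions([2.0]))            # 2.0
```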

Fusion of CIS-PD Dyskinesia predictions for the two approaches

  1. Execute the cells under the heading "Submission 4 - Average of predictions for Approach 1 and 2 - CIS-PD"
  2. There will be an output telling you where your predictions file for dyskinesia was created, like so:
Submission file was created: <your-path>/submissionCisPDdyskinesia.csv

Fusion of CIS-PD and REAL-PD predictions

  1. Still in Fusion.ipynb, execute the cells under the heading Merge CIS-PD and REAL-PD predictions in one CSV file. It will create the final submission files for the three subtasks to be sent to the challenge.
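Merging the two databases' predictions amounts to concatenating CSV files that share a header. Here is a minimal sketch with the standard csv module; `merge_prediction_csvs` is hypothetical, and the column names and ids in the toy data are assumptions rather than the notebook's actual output.

```python
import csv
import io


def merge_prediction_csvs(sources, destination):
    """Concatenate prediction CSVs that share one header into a single CSV.

    sources: iterable of readable text streams; destination: writable text
    stream. Returns the number of data rows written.
    """
    writer = csv.writer(destination, lineterminator="\n")
    header = None
    n_rows = 0
    for source in sources:
        reader = csv.reader(source)
        source_header = next(reader, None)
        if source_header is None:
            continue  # skip empty inputs
        if header is None:
            header = source_header
            writer.writerow(header)
        elif source_header != header:
            raise ValueError("all inputs must share the same header")
        for row in reader:
            writer.writerow(row)
            n_rows += 1
    return n_rows


# Toy inputs; the column names and ids are assumptions for illustration.
cis_pd = io.StringIO("measurement_id,prediction\ncis-0001,1.5\n")
real_pd = io.StringIO("measurement_id,prediction\nreal-0001,0.5\nreal-0002,2.0\n")
merged = io.StringIO()
print(merge_prediction_csvs([cis_pd, real_pd], merged))  # 3
print(merged.getvalue())
```

The same function works with files opened via `open(path, newline="")` in place of the in-memory streams.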



Data Augmentation

  1. First generate the files we need. You can do so in the DataAugmentation notebook. (It currently generates only training files, not files for the test set given by the challenge, because the test labels are not public and those results cannot be evaluated at the moment.)

  2. Tsfresh needs scp files containing the path to each training file. These are stored in tsfresh/submit/data/.

  3. cd tsfresh/submit/

  4. ./create_scp_files.sh combhpfnoinact.resample_0.9 : This will create the new scp files needed for both training and testing data. The naming convention is cis-pd.training.{argument given}.scp and cis-pd.testing.{argument given}.scp

  5. Duplicate any run_extract_features*.sh file and edit two variables:

    • Change the recog_set variable for the name of the scp files we just created, like so:
    recog_set="cis-pd.training.combhpfnoinact.resample_0.9 cis-pd.testing.combhpfnoinact.resample_0.9"
    
    • Edit the logdir directory for a folder where the jobs can be executed.
    logdir=exp/combhpfnoinact.resample_0.9
    
  6. Launch the extraction of the features:

qsub -l mem_free=30G,ram_free=30G -pe smp 6 -cwd -e /export/b19/mpgill/errors/errors_run_extract_features_resample_combhpfnoinact_0.9 -o /export/b19/mpgill/outputs/outputs_run_extract_features_resample_combhpfnoinact_0.9 run_extract_features_resample_combhpfnoinact_0.9.sh

References

  • The Biomarker and Endpoint Assessment to Track Parkinson's Disease (BEAT-PD) Challenge
  • Christ, M., Braun, N., Neuffer, J. and Kempa-Liehr A.W. (2018). Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh -- A Python package). Neurocomputing 307 (2018) 72-77, doi:10.1016/j.neucom.2018.03.067. GitHub
  • Dehak, Najim, et al. "Front-end factor analysis for speaker verification." IEEE Transactions on Audio, Speech, and Language Processing 19.4 (2010): 788-798.