This GitHub repository contains the code to reproduce the results obtained by the team JHU-CLSP during the BeatPD challenge.
Data description and copyright:
These data were generated by participants of The Michael J. Fox Foundation for Parkinson's Research Mobile or Wearable Studies. They were obtained as part of the Biomarker & Endpoint Assessment to Track Parkinson's Disease DREAM Challenge (through Synapse ID syn20825169) made possible through partnership of The Michael J. Fox Foundation for Parkinson's Research, Sage Bionetworks, and BRAIN Commons.
The challenge had 4 submission rounds before the final submission. Hereafter, they are addressed as 1st submission, 2nd submission, 3rd submission, 4th submission, final submission.
For the final submission, we submitted:
ON/OFF
- CIS-PD: same as 3rd submission + foldaverage
- REAL-PD: same as 3rd submission + foldaverage
Tremor
- CIS-PD: same as 3rd submission
- REAL-PD: same as 3rd submission
Dyskinesia
- CIS-PD: same as 4th submission
- REAL-PD: same as 3rd submission
This README walks you through re-creating our final submission. For a detailed write-up, and to re-create all submissions, please follow our wiki documentation.
Please note that, because no development dataset was available, we performed 5-fold cross-validation for all approaches and analyzed the results of each dataset (CIS-PD and REAL-PD) separately.
We followed 3 approaches over the course of our submissions:
- Approach I : TSFRESH + XGBOOST
tsfresh extracts statistical features from the signal. xgboost then works on the resulting tabular features, using decision trees to select the important ones and combine them into strong predictions.
- Approach II : AE + i-vector + Support Vector Regression (SVR)
The problem with using Deep Neural Network-based techniques directly on signals from wearable devices is that there is only one label for a 20-minute file. So, the first step is to reduce the raw signal to features. We used a DNN-based auto-encoder (AE) to extract features. We then used a representation learning method called i-vector to convert the features into a vector of fixed size, regardless of the length of the signal. In this way, the combination of a trained AE and an i-vector extractor yields a single fixed-size vector per signal. Using i-vectors as features, we used Support Vector Regression (SVR) with a linear kernel to predict the labels.
- Approach III : Fusion
A fusion of the predictions from Approach I and Approach II was done using either:
- Gradient boosting regression. The regressor was trained with the predicted labels from the testing folds from cross-validations.
- Average of predictions
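The "average of predictions" option can be sketched as follows. This is a minimal illustration, not the code in Fusion.ipynb: the function name `average_fusion`, the measurement ids, and the choice to fuse only ids present in every system are assumptions made for the example.

```python
from statistics import mean


def average_fusion(*prediction_sets):
    """Average the predictions that share a measurement id across systems.

    Each argument maps measurement_id -> predicted label. Only ids present
    in every system are fused (an assumption of this sketch).
    """
    common_ids = set(prediction_sets[0])
    for predictions in prediction_sets[1:]:
        common_ids &= set(predictions)
    return {
        measurement_id: mean(p[measurement_id] for p in prediction_sets)
        for measurement_id in sorted(common_ids)
    }


# Hypothetical predictions from Approach I and Approach II.
xgboost_predictions = {"m1": 1.0, "m2": 2.5}
svr_predictions = {"m1": 2.0, "m2": 1.5}
fused = average_fusion(xgboost_predictions, svr_predictions)
print(fused)
```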
This step-by-step guide will cover the following steps:
- Clone our repository from git
- Set up the environment
- Data Pre-Processing
- Code for all approaches
- Approach I : TSFRESH + XGBOOST
- Approach II : AutoEncoder (AE) + i-vectors + Get predictions CSV
- Approach III : Fusion
In your terminal, run the following command to clone our repository from git
$ git clone https://github.com/Mymoza/BeatPD-CLSP-JHU.git
We use Python for the majority of our scripts, and Jupyter notebooks to provide an interactive environment. To run our scripts, please create an environment from the requirements.txt file by following these steps:
$ conda create --name BeatPD python=3.5 --file requirements.txt
$ conda activate BeatPD
- Note: Make sure that the Jupyter notebook is running on the BeatPD kernel.
If the conda environment isn't showing in Jupyter kernels (Kernel > Change Kernel > BeatPD), run:
$ ipython kernel install --user --name=BeatPD
You will then be able to select BeatPD as your kernel.
You need to install Kaldi. For installation, you can use either the official install instructions or the easy install instructions if you find the official one difficult to follow.
The first step is to prepare the data given by the challenge. All the pre-processing steps are done in the Jupyter Notebook prepare_data.ipynb.
- Download the training_data, the ancillary_data and the testing_data from the challenge website
- Create a folder to contain all the data for the challenge:
mkdir BeatPD_data
- Put all the .tar.bz2 files you just downloaded for the challenge in this newly created folder, as well as cis-pd.CIS-PD_Test_Data_IDs.csv and real-pd.REAL-PD_Test_Data_IDs.csv.
BeatPD_data $ ls
cis-pd.ancillary_data.tar.bz2 real-pd.ancillary_data_updated.tar.bz2
cis-pd.CIS-PD_Test_Data_IDs.csv real-pd.data_labels.tar.bz2
cis-pd.data_labels.tar.bz2 real-pd.REAL-PD_Test_Data_IDs.csv
cis-pd.testing_data.tar.bz2 real-pd.testing_data_updated.tar.bz2
cis-pd.training_data.tar.bz2 real-pd.training_data_updated.tar.bz2
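Before opening the notebook, it can help to check that the folder holds every file from the listing above. The following is a small sketch for that check only; the helper name `missing_files` is hypothetical and it is not part of the repository.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = [
    "cis-pd.ancillary_data.tar.bz2",
    "cis-pd.CIS-PD_Test_Data_IDs.csv",
    "cis-pd.data_labels.tar.bz2",
    "cis-pd.testing_data.tar.bz2",
    "cis-pd.training_data.tar.bz2",
    "real-pd.ancillary_data_updated.tar.bz2",
    "real-pd.data_labels.tar.bz2",
    "real-pd.REAL-PD_Test_Data_IDs.csv",
    "real-pd.testing_data_updated.tar.bz2",
    "real-pd.training_data_updated.tar.bz2",
]


def missing_files(data_dir):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # Replace "BeatPD_data" with the path to your own data folder.
    for name in missing_files("BeatPD_data"):
        print("missing:", name)
```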
- Open the notebook prepare_data.ipynb
- Change the data_dir variable to the absolute path of the folder BeatPD_data that contains the data given by the challenge.
- Execute the cells under Extract initial data. You should have the following directories when it's done:
<path-to-BeatPD_data> $ ls
cis-pd.ancillary_data cis-pd.testing_data real-pd.ancillary_data real-pd.testing_data
cis-pd.data_labels cis-pd.training_data real-pd.data_labels real-pd.training_data
- Execute the rest of the cells in the Notebook. It will create several folders needed to reproduce the experiments. The data directory structure is documented in the wiki.
For this scheme, all the files are in <path-github-repo>/tsfresh/submit/.
|-- run.sh : CIS-PD - Submission 3 - run the tsfresh + xgboost scheme without per patient tuning
|-- run_perpatient.sh : CIS-PD - Submission 4 - run the tsfresh + xgboost scheme with per patient tuning
|-- run_realpd.sh : REAL-PD - Submission 4 - run the tsfresh + xgboost scheme without per patient tuning
|
|-- data: Challenge data
|-- label.csv
|-- exp: Feature extraction jobs that were divided in 32 subsets
|-- features: Folder containing the extracted features
|-- cis-pd.training.csv
|-- cis-pd.testing.csv
|
|-- mdl:
|-- cis-pd.conf : best config for the three subchallenges
|-- cis-pd.****.conf : best config tuned per patient for the three subchallenges
|
|-- src: Folder containing the files to generate features and predictions
|
|--- generator.py: Feature extraction for CIS
|
|--- gridsearch.py: Find best hyperparams and save them to a file
(same params for all subjects)
|
|--- gridsearch_perpatient.py: Find best hyperparams for each subject
and save them to a file
|
|--- predict.py: Predicts and creates submission files
|--- predict_perpatient.py: Predict with perpatient tuning
|
|-- submission: Folder containing the CSV files with predictions to submit
|-- utils: soft link to kaldi/egs/wsj/s5/utils/
Prepare the environment and create a symbolic link:
- Create a softlink from tsfresh/submit/utils/ to kaldi/egs/wsj/s5/utils/.
cd tsfresh/submit/
conda create --name BeatPD_xgboost --file tsfresh_xgboost_environment.yml
conda activate BeatPD_xgboost
- In the data/ folder, add BEAT-PD_SC1_OnOff_Submission_Template.csv, BEAT-PD_SC2_Dyskinesia_Submission_Template.csv and BEAT-PD_SC3_Tremor_Submission_Template.csv downloaded from the challenge.
As you can see in our write-up, the following sections need to be run to create the tsfresh prediction files for the final submission.
The following sections explain how to reproduce our final submission.
Instead of training one model on the whole training set, we used our 5 folds to get five different models and averaged the predictions from those five models. The benefit of this approach is that, for each model, we can use the test fold for early stopping to avoid overfitting. Combining five systems may also improve the overall performance.
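The averaging step itself can be sketched as below. This is only an illustration of averaging per-recording predictions across five fold models; the function name `fold_average` and the numbers are made up for the example, and the training with early stopping done in run_foldaverage.sh is not reproduced here.

```python
def fold_average(fold_predictions):
    """Average test-set predictions across the models trained on each fold.

    fold_predictions holds one list per fold model; every list contains the
    predictions for the same test recordings, in the same order.
    """
    n_models = len(fold_predictions)
    if any(len(p) != len(fold_predictions[0]) for p in fold_predictions):
        raise ValueError("every fold model must predict the same recordings")
    return [sum(values) / n_models for values in zip(*fold_predictions)]


# Hypothetical predictions of five fold models on three test recordings.
predictions_per_fold = [
    [1.0, 2.0, 0.0],
    [2.0, 2.0, 0.0],
    [1.0, 3.0, 1.0],
    [0.0, 2.0, 1.0],
    [1.0, 1.0, 0.5],
]
averaged = fold_average(predictions_per_fold)
print(averaged)
```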
- In run_foldaverage.sh, edit the absolute paths to CIS-PD_Test_Data_IDs_Labels.csv and REAL-PD_Test_Data_IDs_Labels.csv that are currently hardcoded.
- Run run_foldaverage.sh, which will run the necessary code for both databases. It will create the following files:
  - submission/cis-pd.on_off_new.csv: predictions on the test subset for CIS-PD.
  - submission/<watchgyr - watchacc - phoneacc>_on_off.csv: predictions on the test subset for REAL-PD.
- In run.sh, in the section to generate submission files, edit the absolute path to CIS-PD_Test_Data_IDs_Labels.csv that is currently hardcoded.
- Run ./run.sh. You might need to make some changes to this file, as it is written to be run on a grid engine.
  - It will split the CIS-PD training and testing csv files into 32 subsets and submit 32 jobs to do feature extraction. Then, it will merge all of them to store the features in the features/ directory. This step only needs to be run once.
  - Then it will perform a GridSearch, saving the best config.
  - Finally, it will create prediction files to be submitted in the submission/ folder.
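The split-and-merge idea behind the 32 feature extraction jobs can be illustrated as follows. How run.sh actually partitions the files is not shown in this README, so the round-robin split below is an assumption, and both function names are hypothetical.

```python
def split_into_subsets(rows, n_subsets=32):
    """Distribute rows over n_subsets lists in round-robin order."""
    return [rows[i::n_subsets] for i in range(n_subsets)]


def merge_subsets(subsets):
    """Concatenate the per-job results back into a single list."""
    return [row for subset in subsets for row in subset]


# Hypothetical list of 100 recordings to extract features from.
recordings = list(range(100))
subsets = split_into_subsets(recordings)
merged = merge_subsets(subsets)
print(len(subsets), len(merged))
```

Each subset would be handled by one job, and the merged list corresponds to the features stored in features/ once every job has finished.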
The same hyperparameters were used for all three tasks, so we expect them to generalize. We ran three hyperparameter searches (on on/off, tremor, and dyskinesia) and compared their performance to see which one was best.
For CIS-PD, the best performance was obtained with tremor. For REAL-PD, it was watch_gyr tremor.
The following performs the per-patient tuning that we submitted in the 4th intermediate round. It applies to the CIS-PD database.
- In run_perpatient.sh, in the section to generate submission files, edit the absolute path to CIS-PD_Test_Data_IDs_Labels.csv that is currently hardcoded.
- Run ./run_perpatient.sh
  - It will perform gridsearch_perpatient.py on every task. It will create files in mdl/cis-pd.on_off.1004.conf
  - Then, it will create prediction files to be submitted in the submission folder, like so: submission/cis-pd.on_off.perpatient.csv.
- In run_realpd.sh, edit the hardcoded absolute path to the REAL-PD labels so that it points to the labels you downloaded from the website of the challenge.
- Run ./run_realpd.sh
  - This will create features in exp/, then merge them, like this: features/watchgyr_total.scp.csv
  - Then it will perform a GridSearch. The same hyperparameters were used for all three tasks, so we expect them to generalize. We ran three hyperparameter searches (on on/off, tremor, and dyskinesia) and compared their performance to see which one was best. For REAL-PD, it was watchgyr and tremor. That's why all the other GridSearch combinations are commented out in the code. Only the one used for the 4th submission will be run. The best hyperparameters found will be stored in mdl/real-pd.conf
  - Then we predict the results using src/predict_realpd.py. The predictions will be stored in submission/watchgyr_tremor.csv.
For dyskinesia, in the final submission, we performed a fusion with the average of the predictions between Approach 1 and Approach 2. The following section will help you create the files needed to perform the fusion.
- At the moment, all the code needed for the AE lives on a branch. So the first step is to check out that branch with git checkout marie_ml_dl_real.
- conda env create --file environment_ae.yml: This will create the keras_tf2 environment you need to run AE experiments.
- Train an AE model & save its features:
  - For CIS-PD: At line 51 of the train_AE.py file, change the save_dir path to the directory where you want to store the AE models, which will be referred to as <your-path-to-AE-Features>.
  - For REAL-PD: At line 53 of the train_AE_real.py file, change the save_dir path to the directory where you want to store the AE models.
- Launch the training for the configurations you want. Some examples are available in this wiki page about Creating AutoEncoder Features. To reproduce the results of submission 4, you will need the following command, which uses features of length 30 and a frame length of 400, with the inactivity removed:
python train_AE.py --saveAEFeats -dlP '{"remove_inactivity": "True", "my_data_path": "<path-to-BeatPD-data>/cis-pd.training_data/", "my_mask_path": "<your-path-to-AE-features>/cis-pd.training_data.high_pass_mask/"}' --saveFeatDir "<your-path-to-AE-features>/AE_30ft_orig_inactivity_removed/"
- This should create the following file <your-path-to-AE-features>/<Weights>/mlp_encoder_uad_False_ld_30.h5 and the features will be saved in the directory provided with the --saveFeatDir flag.
- Also generate features on the testing subset of the challenge with the following command:
python test_AE.py -dlP '{"my_data_path": "<path-to-BeatPD-data>/cis-pd.testing_data/", "my_mask_path": "<your-path-to-AE-features>/cis-pd.testing_data.high_pass_mask/", "remove_inactivity": "True"}' --saveAEFeats --saveFeatDir "<your-path-to-AE-features>/cis_testing_AE_30ft_orig_inactivity_removed"
After creating Autoencoder features, we can create i-vectors. The following steps will vary a lot depending on what i-vector you want to create. You will need to create dysk_noinact_auto30 to reproduce our final submission.
- cd <your-path-to-kaldi>/kaldi/egs/ : Change your directory to where you installed Kaldi.
- mkdir beatPDivec; cd beatPDivec : Create a directory to hold the i-vectors.
- cp <your-path-github-repo>/sid_novad/* ../sre08/v1/sid/. : Copy the novad.sh files from the repository to your Kaldi's directory.
- mkdir <i-vector-name> : Create a folder with a meaningful name about the i-vectors we want to create. The nomenclature we used to name the i-vectors we created was also documented in the wiki. To reproduce the final submission, create dysk_noinact_auto30.
- cd <i-vector-name> : Change your directory to the i-vector folder you just created.
- mkdir data
- cp -rf <your-path-github-repo>/beatPDivec/default_data/v2_auto/. .
- cp -rf <your-path-github-repo>/beatPDivec/default_data/autoencData/data/<onoff - tremor - dyskinesia>/. data/. : Copy the data for the task. For the final submission, use dyskinesia.
- ln -s sid ../../sre08/v1/sid; ln -s steps ../../sre08/v1/steps; ln -s utils ../../sre08/v1/utils : Create symbolic links.
- vim runFor.sh : Edit the following variables:
  - subChallenge: use either onoff, tremor, or dysk.
  - sDirFeats: use the absolute path to the AE features you want to use. For the final submission, use sDirFeats=<path-to-AE-features>/AE_30ft_orig_inactivity_removed
- ./runFor.sh
- cd to the i-vector location. For example, cd <your-path-to-kaldi>/kaldi/egs/beatPDivec/dysk_noinact_auto30/ was the i-vector used for the 4th submission.
- In the file <your-path-to-github-repo>/beatPDivec/default_data/v2_auto/local/pca_svr_bpd2.sh, make sure that the flag --bPatientPredictionsPkl is added to create pkl files for each subject_id, like this:
$cmd $sOut/pca_${iComponents}_svr_${sKernel}_${fCValueStr}_${fEpsilon}Testx.log \
pca_knn_bpd2.py --input-trai $sFileTrai \
--input-test $sFileTest \
--output-file $sOut \
--iComponents $iComponents \
--sKernel $sKernel \
--fCValue $fCValue \
--fEpsilon $fEpsilon \
--bPatientPredictionsPkl
conda deactivate
- Run runFinalsubm3_2.sh. This will call run_Final_auto.sh and create the folder resiVecPerPatientSVR_Fold_all for the test subset. But first, you need to edit some things:
  - sDirFeatsTest to point to the folder where you have extracted testing features with the AE, <your-path-to-AE-features>/cis_testing_AE_30ft_orig_inactivity_removed
  - sDirFeatsTrai to point to the folder containing the training data, <your-path-to-AE-features>/AE_30ft_orig_inactivity_removed
  - ivecDim: The i-vector size you are interested in. For the final submission, use ivecDim=650.
- Go to CreateCSV_test.ipynb. We will use the function generateCSVtest_per_patient to create a CSV containing test predictions for all subject_ids.
- Provide the variables best_config, dest_dir, and src_dir. To reproduce the final submission, simply keep the best_config as it is, and replace the paths with yours. The following code shows you exactly what you should use:
best_config = {1004: ['/objs_450_kernel_linear_c_0.002_eps_0.1.pkl', 1.1469489658686098],
1007: ['/objs_100_kernel_linear_c_0.002_eps_0.1.pkl', 0.09115239389591206],
1019: ['/objs_400_kernel_linear_c_0.2_eps_0.1.pkl', 0.686931370820251],
1023: ['/objs_300_kernel_linear_c_0.2_eps_0.1.pkl', 0.8462093717280431],
1034: ['/objs_100_kernel_linear_c_20.0_eps_0.1.pkl', 0.7961188257851409],
1038: ['/objs_450_kernel_linear_c_0.002_eps_0.1.pkl', 0.3530848340426855],
1039: ['/objs_450_kernel_linear_c_0.2_eps_0.1.pkl', 0.3826339325882311],
1043: ['/objs_300_kernel_linear_c_0.2_eps_0.1.pkl', 0.5525085362997469],
1044: ['/objs_50_kernel_linear_c_0.002_eps_0.1.pkl', 0.09694768640213237],
1048: ['/objs_650_kernel_linear_c_0.2_eps_0.1.pkl', 0.4505302952804157],
1049: ['/objs_250_kernel_linear_c_0.2_eps_0.1.pkl', 0.4001809543831368]}
dest_dir='<your-path-to-kaldi>/kaldi/egs/beatPDivec/dysk_noinact_auto30/exp/ivec_650/resiVecPerPatientSVR_Fold_all/'
src_dir='<your-path-to-kaldi>/kaldi/egs/beatPDivec/dysk_noinact_auto30/exp/ivec_650/resiVecPerPatientSVR_Fold_all/'
generateCSVtest_per_patient(src_dir, dest_dir, best_config)
If you want to experiment with other best_config values, the dictionary for best_config is obtained in this file:
cat <your-path-to-kaldi>/kaldi/egs/beatPDivec/dysk_noinact_auto30/exp/ivec_650/globalAccuPerPatientSVR_Test.log
- Run that cell, and it will create a csv file in the provided location dest_dir.
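When comparing best_config entries, it can be convenient to read the settings back out of the pkl file names. The sketch below assumes that the fields in a name such as objs_450_kernel_linear_c_0.002_eps_0.1.pkl are, in order, the number of PCA components, the SVR kernel, the C value and epsilon, mirroring the --iComponents, --sKernel, --fCValue and --fEpsilon flags shown earlier. `SvrConfig` and `parse_config_name` are hypothetical helpers, not part of CreateCSV_test.ipynb.

```python
import re
from typing import NamedTuple


class SvrConfig(NamedTuple):
    components: int   # assumed: number of PCA components
    kernel: str       # assumed: SVR kernel name
    c_value: float    # assumed: SVR C value
    epsilon: float    # assumed: SVR epsilon


_NAME_PATTERN = re.compile(
    r"objs_(\d+)_kernel_([a-z]+)_c_([\d.]+)_eps_([\d.]+)\.pkl$"
)


def parse_config_name(filename):
    """Extract the settings encoded in a per-patient pkl file name."""
    match = _NAME_PATTERN.search(filename)
    if match is None:
        raise ValueError("unexpected file name: " + filename)
    components, kernel, c_value, epsilon = match.groups()
    return SvrConfig(int(components), kernel, float(c_value), float(epsilon))


# Two entries copied from the best_config dictionary above.
example_config = {
    1004: ['/objs_450_kernel_linear_c_0.002_eps_0.1.pkl', 1.1469489658686098],
    1034: ['/objs_100_kernel_linear_c_20.0_eps_0.1.pkl', 0.7961188257851409],
}
parsed = {
    subject_id: parse_config_name(entry[0])
    for subject_id, entry in example_config.items()
}
print(parsed[1004])
```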
For the second and fourth submission, we performed some fusion of the predictions between an SVR and the xgboost.
The code to perform the fusion for the fourth submission is in the notebook called Fusion.ipynb.
- Open the Fusion.ipynb notebook
- Edit the variables:
  - sPathKaldi: Path to your folder containing the kaldi installation <your-path-to-kaldi>. Do not include /kaldi/.
  - sDirOut: Path to where you want to create the final CSV files with predictions
  - sPathData: Path to the folder containing the data we downloaded and extracted from the website of the challenge
The following sections will explain how to do the fusion of REAL-PD subtypes, how to do the fusion for the CIS-PD database between the two approaches, and finally, how to merge the CIS-PD and REAL-PD predictions.
For the REAL-PD database, multiple sensors are provided: phoneacc, watchacc, and watchgyr. We can only submit one value for each measurement, so we used the following method:
- If a measurement has 3 predictions (no missing files), we average the two closest values and discard the third.
- If there are only two predictions for a measurement, we simply average them.
- If there is only one prediction out of the three subtypes, we use that value.
- The fusion of REAL-PD subtypes predictions is made with the real_average_fusion function.
- Execute the cells under the header Fusion for REAL-PD sensors. There will be a cell with the function declared, then 3 others for ON/OFF, Tremor, and dyskinesia.
- It will create three files that can be sent to the challenge as our final submission:
  - submissionRealPDon_off.csv
  - submissionRealPDtremor.csv
  - submissionRealPDdyskinesia.csv
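The three-case rule for combining sensor predictions can be sketched as follows. This is not the real_average_fusion function from Fusion.ipynb, only an illustration of the rule as described; the function name `fuse_sensor_predictions` is hypothetical, and breaking ties in favor of the lower pair is an assumption.

```python
def fuse_sensor_predictions(predictions):
    """Combine the available sensor predictions for one measurement.

    predictions holds between one and three values, one per sensor
    (phoneacc, watchacc, watchgyr) that produced a prediction.
    """
    values = sorted(predictions)
    if len(values) == 1:
        return values[0]
    if len(values) == 2:
        return (values[0] + values[1]) / 2
    if len(values) == 3:
        low, middle, high = values
        # After sorting, the two closest values are always adjacent.
        # Ties go to the lower pair (an assumption of this sketch).
        if middle - low <= high - middle:
            return (low + middle) / 2
        return (middle + high) / 2
    raise ValueError("expected between one and three predictions")


print(fuse_sensor_predictions([1.0, 1.5, 3.0]))
print(fuse_sensor_predictions([2.0, 1.0]))
print(fuse_sensor_predictions([0.5]))
```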
- Execute the cells under the heading "Submission 4 - Average of predictions for Approach 1 and 2 - CIS-PD"
- There will be an output telling you where your predictions file for dyskinesia was created, like so:
Submission file was created: <your-path>/submissionCisPDdyskinesia.csv
- Still in Fusion.ipynb, execute the cells under the heading Merge CIS-PD and REAL-PD predictions in one CSV file. It will create the final submission files for the three subtasks to be sent to the challenge.
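Merging the two databases into one file amounts to concatenating rows under a shared header. The sketch below illustrates this with the standard csv module. The column names measurement_id and prediction are an assumption about the submission template, and `merge_prediction_csvs` is a hypothetical helper rather than the notebook's code.

```python
import csv
import io


def merge_prediction_csvs(cis_csv_text, real_csv_text):
    """Concatenate two prediction CSVs that share the same header."""
    fieldnames = None
    rows = []
    for text in (cis_csv_text, real_csv_text):
        reader = csv.DictReader(io.StringIO(text))
        if fieldnames is None:
            fieldnames = reader.fieldnames
        elif reader.fieldnames != fieldnames:
            raise ValueError("the two files do not share the same columns")
        rows.extend(reader)
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return output.getvalue()


# Hypothetical file contents; the column names are assumed.
cis_text = "measurement_id,prediction\ncis-a,1.0\n"
real_text = "measurement_id,prediction\nreal-b,2.0\n"
merged_text = merge_prediction_csvs(cis_text, real_text)
print(merged_text)
```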
- First generate the files we need. You can do so in the DataAugmentation notebook. (It only generates training files, not files for the test set given by the challenge, as we can't evaluate those results at the moment because the labels are not public.)
- Tsfresh needs scp files containing the path to each training file. These are stored in tsfresh/submit/data/.
- cd tsfresh/submit/
- ./create_scp_files.sh combhpfnoinact.resample_0.9 : This will create the new scp files needed for both training and testing data. The naming convention is the following: cis-pd.training.{argument given}.scp & cis-pd.testing.{argument given}.scp
- Duplicate any run_extract_features*.sh file and edit two variables:
  - Change the recog_set variable to the name of the scp files we just created, like so:
recog_set="cis-pd.training.combhpfnoinact.resample_0.9 cis-pd.testing.combhpfnoinact.resample_0.9"
  - Edit the logdir directory to point to a folder where the jobs can be executed.
logdir=exp/combhpfnoinact.resample_0.9
- Launch the extraction of the features:
qsub -l mem_free=30G,ram_free=30G -pe smp 6 -cwd -e /export/b19/mpgill/errors/errors_run_extract_features_resample_combhpfnoinact_0.9 -o /export/b19/mpgill/outputs/outputs_run_extract_features_resample_combhpfnoinact_0.9 run_extract_features_resample_combhpfnoinact_0.9.sh
- The Biomarker and Endpoint Assessment to Track Parkinson's Disease (BEAT-PD) Challenge
- Christ, M., Braun, N., Neuffer, J. and Kempa-Liehr A.W. (2018). Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh -- A Python package). Neurocomputing 307 (2018) 72-77, doi:10.1016/j.neucom.2018.03.067. GitHub
- Dehak, Najim, et al. "Front-end factor analysis for speaker verification." IEEE Transactions on Audio, Speech, and Language Processing 19.4 (2010): 788-798.