A signature-based machine learning model for bipolar disorder and borderline personality disorder
This repository contains the code from the paper "A signature-based machine learning model for distinguishing bipolar disorder and borderline personality disorder":
Perez Arribas, I., Goodwin, G.M., Geddes, J.R., Lyons, T. and Saunders, K.E., 2018. A signature-based machine learning model for distinguishing bipolar disorder and borderline personality disorder. Translational Psychiatry, 8(1), p.274. DOI: 10.1038/s41398-018-0334-0.
Contents
- Reproducible Research Champions
- Data
- Setting up signatures-psychiatry
- Generating figures and tables from the paper
Reproducible Research Champions
In May 2018, Terry Lyons was selected as one of the Alan Turing Institute's Reproducible Research Champions - academics who encourage and promote reproducible research through their own work, and who want to take their latest project to the "next level" of reproducibility.
The Reproducible Research programme at the Turing is led by Kirstie Whitaker and Martin O'Reilly, with the Champions project also involving Louise Bowler from the Research Engineering Group.
Each of the Champions' projects will receive several weeks of support from the Research Engineering Group throughout 2018-2019. Over this time, we will work on the project with Terry and Imanol and will track our efforts in this repository. So far, we've added installation instructions, set the project up to work with Binder, made it possible to use synthetic data in the same workflow and added some tests. Given our focus on reproducibility, we obviously don't intend to change any of the code's core functionality - but we hope that our work, both past and future, will make it easier for you to install, use and test out your own ideas with the methods used in the signatures-psychiatry project.
You can find out more about the Turing's Reproducible Research Champions project here.
Data
The dataset used in the study is confidential, so we are not able to publicly release it. Access to the dataset is restricted to staff and students of the University of Oxford who have received the appropriate permission.
However, we feel that it is important that some similar data are provided for the purposes of demonstrating how the methods outlined in the paper work. To this end, we include three sets of synthetic data, and one example of a fake entry in the dataset.
Mood score data
The original dataset contains the mood scores of participants who were either healthy or were diagnosed with borderline personality disorder or bipolar disorder. Participants recorded their mood score on a seven-point scale across six different categories (anxiety, elation, sadness, anger, irritability and energy) at approximately daily intervals. Further details of this dataset, which was collected as part of the Automated Monitoring of Symptoms Severity (AMoSS) study, can be found in the paper. Access to this dataset is limited and so it is not included in this repository.
Synthetic signature data
Synthetic data is data that has been generated to exhibit the same statistical properties as the original data, without containing the original entries. The synthetic data in this repository was derived from the signatures of the original mood score data, and is therefore in signature form itself. Each dataset contains mood score signatures and their associated diagnostic classification. The synthetic signatures were derived from all mood score signatures from each of the three diagnostic classifications, so the concept of "participant" does not apply when using this dataset. Three synthetic datasets were generated and have been included in the synthetic-data folder.
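This README uses the term "signature" without defining it. As a rough illustration only, the sketch below computes the first two levels of the signature of a piecewise-linear path in plain Python. The function name `signature_depth2` and the toy path are hypothetical and are not part of this repository. The project itself depends on the esig package, which this sketch does not use, and the sketch makes no attempt to reproduce the truncation depth or preprocessing used in the paper.

```python
def signature_depth2(path):
    """Return levels 1 and 2 of the signature of a piecewise-linear path.

    `path` is a list of points, each a tuple of floats of the same length.
    Level 1 is the total increment of the path. Level 2 holds the iterated
    integrals S2[i][j] = integral of (x_i - x_i(0)) dx_j along the path.
    """
    dim = len(path[0])
    level1 = [0.0] * dim
    level2 = [[0.0] * dim for _ in range(dim)]
    for start, end in zip(path[:-1], path[1:]):
        step = [b - a for a, b in zip(start, end)]
        for i in range(dim):
            for j in range(dim):
                # Chen's identity for appending one straight segment:
                # cross term with the increment so far, plus the segment's
                # own second-level term.
                level2[i][j] += level1[i] * step[j] + 0.5 * step[i] * step[j]
        # Update the running increment only after level 2 has used it.
        for i in range(dim):
            level1[i] += step[i]
    return level1, level2


# Toy two-dimensional path: one unit right, then one unit up.
toy_path = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
level1, level2 = signature_depth2(toy_path)
print(level1)
print(level2)
```

For any path, `level2[i][j] + level2[j][i]` equals `level1[i] * level1[j]`, which is a quick sanity check on an implementation like this one. The antisymmetric part, `0.5 * (level2[i][j] - level2[j][i])`, is the signed area swept out between coordinates `i` and `j`.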
Fake mood score data
This repository contains an example of a fake mood score dataset from one participant in data/fake_patient.csv. This data is not statistically related to the original mood score data but is presented in the same format in order to illustrate the data normalisation process and how Figures 1 and 2 in the paper were generated.
Setting up signatures-psychiatry
The instructions below assume you are comfortable cloning a git repository and running Python scripts via the command line.
If not, you may find the tutorials available from GitHub and Software Carpentry helpful. You can also open an interactive version of this project on mybinder.org by clicking the repository's Binder badge.
Once the Binder project loads, open a Python 2 console and skip down this page to "Generating figures and tables from the paper".
Begin by obtaining a copy of this repository using
git clone https://github.com/alan-turing-institute/signatures-psychiatry.git
and move into the directory
cd signatures-psychiatry
This project uses Python 2.7 and the packages listed in requirements.txt.
Let's start by setting up a virtual environment and installing the dependencies inside it.
We give examples here for using virtualenv and conda.
If you use CPython, use virtualenv to set up a virtual environment named env.
Activate the environment and install the packages required by this project using pip.
virtualenv env # Use if your default Python version is 2.7
virtualenv --python=<path-to-your-python-2.7> env # If not, specify the path manually
source env/bin/activate
pip install -r requirements.txt
If you use Anaconda, create the virtual environment with conda and install the pip package directly into the environment.
Once the environment is activated, the dependencies can be installed using pip (we use pip because the esig package is not currently available through conda).
conda create -n sig-psy python=2.7 pip
conda activate sig-psy
pip install -r requirements.txt
With the virtual environment set up and all the dependencies installed, we can use the scripts in this project by following the instructions below.
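Before running the scripts, it can help to confirm that the active environment matches what the setup steps describe. The snippet below is a hypothetical helper, not a file in this repository. It checks the interpreter version against Python 2.7 and reports which of a list of modules cannot be imported. Only esig is named in this README, so it is the only module listed. The import names of the other entries in requirements.txt are not assumed here.

```python
import importlib
import sys


def is_supported_python(version_info):
    """Return True when the interpreter is Python 2.7, as this README requires."""
    return tuple(version_info[:2]) == (2, 7)


def missing_modules(module_names):
    """Return the names from `module_names` that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if not is_supported_python(sys.version_info):
    print("Warning: this project expects Python 2.7")

# Extend this list with the import names of the packages in requirements.txt.
print(missing_modules(["esig"]))
```

An empty list printed on the last line means every listed module was importable in the current environment.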
Generating figures and tables from the paper
If you are running this project via Binder (or any other Jupyter Lab installation), open a Python 2 console. The commands need a minor change in this environment - swap python for %run, and use the shift+enter keys to run the cell.
Table 1: Accuracy and area under the ROC curve
If you have access to the full dataset, run
python pairwise_group_classification.py
To run the same analysis on the synthetic signatures, use
python pairwise_group_classification.py --synth
By default, the synthetic cohort 772192 will be used, or the cohort ID can be specified with --synth=<cohort-id>.
The random seed for this script is set by default to 83042, or it can be changed using --seed=<random-seed>.
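The `--synth` option can be given bare or with a cohort ID. One way to get that behaviour with the standard library's `argparse` is an optional-value flag, sketched below. This is an assumption about how such an interface could be written, not the parser used in pairwise_group_classification.py. The function name `build_parser` and the constant names are hypothetical, and only the default values are taken from this README.

```python
import argparse

DEFAULT_COHORT = 772192  # default synthetic cohort named in this README
DEFAULT_SEED = 83042     # default random seed named in this README


def build_parser():
    """Build a hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(description="Hypothetical option parser")
    # nargs="?" allows a bare --synth (falls back to const) as well as
    # --synth=<cohort-id>. Without the flag, the value stays None.
    parser.add_argument("--synth", nargs="?", type=int, const=DEFAULT_COHORT,
                        default=None, help="use synthetic signatures")
    parser.add_argument("--seed", type=int, default=DEFAULT_SEED,
                        help="random seed")
    return parser


parser = build_parser()
print(parser.parse_args([]))
print(parser.parse_args(["--synth"]))
print(parser.parse_args(["--synth=239673", "--seed=100"]))
```

In this sketch a `synth` value of `None` stands for "use the original data", so the calling code can branch on it without a separate boolean flag.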
The pairwise values will be displayed in the terminal and also saved to a log file, log/mood_prediction.log.
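For readers unfamiliar with the second metric in Table 1, the area under the ROC curve for a two-class problem equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. The function name `roc_auc` and the example numbers are illustrative and are not taken from the paper or from this repository's code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 0 (negative) or 1 (positive). `scores` holds the
    classifier's confidence that each item belongs to the positive class.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, for illustration only.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

The double loop is quadratic in the number of examples, which is fine for a small illustration. A rank-based formula gives the same value more efficiently on larger datasets.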
Table 2: Demographic characteristics of study participants
The content of this table was gathered manually.
Figure 1: Normalised anxiety scores of a sample participant
This repository contains a set of "fake" data from a single patient. To convert this dataset into the normalised format shown in Figure 1, run
python plot_path.py
Figure 2: Pairwise normalised mood scores of a sample participant
The same command used above also produces the pairwise mood score plots for the fake patient.
python plot_path.py
Figure 3 (top row):
The heat maps in Figure 3 can be produced from either the original data or the synthetic signatures. To use the original data, run
python heat_map.py
and to use the synthetic signatures, use
python heat_map.py --synth
Alternatives to the default cohort of 772192 and random seed of 1 can be set with, for example, the options --synth=239673 --seed=100.
The script will save three figures in .png format. The names of the figures and the settings can be found in the heat_map.log file in the log folder.
Figure 3 (bottom row): Accuracy and MAE of predictions of future mood score
The lower row of plots in Figure 3 requires mood score data, so we cannot use the synthetic signature data here.
Table 3: Summary of accuracy and MAE of predictions of each aspect of next-day mood score
To generate the accuracy and MAE scores, run
python mood_prediction.py
As the next-day mood score is required, this script cannot be applied to the synthetic data, which is already in signature form.
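As a reminder of what the two summary statistics measure, the sketch below computes the mean absolute error (MAE) and an accuracy score for a list of predicted mood scores. It is not the code in mood_prediction.py. The function names and example scores are made up, and treating a prediction as correct only when it matches the recorded score exactly is an assumption here. The paper should be consulted for the definition it actually uses.

```python
def mean_absolute_error(true_scores, predicted_scores):
    """Average absolute difference between recorded and predicted scores."""
    errors = [abs(t - p) for t, p in zip(true_scores, predicted_scores)]
    return sum(errors) / float(len(errors))


def exact_match_accuracy(true_scores, predicted_scores):
    """Fraction of predictions equal to the recorded score (an assumption)."""
    hits = sum(1 for t, p in zip(true_scores, predicted_scores) if t == p)
    return hits / float(len(true_scores))


# Made-up scores on a seven-point scale, for illustration only.
true_scores = [3, 4, 2, 5]
predicted_scores = [3, 5, 2, 3]
mae = mean_absolute_error(true_scores, predicted_scores)
acc = exact_match_accuracy(true_scores, predicted_scores)
print(mae)
print(acc)
```

MAE rewards predictions that are close even when they are not exact, so the two numbers can diverge: a model that is always off by one point has an exact-match accuracy of 0 and an MAE of 1.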