
MIMIC-III Benchmarks

Python suite to construct benchmark machine learning datasets from the MIMIC-III clinical database. Currently, we are focused on building a multitask learning benchmark dataset that includes four key inpatient clinical prediction tasks that map onto core machine learning problems: prediction of mortality from early admission data (classification), real-time detection of decompensation (time series classification), forecasting length of stay (regression), and phenotype classification (multilabel sequence classification).

News

  • 2017 March 23: We are pleased to announce the first official release of these benchmarks. We expect to release a revision within the coming months that will add at least ~50 additional input variables. We are likewise pleased to announce that the manuscript associated with these benchmarks is now available on arXiv.

Citation

If you use this code or these benchmarks in your research, please cite the following publication: Hrayr Harutyunyan, Hrant Khachatrian, David C. Kale, and Aram Galstyan. Multitask Learning and Benchmarking with Clinical Time Series Data. arXiv:1703.07771. This paper is currently under review for SIGKDD; if accepted, the citation will change. Please be sure to also cite the original MIMIC-III paper.

Motivation

Despite rapid growth in research that applies machine learning to clinical data, progress in the field appears far less dramatic than in other applications of machine learning. In image recognition, for example, the winning error rates in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) plummeted almost 90% from 2010 (0.2819) to 2016 (0.02991). There are many reasonable explanations for this discrepancy: clinical data sets are inherently noisy and uncertain and often small relative to their complexity, and for many problems of interest, ground truth labels for training and evaluation are unavailable.

However, there is another, simpler explanation: practical progress has been difficult to measure due to the absence of community benchmarks like ImageNet. Such benchmarks play an important role in accelerating progress in machine learning research. For one, they focus the community on specific problems and stoke ongoing debate about what those problems should be. They also reduce the startup overhead for researchers moving into a new area. Finally, and perhaps most important, benchmarks facilitate reproducibility and direct comparison of competing ideas.

Here we present four public benchmarks for machine learning researchers interested in health care, built using data from the publicly available Medical Information Mart for Intensive Care (MIMIC-III) database (paper, website). Our four clinical prediction tasks are critical care variants of the four opportunities to transform health care using "big clinical data" described in Bates et al., 2014:

  • early triage and risk assessment, i.e., mortality prediction
  • prediction of physiologic decompensation
  • identification of high-cost patients, i.e., length of stay forecasting
  • characterization of complex, multi-system diseases, i.e., acute care phenotyping

In Harutyunyan, Khachatrian, Kale, and Galstyan 2017, we propose a multitask RNN architecture to solve these four tasks simultaneously and show that this model generally outperforms strong single task baselines.

Requirements

We do not provide the MIMIC-III data itself. You must acquire the data yourself from https://mimic.physionet.org/; specifically, download the CSVs. Our code makes liberal use of the following packages:

  • numpy
  • pandas

The logistic regression baselines require sklearn. The LSTM models use Keras.

Building a benchmark

Here are the required steps to build the benchmark. They assume that you already have the MIMIC-III dataset (lots of CSV files) on disk.

  1. Clone the repo.

    git clone https://github.com/YerevaNN/mimic3-benchmarks/
    cd mimic3-benchmarks/
    
  2. Add the path to the PYTHONPATH (sorry for this).

    export PYTHONPATH=$PYTHONPATH:[PATH TO THIS REPOSITORY]
    
  3. The following command takes the MIMIC-III CSVs, generates one directory per SUBJECT_ID, and writes ICU stay information to data/[SUBJECT_ID]/stays.csv, diagnoses to data/[SUBJECT_ID]/diagnoses.csv, and events to data/[SUBJECT_ID]/events.csv (a sketch after this list shows one way to load these files). This step might take around an hour.

    python scripts/extract_subjects.py [PATH TO MIMIC-III CSVs] data/root/
    
  4. The following command attempts to fix some issues (e.g., a missing ICU stay ID) and removes events that have missing information. 4741761 events (80%) remain after removing all suspicious rows.

    python scripts/validate_events.py data/root/
    
  5. The next command breaks up per-subject data into separate episodes (pertaining to ICU stays). Time series of events are stored in [SUBJECT_ID]/episode{#}_timeseries.csv (where # counts distinct episodes) while episode-level information (patient age, gender, ethnicity, height, weight) and outcomes (mortality, length of stay, diagnoses) are stored in [SUBJECT_ID]/episode{#}.csv. This script requires two files: one that maps event ITEMIDs to clinical variables and another that defines valid ranges for clinical variables (for detecting outliers, etc.).

    python scripts/extract_episodes_from_subjects.py data/root/
    
  6. The next command splits the whole dataset into training and testing sets. Note that all benchmarks use the same split:

    python scripts/split_train_and_test.py data/root/
    
  7. The following commands will generate task-specific datasets, which can later be used in models. These commands are independent; if you are going to work on only one benchmark task, you can run only the corresponding command.

    python scripts/create_in_hospital_mortality.py data/root/ data/in-hospital-mortality/
    python scripts/create_decompensation.py data/root/ data/decompensation/
    python scripts/create_length_of_stay.py data/root/ data/length-of-stay/
    python scripts/create_phenotyping.py data/root/ data/phenotyping/
    python scripts/create_multitask.py data/root/ data/multitask/
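
All of the generated files are plain CSVs, so you can inspect them directly with pandas. Here is a minimal sketch; the file names follow steps 3-5 above, but the subject ID 12345 is hypothetical (substitute any directory the scripts created under data/root/):

    import os
    import pandas as pd

    # 12345 is a hypothetical SUBJECT_ID; use any directory under data/root/.
    subject_dir = os.path.join('data', 'root', '12345')

    # Per-subject tables written by steps 3-4.
    stays = pd.read_csv(os.path.join(subject_dir, 'stays.csv'))
    diagnoses = pd.read_csv(os.path.join(subject_dir, 'diagnoses.csv'))
    events = pd.read_csv(os.path.join(subject_dir, 'events.csv'))

    # Episode-level outcomes and the matching time series written by step 5.
    episode = pd.read_csv(os.path.join(subject_dir, 'episode1.csv'))
    timeseries = pd.read_csv(os.path.join(subject_dir, 'episode1_timeseries.csv'))

    print(len(stays), 'stays;', len(events), 'events;', timeseries.shape)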
    

Working with baseline models

For each of the four main tasks we provide logistic regression and LSTM baselines. Please note that running the linear models can take hours because of the extensive grid search. You can change the chunk_size parameter in the code to make them faster (of course, the performance will not be the same).

Train / validation split

Use the following command to extract a validation set from the training set. This step is required for running the baseline models.

   python mimic3models/split_train_val.py [TASK]

[TASK] is either in-hospital-mortality, decompensation, length-of-stay, phenotyping, or multitask.

In-hospital mortality prediction

Run the following command to train the neural network which gives the best result. We got the best performance on the validation set after 8 epochs.

   cd mimic3models/in_hospital_mortality/
   python -u main.py --network lstm --dim 256 --timestep 2.0 --mode train --batch_size 8 --log_every 30        
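
For orientation, here is a minimal Keras sketch of an LSTM binary classifier in the spirit of this baseline. The hidden size follows the --dim flag above, but the input shape and layer choices are illustrative assumptions, not the exact mimic3models architecture:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Masking, LSTM, Dense

    # Assumed shapes: a 48-hour window at --timestep 2.0 gives 24 steps;
    # the true feature count depends on how the data is discretized.
    n_steps, n_features = 24, 76

    model = Sequential()
    model.add(Masking(input_shape=(n_steps, n_features)))  # skip padded steps
    model.add(LSTM(256))                                   # --dim 256
    model.add(Dense(1, activation='sigmoid'))              # P(in-hospital death)
    model.compile(optimizer='adam', loss='binary_crossentropy')

    # Random stand-ins for real discretized episodes and labels.
    X = np.random.rand(32, n_steps, n_features)
    y = np.random.randint(0, 2, size=32)
    model.fit(X, y, batch_size=8, epochs=1)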

To test the model, use the following:

   python -u main.py --network lstm --dim 256 --timestep 2.0 --mode test --batch_size 8 --log_every 30 --load_state best_model.state

Use the following command to train logistic regression. The best model we got used L2 regularization with C=0.001:

   cd mimic3models/in_hospital_mortality/logistic/
   python -u main.py --l2 --C 0.001
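
A minimal scikit-learn sketch of what those flags correspond to; the feature matrix here is a random stand-in for the hand-engineered episode features the script actually extracts, and its dimensionality is arbitrary:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Random stand-ins: one feature row and one binary in-hospital
    # mortality label per ICU stay.
    X = np.random.rand(100, 50)
    y = np.random.randint(0, 2, size=100)

    clf = LogisticRegression(penalty='l2', C=0.001)  # --l2 --C 0.001
    clf.fit(X, y)
    mortality_probs = clf.predict_proba(X)[:, 1]  # predicted probabilities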

Decompensation prediction

The best model we got for this task was trained for 110 chunks (that's less than one epoch; it overfits before reaching one epoch because there are many training samples for the same patient with different lengths).

   cd mimic3models/decompensation/
   python -u main.py --network lstm --dim 256 --mode train --batch_size 8 --log_every 30

Here is the command to test:

   python -u main.py --network lstm --dim 256 --mode test --batch_size 8 --log_every 30 --load_state best_model.state

Use the following command to train a logistic regression. It will do a grid search over a small space of hyperparameters and will report the scores for every case.

   cd mimic3models/decompensation/logistic/
   python -u main.py
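
A minimal sketch of such a grid search with scikit-learn; the grid values below are assumptions for illustration, and the script's actual hyperparameter space may differ:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # Random stand-ins for the extracted decompensation features and labels.
    X = np.random.rand(200, 50)
    y = np.random.randint(0, 2, size=200)

    # Sweep a small grid of regularization strengths, scoring by AUC-ROC.
    param_grid = {'C': [0.001, 0.01, 0.1, 1.0]}
    search = GridSearchCV(LogisticRegression(penalty='l2'), param_grid,
                          scoring='roc_auc', cv=3)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)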

Length of stay prediction

The best model we got for this task was trained for 15 chunks.

   cd mimic3models/length_of_stay/
   python -u main.py --network lstm_cf_custom --dim 256 --mode train --batch_size 8 --log_every 30

Run the following command to test the best pretrained neural network.

   python -u main.py --network lstm_cf_custom --dim 256 --mode test --batch_size 8 --log_every 30 --load_state best_model.state

Use the following command to train a logistic regression. It will do a grid search over a small space of hyperparameters and will report the scores for every case.

   cd mimic3models/length_of_stay/logistic/
   python -u main_cf.py

Phenotype classification

The best model we got for this task was trained for 30 epochs.

   cd mimic3models/phenotyping/
   python -u main.py --network lstm_2layer --dim 512 --mode train --batch_size 8 --log_every 30

Use the following command for testing:

   python -u main.py --network lstm_2layer --dim 512 --mode test --batch_size 8 --log_every 30 --load_state best_model.state
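
Since phenotyping is a multilabel problem, results are naturally summarized with per-label AUC-ROC. Here is a minimal evaluation sketch with scikit-learn; the arrays are random stand-ins for real model outputs, assuming the benchmark's 25 phenotype labels:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Random stand-ins: binary labels and predicted probabilities
    # for 25 phenotypes.
    n_samples, n_labels = 1000, 25
    y_true = np.random.randint(0, 2, size=(n_samples, n_labels))
    y_score = np.random.rand(n_samples, n_labels)

    per_label = [roc_auc_score(y_true[:, k], y_score[:, k])
                 for k in range(n_labels)]
    print('macro-averaged AUC-ROC:', np.mean(per_label))
    print('micro-averaged AUC-ROC:',
          roc_auc_score(y_true.ravel(), y_score.ravel()))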

Use the following command for logistic regression. It will do a grid search over a small space of hyperparameters and will report the scores for every case.

   cd mimic3models/phenotyping/logistic/
   python -u main.py

Multitask learning

The ihm_C, decomp_C, los_C and ph_C coefficients control the relative weight of the tasks in the multitask model. The default for each is 1.0. The best model we got was trained for 12 epochs.

   cd mimic3models/multitask/
   python -u main.py --network lstm --dim 1024 --mode train --batch_size 8 --log_every 30 --ihm_C 0.02 --decomp_C 0.1 --los_C 0.5
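
As a rough illustration of what these coefficients express: each task contributes its own loss term, scaled by its weight, much like Keras's loss_weights mechanism. The sketch below is a simplification under assumed shapes and per-stay (rather than per-timestep) outputs; the real mimic3models network is more involved:

    from keras.models import Model
    from keras.layers import Input, LSTM, Dense

    inp = Input(shape=(None, 76))   # feature count is an assumption
    h = LSTM(1024)(inp)             # --dim 1024

    ihm = Dense(1, activation='sigmoid', name='ihm')(h)        # mortality
    decomp = Dense(1, activation='sigmoid', name='decomp')(h)  # decompensation
    los = Dense(1, activation='relu', name='los')(h)           # length of stay
    ph = Dense(25, activation='sigmoid', name='ph')(h)         # phenotypes

    model = Model(inputs=inp, outputs=[ihm, decomp, los, ph])
    model.compile(optimizer='adam',
                  loss={'ihm': 'binary_crossentropy',
                        'decomp': 'binary_crossentropy',
                        'los': 'mean_squared_error',
                        'ph': 'binary_crossentropy'},
                  # --ihm_C 0.02 --decomp_C 0.1 --los_C 0.5; ph_C keeps 1.0
                  loss_weights={'ihm': 0.02, 'decomp': 0.1,
                                'los': 0.5, 'ph': 1.0})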

Use the following command for testing:

   python -u main.py --network lstm --dim 1024 --mode test --batch_size 8 --log_every 30 --load_state best_model.state

General todos

  • Test and debug
  • Add comments and documentation
  • Refactor, where appropriate, to make code more generally useful
  • Expand coverage of variable map and variable range files
  • Decide whether we are missing any other high-priority data (CPT codes, inputs, etc.)

More on validating results

Here are the problems identified by validate_events.py on 1000 randomly chosen subjects:

    Type                   Description                                                                    Number of rows
    n_events               total number of events                                                                5937206
    nohadminstay           HADM_ID does not appear in stays.csv                                                   836341
    emptyhadm              HADM_ID is empty                                                                       126480
    icustaymissinginstays  ICUSTAY_ID does not appear in stays.csv                                                232624
    noicustay              ICUSTAY_ID is empty                                                                    347768
    recovered              empty ICUSTAY_IDs are recovered according to stays.csv files (given HADM_ID)           347768
    couldnotrecover        empty ICUSTAY_IDs that are not recovered; this should be zero, because the                  0
                           unrecoverable ones are counted in icustaymissinginstays
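
For intuition about the recovered row, here is a pandas sketch of one way such recovery can work: events with an empty ICUSTAY_ID inherit it from stays.csv via their HADM_ID. This is a simplification that assumes each HADM_ID maps to a single ICU stay, and the subject path is hypothetical; the actual logic lives in scripts/validate_events.py:

    import pandas as pd

    events = pd.read_csv('data/root/12345/events.csv')  # hypothetical subject
    stays = pd.read_csv('data/root/12345/stays.csv')

    # Look up the ICUSTAY_ID that stays.csv records for each HADM_ID.
    merged = events.merge(stays[['HADM_ID', 'ICUSTAY_ID']], on='HADM_ID',
                          how='left', suffixes=('', '_FROM_STAYS'))

    # Fill in events whose own ICUSTAY_ID is empty.
    missing = merged['ICUSTAY_ID'].isnull()
    merged.loc[missing, 'ICUSTAY_ID'] = merged.loc[missing,
                                                   'ICUSTAY_ID_FROM_STAYS']
    print('still missing:', merged['ICUSTAY_ID'].isnull().sum())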