TimEHR

Abstract: Time series in Electronic Health Records (EHRs) present unique challenges for generative models, such as irregular sampling, missing values, and high dimensionality. In this paper, we propose a novel generative adversarial network (GAN) model, TimEHR, to generate time series data from EHRs. In particular, TimEHR treats time series as images and is based on two conditional GANs. The first GAN generates missingness patterns, and the second GAN generates time series values based on the missingness pattern. Experimental results on three real-world EHR datasets show that TimEHR outperforms state-of-the-art methods in terms of fidelity, utility, and privacy metrics.

Installation

Clone the repository, create a virtual environment (venv or conda), and install the required packages using pip:

# clone the repository
git clone https://github.com/hojjatkarami/TimEHR.git
cd TimEHR

# using virtualenv
python3 -m venv test2
source test2/bin/activate

# using conda
conda create --name TimEHR python=3.9.7 --yes
conda activate TimEHR

# install the required packages
pip install -r requirements.txt

Datasets

We used three real-world EHR datasets, as well as simulated data, in our experiments:

| Dataset Name | Size | Number of Features |
|---|---|---|
| PhysioNet/Computing in Cardiology Challenge 2012 | 12k | 35 |
| PhysioNet/Computing in Cardiology Challenge 2019 | 38k | 32 |
| MIMIC-III | 51k | 37 |
| Simulated Data | 10k | 16, 32, 64, 128 |

TimEHR treats time series as images, so the irregularly sampled time series must first be converted to images. Please refer to the data folder for more details on the datasets.


Figure: Converting time series to images.
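
To make the idea of an "image" concrete, here is a minimal sketch of one way to bin irregularly sampled observations onto a fixed grid, producing a value channel and a missingness mask. The function name `series_to_image`, the grid size, and the binning rule are illustrative assumptions, not the preprocessing code in the data folder.

```python
def series_to_image(observations, n_features, n_steps, t_max):
    """Bin (time, feature_index, value) triples into a value grid and a mask grid.

    Hypothetical helper: the grid shape and binning rule are assumptions
    made for illustration only.
    """
    values = [[0.0] * n_features for _ in range(n_steps)]
    mask = [[0] * n_features for _ in range(n_steps)]
    for t, feat, val in observations:
        # Map the timestamp to a row; clamp t == t_max into the last row.
        row = min(int(t / t_max * n_steps), n_steps - 1)
        # If two observations share a cell, the later list entry wins.
        values[row][feat] = val
        mask[row][feat] = 1
    return values, mask


# Three observations over a 48-hour stay, two features, four time bins.
obs = [(0.5, 0, 80.0), (13.2, 1, 36.6), (47.9, 0, 92.0)]
values, mask = series_to_image(obs, n_features=2, n_steps=4, t_max=48.0)
```

Cells with `mask == 0` hold no measurement; the `0.0` stored there is only a placeholder.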

Quick Start

We use the hydra-core library to manage all configuration parameters. You can change them in configs/config.yaml.

We highly recommend using wandb for logging and tracking experiments. Get your API key from wandb, then create a .env file in the root directory and add the following line:

WANDB_API_KEY=your_api_key
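
wandb reads the `WANDB_API_KEY` environment variable, so the .env file only needs to be loaded into the process environment before logging starts. The sketch below shows one stdlib-only way to do that; `load_env_file` is a hypothetical helper, and the repository may load the file differently.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=value lines from a .env file into os.environ (hypothetical helper)."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip().strip('"').strip("'")
    for key, value in loaded.items():
        # Do not override variables that are already set in the environment.
        os.environ.setdefault(key, value)
    return loaded
```

Calling `load_env_file()` from the repository root returns the parsed pairs and leaves any already-exported variables untouched.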

Training


Figure: Training procedure.

The following command trains the model and generates synthetic time series for P12-split0. Prepare the data in the data folder before running it:

python train.py

This trains the TimEHR modules (CWGAN-GP and Pix2Pix) with the default configuration (P12 dataset, split0) and prints the generated dataframe. The modules are saved locally in the Results/{dataset}-s{split}/[CWGAN|Pix2Pix]/ folder as well as on the wandb servers (account_name/[CWGAN|PIXGAN]).
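
As a rough picture of how the two modules fit together at generation time, the sketch below chains two stand-in samplers: one draws a missingness pattern, the other draws values given that pattern, and the results are combined so that unobserved cells become NaN. Both samplers are random placeholders, not the trained CWGAN-GP or Pix2Pix models, and every function name here is hypothetical.

```python
import math
import random


def sample_mask(n_steps, n_features, p_observed, rng):
    """Stand-in for the first GAN: draws a random binary missingness pattern."""
    return [[1 if rng.random() < p_observed else 0 for _ in range(n_features)]
            for _ in range(n_steps)]


def sample_values(mask, rng):
    """Stand-in for the second GAN: draws one value per cell, given the mask."""
    return [[rng.gauss(0.0, 1.0) for _ in row] for row in mask]


def combine(values, mask):
    """Keep a value where the mask is 1 and mark the cell as missing (NaN) elsewhere."""
    return [[v if m == 1 else math.nan for v, m in zip(vrow, mrow)]
            for vrow, mrow in zip(values, mask)]


rng = random.Random(0)
mask = sample_mask(n_steps=4, n_features=3, p_observed=0.3, rng=rng)
values = sample_values(mask, rng)
series = combine(values, mask)
```

In this toy version the second stage ignores the content of the mask; in TimEHR the value generator is conditioned on it.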

Evaluation


Figure: Evaluation procedure.

python eval.py Results/p12-s0

This generates and evaluates synthetic time series using the trained models in the Results/p12-s0 folder, and saves the results to the wandb project TimEHR-Eval as well as locally in the Results/p12-s0/TimEHR-Eval folder.

For a more in-depth tutorial on how to train, generate, evaluate, and visualize the synthetic data, please check out our notebook Tutorial.ipynb.

Replication of the results in the paper

To replicate the results in the paper, please follow the steps below:

  1. Run the following commands:
    python train.py -m data=p12 split=0,1,2,3,4
    python train.py -m data=mimic split=0,1,2,3,4
    python train.py -m data=p19 split=0,1,2,3,4 pix2pix.lambda_l1=100
    
  2. Run python eval.py Results/{dataset}-s{split} for each trained model to evaluate it. The results will be saved to the wandb dashboard (account_name/TimEHR-Eval).
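
The evaluation step has to be repeated for every dataset and split, which comes to 15 runs for the commands above. A minimal sketch of a helper that builds those commands follows. It assumes the result folders follow the Results/{dataset}-s{split} pattern with the dataset names used in the training commands (`p12`, `mimic`, `p19`); `eval_commands` is a hypothetical name.

```python
import shlex
from itertools import product

# Dataset names and splits taken from the training commands; the folder
# naming for mimic and p19 is assumed to match the p12 example.
DATASETS = ["p12", "mimic", "p19"]
SPLITS = range(5)


def eval_commands(datasets=DATASETS, splits=SPLITS):
    """Build one `python eval.py <run_dir>` command per dataset/split pair."""
    commands = []
    for dataset, split in product(datasets, splits):
        run_dir = f"Results/{dataset}-s{split}"
        commands.append(["python", "eval.py", run_dir])
    return commands


for cmd in eval_commands():
    print(shlex.join(cmd))
```

To execute the runs instead of printing them, each command list can be passed to `subprocess.run(cmd, check=True)`.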

Citation

If you find this repo useful, please cite our paper:

@article{karami2024timehr,
  title={TimEHR: Image-based Time Series Generation for Electronic Health Records},
  author={Karami, Hojjat and Hartley, Mary-Anne and Atienza, David and Ionescu, Anisoara},
  journal={arXiv preprint arXiv:2402.06318},
  year={2024}
}