mimic_imputation

DACMI 2019 challenge on missing clinical data imputation

Generating the imputation dataset for the DACMI 2019 challenge: https://ewh.ieee.org/conf/ichi/2019/challenge.html

If you don't want to run the code to regenerate the data, you can find the challenge data in the folder "dacmi_challenge_code_and_data".

The challenge participants' outputs can be found here.

R Code Dependencies

The R code needs the following libraries (an install sketch follows the list):

mice
GPfit
hash
doParallel
foreach
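
If any of these are missing, they can be installed from CRAN; a minimal sketch, assuming a default CRAN mirror:

# one-time setup: install any missing R dependencies from CRAN
pkgs = c('mice', 'GPfit', 'hash', 'doParallel', 'foreach')
install.packages(setdiff(pkgs, rownames(installed.packages())))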

Generate per-subject, per-admission lab data

Assume the following root data directory

dn = /home/shared_data/mimic3

Create the following directories

mkdir $dn/perSubj/LabEvents
mkdir $dn/perSubj/LabData
mkdir $dn/perSubj/LabView
mkdir $dn/perSubj/LabViewCase
mkdir $dn/imp_data
mkdir $dn/train_groundtruth
mkdir $dn/train_with_missing
mkdir $dn/test_groundtruth
mkdir $dn/test_with_missing

First, split the lab events in Python (create the subdirectories if they don't exist):

import mimic3 as m3

# split LABEVENTS.csv into one file per subject under perSubj/LabEvents, prefixed 'le'
m3.split_mimic3(fn='/home/shared_data/mimic3/LABEVENTS.csv', dnout='/home/shared_data/mimic3/perSubj/LabEvents', pref='le')

Then run the following R code:

source('labDataGen.R')      # generate per-subject lab data (perSubj/LabData)
source('labViewGenSeq.R')   # generate sequential lab views (perSubj/LabView)
source('labViewCase.R')     # generate per-case lab views (perSubj/LabViewCase)

Run single-task Gaussian Process imputation

Assume the inverse signal-to-noise ratio is isnr=0.01; this value is used in gpTensorImpValidation.m.

Create the following directories, substituting the isnr value into the second path (see the R sketch after the list):

$dn/gpml/log
$dn/gpml/validation/gpml_raw_sample_$isnr
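
A minimal sketch for creating these from R, assuming dn and isnr as above (recursive=TRUE behaves like mkdir -p):

dn = '/home/shared_data/mimic3'
isnr = 0.01
dir.create(file.path(dn, 'gpml', 'log'), recursive=TRUE)
dir.create(file.path(dn, 'gpml', 'validation', sprintf('gpml_raw_sample_%g', isnr)), recursive=TRUE)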

Run the following code in Python

import pscript as ps

dn = '/home/shared_data/mimic3'
# write gpmlMIMICScript.sh, which batches the GP jobs over lvcase.csv (bsize sets the batch size)
ps.gpmlScript(fnscript='./gpmlMIMICScript.sh', fn=f'{dn}/lvcase.csv', fncf='mimicConfig', bsize=2000)

Run the following in shell

./gpmlMIMICScript.sh

After completion, run the following in shell

cd $dn/gpml/validation/gpml_raw_sample_$isnr
# collect the patient admissions whose GP runs reported warnings
grep 'not enough training' *.err > ../../stgp_warning_ptads2.csv
grep 'non-varying' *.err > ../../stgp_warning_ptads1.csv
cd ../../

Run the following code in R

# merge the two warning lists into one set of flagged patient admissions;
# gsub strips the 'filename:' prefix that grep prepends to each match
pt2 = read.csv('stgp_warning_ptads2.csv', header=F)
pt1 = read.csv('stgp_warning_ptads1.csv', header=F)
pt = union(gsub('^.*:', '', pt1$V1), gsub('^.*:', '', pt2$V1))
write.table(pt, file='stgp_warning_ptads.csv', row.names=F, col.names=F, quote=F)

Generate the data and split train and test sets

Run the following code in R

source('mimicConfig.R')
source('mimic_csv_gen.R')

The convention for the generated csv files is that rows correspond to time points and columns correspond to variables.
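
A minimal sketch of reading one generated file under this convention (the filename here is illustrative):

dn = '/home/shared_data/mimic3'
# rows are time points, columns are lab variables
x = read.csv(file.path(dn, 'train_with_missing', '1.csv'))
dim(x)  # (number of time points) x (number of variables)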

Run 3D-MICE

Before running the code, configure the pipeline by adapting and running the code in mimicConfig.R. Please remember to create a subdirectory named "micegp_log" under $dn.

To train, run the following code directly:

source('mimicMICEGPParamEvalTr.R')

or, better, render it as an R Markdown report:

library(rmarkdown)
render('mimicMICEGPParamEvalTr.R')

This is a wrapper that calls various subroutines to generate the training data, mask missing values, and perform the 3D-MICE imputation; each step is wrapped in its own R source file and should be self-explanatory.

In this wrapper, nimp specifies how many MICE imputations to perform and ncores specifies how many cores to parallelize them across; nimp should be a multiple of ncores. You can set ncores higher or lower depending on your machine's capacity. On a 20-core machine, this code should run in less than a day.
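
A minimal sketch of this nimp/ncores pattern (illustrative values only; the actual settings live in mimicMICEGPParamEvalTr.R):

library(doParallel)
library(foreach)

nimp = 40     # number of MICE imputations (hypothetical value)
ncores = 20   # worker processes; nimp should be a multiple of ncores
stopifnot(nimp %% ncores == 0)

cl = makeCluster(ncores)
registerDoParallel(cl)
# each worker handles nimp / ncores imputations
res = foreach(i = seq_len(nimp), .packages = 'mice') %dopar% {
    i  # placeholder for one MICE imputation run
}
stopCluster(cl)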

For your convenience, we also included the GP package gpml-matlab-v3.5-2014-12-08, but we do not claim any rights or responsibility for that code.

Citation

@article{10.1093/bib/bbab489,
    author = {Luo, Yuan},
    title = "{Evaluating the state of the art in missing data imputation for clinical data}",
    journal = {Briefings in Bioinformatics},
    year = {2021},
    month = {12},
    issn = {1477-4054},
    doi = {10.1093/bib/bbab489},
    url = {https://doi.org/10.1093/bib/bbab489},
}