Generating the imputation dataset for the DACMI 2019 challenge https://ewh.ieee.org/conf/ichi/2019/challenge.html
If you don't want to run the code to regenerate the data, you can find the challenge data in the folder "dacmi_challenge_code_and_data".
The challenge participants' outputs can be found here.
The following R libraries are required:
mice
GPfit
hash
doParallel
foreach
Assume the following root data directory
dn=/home/shared_data/mimic3
Create the following directories
mkdir -p $dn/perSubj/LabEvents
mkdir $dn/perSubj/LabData
mkdir $dn/perSubj/LabView
mkdir $dn/perSubj/LabViewCase
mkdir $dn/imp_data
mkdir $dn/train_groundtruth
mkdir $dn/train_with_missing
mkdir $dn/test_groundtruth
mkdir $dn/test_with_missing
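The same directory layout can also be created from Python, creating missing parent directories and tolerating directories that already exist. This is a minimal sketch; the function name make_dirs is hypothetical and not part of the repository.

```python
import os

# Subdirectories listed above, relative to the root data directory dn.
SUBDIRS = [
    "perSubj/LabEvents",
    "perSubj/LabData",
    "perSubj/LabView",
    "perSubj/LabViewCase",
    "imp_data",
    "train_groundtruth",
    "train_with_missing",
    "test_groundtruth",
    "test_with_missing",
]


def make_dirs(dn):
    """Create every listed subdirectory under dn and return the paths."""
    created = []
    for sub in SUBDIRS:
        path = os.path.join(dn, *sub.split("/"))
        # exist_ok=True makes the call safe to repeat.
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created
```

Calling make_dirs('/home/shared_data/mimic3') would reproduce the mkdir commands above for the assumed root directory.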
First, split lab events in Python (create subdirectories if they don't exist)
import mimic3 as m3
m3.split_mimic3(fn='/home/shared_data/mimic3/LABEVENTS.csv', dnout='/home/shared_data/mimic3/perSubj/LabEvents', pref='le')
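For readers who want to see what a per-subject split involves, here is a minimal sketch. It is not the implementation of m3.split_mimic3, whose behavior may differ. It assumes the input CSV has a SUBJECT_ID column (as LABEVENTS.csv in MIMIC-III does), and the function name split_by_subject and the output naming pattern pref_SUBJECTID.csv are hypothetical.

```python
import csv
import os


def split_by_subject(fn, dnout, pref="le"):
    """Write one CSV per SUBJECT_ID under dnout; return the subject count.

    The output file name pattern <pref>_<SUBJECT_ID>.csv is an assumption
    made for illustration only.
    """
    os.makedirs(dnout, exist_ok=True)
    seen = set()
    with open(fn, newline="") as fin:
        reader = csv.DictReader(fin)
        for row in reader:
            sid = row["SUBJECT_ID"]
            path = os.path.join(dnout, f"{pref}_{sid}.csv")
            is_new = sid not in seen
            # First row of a subject truncates any stale file and writes the
            # header; later rows are appended, so rows of one subject do not
            # need to be contiguous in the input.
            with open(path, "w" if is_new else "a", newline="") as fout:
                writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
                if is_new:
                    writer.writeheader()
                writer.writerow(row)
            seen.add(sid)
    return len(seen)
```

Reopening the output file for every row keeps the sketch short but is slow on a file the size of LABEVENTS.csv; a production version would buffer rows or keep file handles open.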
Then run the following R code:
source('labDataGen.R')
source('labViewGenSeq.R')
source('labViewCase.R')
Assume the inverse signal-to-noise ratio is isnr=0.01, which is used in gpTensorImpValidation.m
Create the following directories
$dn/gpml/log
$dn/gpml/validation/gpml_raw_sample_$isnr
Run the following code in Python
import pscript as ps
dn = '/home/shared_data/mimic3'  # root data directory assumed above
ps.gpmlScript(fnscript='./gpmlMIMICScript.sh', fn=f'{dn}/lvcase.csv', fncf='mimicConfig', bsize=2000)
Run the following in shell
./gpmlMIMICScript.sh
After completion, run the following in shell
cd $dn/gpml/validation/gpml_raw_sample_$isnr
grep 'not enough training' *.err > ../../stgp_warning_ptads2.csv
grep 'non-varying' *.err > ../../stgp_warning_ptads1.csv
cd ../../
Run the following code in R
pt2 = read.csv('stgp_warning_ptads2.csv', header=F)
pt1 = read.csv('stgp_warning_ptads1.csv', header=F)
pt = union(gsub('^.*:', '', pt1$V1), gsub('^.*:', '', pt2$V1))
write.table(pt, file='stgp_warning_ptads.csv', row.names=F, col.names=F, quote=F)
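The R snippet above keeps, for each grep hit, the text after the last colon in the first comma-separated field, then takes the union of the two lists. The following Python sketch mirrors that logic for illustration. The function names and the sample lines are hypothetical; the real format of the .err messages is not documented here.

```python
import re


def ids_from_grep(lines):
    """Mimic gsub('^.*:', '', V1) applied to the first CSV field of each line."""
    ids = []
    for line in lines:
        first_field = line.split(",")[0]
        # Greedy match: everything up to and including the LAST colon is removed.
        ids.append(re.sub(r"^.*:", "", first_field))
    return ids


def union_in_order(a, b):
    """Like R's union(): unique values of a followed by b, first occurrence kept."""
    out = []
    for x in list(a) + list(b):
        if x not in out:
            out.append(x)
    return out


# Hypothetical grep output lines of the form "<file>:<prefix>:<id>,<message>".
non_varying = ["a.err:warn:101,non-varying input", "b.err:warn:102,non-varying input"]
not_enough = ["a.err:warn:102,not enough training", "c.err:warn:103,not enough training"]

excluded = union_in_order(ids_from_grep(non_varying), ids_from_grep(not_enough))
```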
Run the following code in R
source('mimicConfig.R')
source('mimic_csv_gen.R')
The convention for the generated CSV files is that rows correspond to time points and columns correspond to variables.
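To make the row and column convention concrete, the sketch below reads a small table in that layout and extracts the time series of one variable. The sample values, the variable names, the presence of a header row, and the use of NA for missing entries are all assumptions for illustration, not properties confirmed for the generated files.

```python
import csv
import io

# Hypothetical sample: 3 time points (rows) by 3 lab variables (columns).
SAMPLE = """HCT,HGB,WBC
30.1,10.2,7.5
NA,10.0,8.1
29.4,NA,7.9
"""


def read_time_by_variable(text):
    """Return (header, table) where table[t][j] is variable j at time t."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]

    def parse(cell):
        # Treat the string NA as a missing value.
        return None if cell == "NA" else float(cell)

    return header, [[parse(c) for c in row] for row in body]


def series(header, table, variable):
    """Time series of one variable: one value per row (time point)."""
    j = header.index(variable)
    return [row[j] for row in table]


header, table = read_time_by_variable(SAMPLE)
hct_over_time = series(header, table, "HCT")
```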
Before running the code, configure it by adapting and running the code in mimicConfig.R. Please remember to create a subdirectory named "micegp_log" under $dn.
To train, run the following code directly:
source('mimicMICEGPParamEvalTr.R')
or, better, run it as R Markdown:
library(rmarkdown)
render('mimicMICEGPParamEvalTr.R')
This wrapper script calls subroutines that generate the training data, mask missing values, and perform 3D-MICE imputation. Each step is wrapped in its own R source file and should be self-explanatory.
In this wrapper code, nimp specifies how many MICE imputations to perform, and ncores specifies how many cores to use for running the imputations in parallel. nimp should be a multiple of ncores. You can set ncores to a higher or lower value depending on the machine's capacity. On a 20-core machine, this code should run in less than a day.
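The constraint between nimp and ncores can be checked before starting a long run. The sketch below is a hypothetical helper, not part of the repository; it assumes the imputations are divided evenly across cores, which is the natural reading of the multiple-of requirement but is not confirmed by the wrapper code.

```python
def imputations_per_core(nimp, ncores):
    """Validate that nimp is a positive multiple of ncores.

    Returns how many imputations each core would handle under an even split.
    """
    if nimp <= 0 or ncores <= 0:
        raise ValueError("nimp and ncores must be positive integers")
    if nimp % ncores != 0:
        raise ValueError(f"nimp={nimp} is not a multiple of ncores={ncores}")
    return nimp // ncores
```

For example, nimp=40 with ncores=20 passes the check with two imputations per core, while nimp=30 with ncores=20 is rejected.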
For your convenience, we also included the GP package gpml-matlab-v3.5-2014-12-08, but we do not claim any rights or responsibility for that code.
@article{10.1093/bib/bbab489,
author = {Luo, Yuan},
title = "{Evaluating the state of the art in missing data imputation for clinical data}",
journal = {Briefings in Bioinformatics},
year = {2021},
month = {12},
issn = {1477-4054},
doi = {10.1093/bib/bbab489},
url = {https://doi.org/10.1093/bib/bbab489},
}