This is a repository of code for data processing/analysis accompanying the above paper.
The processed data files are several gigabytes in size and are not included in this repository. However, you should be able to reproduce them exactly by re-running the data processing pipeline.
Be aware that some data processing and analysis steps may take on the order of days to complete, depending on your machine. Some portions have been parallelized with the R parallel package to improve performance.
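As a rough illustration of the kind of parallelization the parallel package provides, here is a minimal sketch. It is not taken from the repository's scripts: slow_summary and patients are hypothetical stand-ins for an expensive per-patient computation and its inputs, and the worker count is an arbitrary choice.

```r
library(parallel)

# Hypothetical stand-in for an expensive per-patient computation.
slow_summary <- function(x) {
  mean(x^2)
}

# Hypothetical input: one numeric vector per patient.
patients <- lapply(1:8, function(i) seq_len(i))

# Use at most two workers in this sketch; adjust to your machine.
# detectCores() can return NA, hence na.rm = TRUE.
n_workers <- max(1L, min(2L, detectCores(), na.rm = TRUE))

# A socket cluster works on Windows as well as Unix-like systems.
cl <- makeCluster(n_workers)
results <- tryCatch(
  parLapply(cl, patients, slow_summary),
  finally = stopCluster(cl)  # always release the workers
)
```

parLapply returns a list in the same order as its input, so results can be used as a drop-in replacement for the output of lapply.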
EDITED (January 21, 2021): Refactored the directory structure of the code into something more sensible, corrected some instructions, and added documentation. Note that using Keras/TensorFlow from RStudio is not recommended. It is more advisable to load your data into Python using the rpy2 package and to work with Keras/TensorFlow or PyTorch directly in Python. However, for the sake of completeness, the R code for running the neural network models is included in this repository.
Data is stored in the data/mimic/ folder.
Scripts located in: src/R/data_processing/mimic
These scripts should be run in order (and have filenames which are numbered accordingly), as subsequent scripts may depend on the output of those earlier in the pipeline.
- query_mimic.R - Queries chart data
- infection_antibiotics_cultures - Computes suspected infection using orders for antibiotics and cultures
- read_clinical_data_v2.R - Generates clinical data tables
- eval_sepsis3_mimic_v3.R - Evaluates Sepsis-3 clinical labels
- eval_sepsis2_mimic.R - Evaluates Sepsis-2 (based on SIRS criteria) clinical labels
- is_adult.R - Determines which patients are adults
- generate_test_tables.R - Generates data tables for evaluating early prediction
- generate_reference_data_mimic_2.R - Generates training data tables for cross-database validation
- generate_lstm_reference_data.R - Generates training data tables for recurrent neural networks
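Running the scripts in order can be automated with a small helper. The following is a sketch only: run_pipeline is a hypothetical function that is not part of this repository, and it assumes the numeric filename prefixes are zero-padded so that alphabetical order matches the intended run order.

```r
# Hypothetical helper: source every .R file in `dir` in filename order.
run_pipeline <- function(dir) {
  scripts <- sort(list.files(dir, pattern = "\\.R$", full.names = TRUE))
  for (script in scripts) {
    message("Running ", basename(script))
    # Assumption: each script reads its inputs from files written by earlier
    # steps, so it is evaluated in a fresh environment rather than sharing
    # variables with the previous script.
    source(script, local = new.env(parent = globalenv()))
  }
  # Return the script names in the order they were run.
  invisible(basename(scripts))
}
```

For the MIMIC pipeline this would be called as run_pipeline("src/R/data_processing/mimic") from the repository root. If the scripts turn out to rely on variables left behind by earlier scripts, replace the fresh environment with local = FALSE.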
Data is stored in the data/eicu/ folder.
Scripts located in: src/R/data_processing/eicu
These scripts should be run in order (and have filenames which are numbered accordingly), as subsequent scripts depend on the output of those earlier in the pipeline.
- query_suspected_infection_eicu.R - Queries the eICU postgres database for ICD-9 codes
- analyze_suspected_infection_icd9_eicu.R - Determines which ICD-9 codes are indicative of suspected infection according to Angus et al.
- generate_clinical_tables_eicu.R - Queries the eICU postgres database and generates data tables
- eval_sepsis3_eicu.R - Evaluates Sepsis-3 criteria
- generate_eicu_reference.R
- generate_eicu_test_tables - Generates testing data tables for cross-database validation
- generate_eicu_lstm_test_2.R
Scripts located in: src/R/analysis
Checkpoint files produced by Keras will be stored in the checkpoints folder, and results in the results folder.
- final_concomitant_combined.R - produces results for glm/cox/xgboost
- final_concomitant_gru_replicates.R - produces results for RNN
RMarkdown notebooks for generating and displaying figures are located in: src/R/notebooks
- figure_1.Rmd
- figure_2.Rmd
- figure_3.Rmd
- figure_4.Rmd
- figure_5.Rmd
- figure_6.Rmd
Scripts located in: src/R/analysis
Models are trained on MIMIC-3 and tested on eICU.
- analyze_cross_database_5.R - Validation for glm/xgboost
- final_icd9_gru_eicu_2 - Validation for RNN