This repo contains the code for MIMIC-Extract. It is organized into the following folders:
- Data: will locally contain the extracted datasets.
- Notebooks: Jupyter Notebooks demonstrating test cases and usage of output data in risk and intervention prediction tasks.
- Resources: consists of:
  - Rohit_itemid.txt: describes the correspondence between MIMIC-III item ids and the MIMIC-II item ids used by Rohit.
  - itemid_to_variable_map.csv: the main file used in data extraction; defines the groupings of item ids and indicates which item ids are ready to extract.
  - variable_ranges.csv: describes the normal value ranges of variables at each grouping level, which helps ensure that only plausible values are extracted.
  - The expected schema of the output tables.
- Utils: scripts and detailed instructions for running the MIMIC-Extract data pipeline.
- mimic_direct_extract.py: the extraction script.
If you use this code in your research, please cite the following publication:
Shirly Wang, Matthew B. A. McDermott, Geeticka Chauhan, Michael C. Hughes, Tristan Naumann,
and Marzyeh Ghassemi. MIMIC-Extract: A Data Extraction, Preprocessing, and Representation
Pipeline for MIMIC-III. arXiv:1907.08322.
The pipeline is run in the following steps:
- Step 0: Required software and prereqs
- Step 1: Setup env vars for local system
- Step 2: Create conda environment
- Step 3: Build Views for Feature Extraction
- Step 4: Set Cohort Selection and Extraction Criteria
- Step 5: Build Curated Dataset from PostgreSQL
## Step 0: Required software and prereqs

Your local system should have the following executables on the PATH:
- conda
- psql (PostgreSQL 9.4 or higher)
- git
- a local MIMIC-III PostgreSQL relational database (refer to the MIT-LCP repo)
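Before proceeding, one can confirm these are visible with a quick shell check (a minimal sketch; it only verifies that the tools listed above are on the PATH, not that the database itself is loaded):

```bash
# Check that the required tools are on the PATH.
for cmd in conda psql git; do
  command -v "$cmd" >/dev/null || echo "missing: $cmd"
done
psql --version   # should report PostgreSQL 9.4 or higher
```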
## Step 1: Setup env vars for local system

All instructions below should be executed from a terminal, with the current directory set to utils/:

```bash
cd utils/
```
Edit setup_user_env.sh so all paths point to valid locations on your local file system, then export those variables:

```bash
source ./setup_user_env.sh
```
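For orientation, here is a sketch of the kind of exports setup_user_env.sh is expected to provide. Only MIMIC_EXTRACT_OUTPUT_DIR is referenced by name later in these instructions; the database-connection names below are hypothetical placeholders, so match whatever your copy of the script actually defines:

```bash
# Illustrative values only -- edit to match your local system.
export MIMIC_EXTRACT_OUTPUT_DIR="$HOME/mimic_extract/output"  # where the hdf5 output is written
export DBUSER="mimicuser"   # hypothetical: PostgreSQL user with access to MIMIC-III
export DBNAME="mimic"       # hypothetical: database containing the MIMIC-III tables
```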
## Step 2: Create conda environment

Next, create a new conda environment from mimic_extract_env.yml and activate it:

```bash
conda env create --force -f ../mimic_extract_env.yml
conda activate mimic_data_extraction
```

This creates and activates the desired environment. It typically takes less than 5 minutes and requires a good internet connection.
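To confirm the environment was created and activated (the environment name comes from mimic_extract_env.yml):

```bash
# The active environment is marked with a '*' in the listing.
conda env list | grep mimic_data_extraction
```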
## Step 3: Build Views for Feature Extraction

This step generates materialized views in the MIMIC PostgreSQL database, including all concept tables from the MIT-LCP repo as well as views for extracting non-mechanical ventilation and crystalloid/colloid bolus injections:

```bash
make build_concepts
```
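Once the build finishes, a quick sanity check is to list the materialized views (a sketch; the database name `mimic` is an assumption, so substitute your own connection settings):

```bash
# List materialized views across all schemas; the concept views should appear.
psql mimic -c '\dm *.*'
```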
## Step 4: Set Cohort Selection and Extraction Criteria

Parameters for the extraction code are specified in build_curated_from_psql.sh. Cohort selection criteria regarding minimum admission age are set through `min_age`; the minimum and maximum length of ICU stay in hours are set through `min_duration` and `max_duration`. Only vitals and labs with more than `min_percent` percent non-missingness are extracted, and extracted vitals and labs are clinically aggregated unless `group_by_level2` is explicitly set. Outlier correction is applied unless `var_limit` is set to 0.
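As an illustration of how these criteria might be overridden, the extraction script can also be invoked directly (a sketch only: the flag spellings mirror the parameter names above and the values are arbitrary examples; confirm the actual options with `python mimic_direct_extract.py --help`):

```bash
# Example: adults only, ICU stays between 12 and 240 hours,
# keep only vitals/labs with more than 25% non-missingness.
python mimic_direct_extract.py \
    --min_age 18 \
    --min_duration 12 \
    --max_duration 240 \
    --min_percent 25
```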
## Step 5: Build Curated Dataset from PostgreSQL

```bash
make build_curated_from_psql
```

The default setting will create an hdf5 file inside `MIMIC_EXTRACT_OUTPUT_DIR` with four tables:
- `patients`: static demographics, static outcomes.
  - One row per (subj_id, hadm_id, icustay_id)
- `vitals_labs`: time-varying vitals and labs (hourly mean, count, and standard deviation).
  - One row per (subj_id, hadm_id, icustay_id, hours_in)
- `vitals_labs_mean`: time-varying vitals and labs (hourly mean only).
  - One row per (subj_id, hadm_id, icustay_id, hours_in)
- `interventions`: hourly binary indicators for administered interventions.
  - One row per (subj_id, hadm_id, icustay_id, hours_in)
This will probably take 5-10 hours and requires a good machine with at least 50GB of RAM.
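Once the build completes, the four tables can be verified from the same environment (a sketch, assuming the output file is named all_hourly_data.h5; substitute whatever file name appears in `MIMIC_EXTRACT_OUTPUT_DIR`):

```bash
# List the tables stored in the output hdf5 file, using the pipeline's
# own python environment (pandas is available there).
python - <<'EOF'
import pandas as pd
# The file name here is assumed; substitute the file your run produced.
with pd.HDFStore('all_hourly_data.h5', mode='r') as store:
    print(store.keys())  # expect /patients, /vitals_labs, /vitals_labs_mean, /interventions
EOF
```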
By default, this step builds a dataset with all eligible patients. Sometimes we wish to run with only a small subset of patients (e.g., for debugging). To do this, set the POP_SIZE environment variable. For example, to build a curated dataset with only the first 1000 patients:

```bash
POP_SIZE=1000 make build_curated_from_psql
```