/clinical-fusion

Code and datasets for the paper "Combining structured and unstructured data for predictive models: a deep learning approach", published in BMC Medical Informatics and Decision Making in 2020.


Combining structured and unstructured data for predictive models: a deep learning approach

This repository contains the source code for the paper "Combining structured and unstructured data for predictive models: a deep learning approach". In this paper, we propose two frameworks, Fusion-CNN and Fusion-LSTM, that combine sequential clinical notes and temporal signals for patient outcome prediction. Experiments on in-hospital mortality prediction, long length-of-stay prediction, and 30-day readmission prediction on the MIMIC-III dataset empirically show the effectiveness of the proposed models. Combining structured and unstructured data leads to a significant performance improvement.

Framework

Fusion-CNN

Fusion-CNN is based on document embeddings, convolutional layers, and max-pooling layers. The final patient representation is the concatenation of the latent representation of the sequential clinical notes, the latent representation of the temporal signals, and the static information vector. This representation is then passed to output layers to make predictions.
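The description above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's implementation: the class name `FusionCNNSketch`, the layer sizes, the kernel size, and the single-logit output are all assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn


class FusionCNNSketch(nn.Module):
    """Illustrative only: names and sizes are assumptions, not the repository's."""

    def __init__(self, note_dim=200, signal_dim=30, static_dim=5, hidden=64):
        super().__init__()
        # 1-D convolutions slide along the sequence axis of each modality.
        self.note_conv = nn.Conv1d(note_dim, hidden, kernel_size=3, padding=1)
        self.signal_conv = nn.Conv1d(signal_dim, hidden, kernel_size=3, padding=1)
        # Output layer operates on the concatenated patient representation.
        self.out = nn.Linear(hidden * 2 + static_dim, 1)

    def forward(self, notes, signals, static):
        # notes:   (batch, n_notes, note_dim), one document embedding per note
        # signals: (batch, n_steps, signal_dim), temporal signals
        # static:  (batch, static_dim), static information vector
        n = torch.relu(self.note_conv(notes.transpose(1, 2)))
        s = torch.relu(self.signal_conv(signals.transpose(1, 2)))
        # Max-pool over the sequence axis to get one vector per patient.
        n = torch.max(n, dim=2)[0]
        s = torch.max(s, dim=2)[0]
        patient = torch.cat([n, s, static], dim=1)
        return self.out(patient)
```

Calling `FusionCNNSketch()(torch.randn(4, 6, 200), torch.randn(4, 24, 30), torch.randn(4, 5))` returns one logit per patient, with shape `(4, 1)`.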

Fusion-LSTM

Fusion-LSTM is based on document embeddings, LSTM layers, and max-pooling layers. The final patient representation is the concatenation of the latent representation of the sequential clinical notes, the latent representation of the temporal signals, and the static information vector. This representation is then passed to output layers to make predictions.
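A comparable PyTorch sketch for the LSTM variant is shown below. As before, this is an illustration under assumptions rather than the repository's code: the class name `FusionLSTMSketch`, the hidden size, the use of unidirectional single-layer LSTMs, and the single-logit output are hypothetical choices.

```python
import torch
import torch.nn as nn


class FusionLSTMSketch(nn.Module):
    """Illustrative only: names and sizes are assumptions, not the repository's."""

    def __init__(self, note_dim=200, signal_dim=30, static_dim=5, hidden=64):
        super().__init__()
        # One LSTM per sequential modality; batch_first keeps (batch, seq, feat).
        self.note_lstm = nn.LSTM(note_dim, hidden, batch_first=True)
        self.signal_lstm = nn.LSTM(signal_dim, hidden, batch_first=True)
        # Output layer operates on the concatenated patient representation.
        self.out = nn.Linear(hidden * 2 + static_dim, 1)

    def forward(self, notes, signals, static):
        # notes:   (batch, n_notes, note_dim), one document embedding per note
        # signals: (batch, n_steps, signal_dim), temporal signals
        # static:  (batch, static_dim), static information vector
        n, _ = self.note_lstm(notes)      # (batch, n_notes, hidden)
        s, _ = self.signal_lstm(signals)  # (batch, n_steps, hidden)
        # Max-pool the LSTM outputs over the sequence axis.
        n = torch.max(n, dim=1)[0]
        s = torch.max(s, dim=1)[0]
        patient = torch.cat([n, s, static], dim=1)
        return self.out(patient)
```

The only structural difference from a convolutional variant is the sequence encoder; the pooling, concatenation, and output steps are the same.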

Requirements

Dataset

The MIMIC-III database analyzed in the study is available from the PhysioNet repository. Here are some steps to prepare the dataset:

Software

  • Python: 3.6.10
  • Gensim: 3.8.0
  • NLTK: 3.4.5
  • Numpy: 1.14.2
  • Pandas: 0.25.3
  • Scikit-learn: 0.20.1
  • Tqdm: 4.42.1
  • PyTorch: 1.4.0

Preprocessing

$ python 00_define_cohort.py # define patient cohort and collect labels
$ python 01_get_signals.py # extract temporal signals (vital signs and laboratory tests)
$ python 02_extract_notes.py --firstday # extract first day clinical notes
$ python 03_merge_ids.py # merge admission IDs
$ python 04_statistics.py # run statistics
$ python 05_preprocess.py # run preprocessing
$ python 06_doc2vec.py --phase train # train doc2vec model
$ python 06_doc2vec.py --phase infer # infer doc2vec vectors
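The commands above can also be driven from a small Python script. The sketch below assumes the scripts must run in the listed order and that each one exits with a non-zero status on failure; the helper names `build_commands` and `run_pipeline` are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Script names and flags copied from the command list above, in order.
PIPELINE = [
    ["00_define_cohort.py"],
    ["01_get_signals.py"],
    ["02_extract_notes.py", "--firstday"],
    ["03_merge_ids.py"],
    ["04_statistics.py"],
    ["05_preprocess.py"],
    ["06_doc2vec.py", "--phase", "train"],
    ["06_doc2vec.py", "--phase", "infer"],
]


def build_commands(python=sys.executable):
    """Return the full argument list for every preprocessing step."""
    return [[python] + step for step in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` only prints the commands; `run_pipeline(dry_run=False)` executes them from the current working directory.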

Run

Baselines

Baselines (i.e., logistic regression and random forest) are implemented using scikit-learn. To run:

$ python baselines.py --model [model] --task [task] --inputs [inputs]

Deep models

Fusion-CNN and Fusion-LSTM are implemented using PyTorch. To run:

$ python main.py --model [model] --task [task] --inputs [inputs] # train Fusion-CNN or Fusion-LSTM
$ python main.py --model [model] --task [task] --inputs [inputs] --phase test --resume # evaluate
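To train and then evaluate the same configuration without retyping the flags, the two commands can be generated from one set of arguments. This is a small sketch: the helper `main_command` is hypothetical, and the accepted values for `--model`, `--task`, and `--inputs` are not listed here, so they are left as caller-supplied strings.

```python
def main_command(model, task, inputs, evaluate=False):
    """Build the main.py argument list for training or evaluation."""
    cmd = ["python", "main.py",
           "--model", model, "--task", task, "--inputs", inputs]
    if evaluate:
        # Evaluation reuses the training flags and adds --phase test --resume.
        cmd += ["--phase", "test", "--resume"]
    return cmd


def train_then_evaluate(model, task, inputs):
    """Return the training command followed by the matching evaluation command."""
    return [
        main_command(model, task, inputs),
        main_command(model, task, inputs, evaluate=True),
    ]
```

Each returned list can be passed directly to `subprocess.run(cmd, check=True)`.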