surprising-sepsis: A Python repository from Sontag Lab

These files are used to read patient data and train random forest classifiers to predict risk of sepsis.

Patient data is recorded in timeline form. Every piece of information is recorded with a timestamp so
we know when it becomes available to the surveillance algorithm. Our goal is to predict
the likelihood of sepsis before it is "too late". Our current definition of too late is when the physician
demonstrates suspicion of infection by ordering a lactate lab test or antibiotics.

When training, we sample points uniformly from the patient timelines and use them in a random forest classifier.
When testing, we continuously predict, updating the prediction every time a new piece of information becomes available.
- A positive prediction occurs if we predict a value above a threshold before it's too late.
- Otherwise the prediction is considered negative.
- By varying that threshold we obtain ROC and precision-recall curves.
- For any threshold, we can also determine how far before the deadline the simulated alert would fire.
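
The threshold sweep described above can be sketched as follows. This is a minimal illustration, not the repository's evaluation code: the function name `roc_points` and its inputs (one maximum pre-deadline risk and one 0/1 label per visit) are assumptions made for the example.

```python
def roc_points(max_risks, labels, thresholds):
    """Return one (threshold, false_positive_rate, true_positive_rate) per threshold.

    max_risks: highest predicted risk seen before the deadline, one per visit.
    labels: 1 for sepsis visits, 0 otherwise (same order as max_risks).
    Both inputs are hypothetical stand-ins for the pipeline's real outputs.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for threshold in thresholds:
        # A visit counts as predicted positive if its risk exceeded the
        # threshold at some point before the deadline.
        tp = sum(1 for r, y in zip(max_risks, labels) if r > threshold and y == 1)
        fp = sum(1 for r, y in zip(max_risks, labels) if r > threshold and y == 0)
        tpr = tp / n_pos if n_pos else 0.0
        fpr = fp / n_neg if n_neg else 0.0
        points.append((threshold, fpr, tpr))
    return points
```

Plotting the returned rates over a fine grid of thresholds would trace out the ROC curve; precision and recall can be computed from the same counts.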

The pipeline assumes that the data files are available in CSV format.

Main training pipeline (see pipeline.sh):
-----------------------
set_up_directory.py
  * creates a working directory where files will be stored
  * settings are copied from the settings/ directory into the working directory

build_csn_mrn_map.py -- builds a mapping between CSNs and MRNs in the cohort
build_labels.py -- uses ICD9 codes to determine a label for each visit
build_train_test_split.py -- randomly splits visits into train/test
build_demographic_vectors.py -- demographics are stored separately
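
As a rough sketch of the labeling and splitting steps listed above (not the actual scripts): the function names, the input shapes, the `test_fraction` and `seed` parameters, and any codes passed in are assumptions for illustration. The real code list lives in the SEPSIS_ICD9 settings file.

```python
import random


def label_visits(visit_codes, sepsis_codes):
    """Label a visit 1 if any of its ICD9 codes is in sepsis_codes, else 0.

    visit_codes: dict mapping a visit id to an iterable of code strings
    (a hypothetical input shape, not the repository's format).
    """
    sepsis_codes = set(sepsis_codes)
    return {
        visit_id: int(any(code in sepsis_codes for code in codes))
        for visit_id, codes in visit_codes.items()
    }


def split_visits(visit_ids, test_fraction=0.3, seed=0):
    """Randomly split visit ids into (train, test) lists, reproducibly."""
    ids = sorted(visit_ids)          # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)
    n_test = int(round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]
```

Splitting by visit keeps all records of one visit on the same side of the split.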

build_patient_timelines.py
  * reads patient data from the data directory and stores it in a shelf file for indexed access.
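
A minimal sketch of storing timelines in a shelf, using Python's standard `shelve` module. The function name `build_timeline_shelf` and the `(visit_id, timestamp, payload)` record shape are assumptions for the example; the script's real record format may differ.

```python
import shelve
from collections import defaultdict


def build_timeline_shelf(records, shelf_path):
    """Group records by visit and store each time-sorted timeline in a shelf.

    records: iterable of (visit_id, timestamp, payload) tuples
    (a hypothetical shape chosen for this sketch).
    """
    timelines = defaultdict(list)
    for visit_id, timestamp, payload in records:
        timelines[visit_id].append((timestamp, payload))

    with shelve.open(shelf_path) as shelf:
        for visit_id, events in timelines.items():
            # Shelf keys must be strings; sort on the timestamp only.
            shelf[str(visit_id)] = sorted(events, key=lambda event: event[0])
```

Later stages can then open the shelf and look up a single visit by key without loading the whole cohort into memory.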

build_deadlines.py
  * reads from visit shelf and determines the "too late" time for each patient.
  * the condition for "too late" is defined in the function cutoff_record at the top of the file.
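
To make the deadline idea concrete, here is a hedged sketch. `example_cutoff_record` is a hypothetical stand-in for the real cutoff_record, and the record layout (a dict with a "type" key) and the type names are assumptions for illustration only.

```python
# Hypothetical record types that signal suspicion of infection.
CUTOFF_TYPES = {"lactate_order", "antibiotic_order"}


def example_cutoff_record(record):
    """Return True if this record marks the 'too late' point (assumed format)."""
    return record.get("type") in CUTOFF_TYPES


def find_deadline(timeline):
    """Return the timestamp of the earliest cutoff record, or None if absent.

    timeline: iterable of (timestamp, record) pairs in any order.
    """
    for timestamp, record in sorted(timeline, key=lambda pair: pair[0]):
        if example_cutoff_record(record):
            return timestamp
    return None
```

Keeping the condition in one small predicate means the definition of "too late" can be changed without touching the timeline scan.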

build_vocab.py

decision_tree_learning.py
  * reads patient records and builds feature vectors
  * learns a random forest classifier
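
One possible reading of "sampling points uniformly from the patient timelines" is sketched below: draw times uniformly between the first record and the deadline, and note how many records are visible at each sampled time. This interpretation, the function name, and the parameters are assumptions; the script may sample differently (for example, uniformly over records).

```python
import bisect
import random


def sample_training_points(record_times, deadline, n_samples, seed=0):
    """Sample times uniformly in [first record, deadline].

    Returns a list of (sample_time, n_visible_records) pairs, where
    n_visible_records counts records timestamped at or before sample_time.
    """
    # Only records strictly before the deadline may be used for training.
    usable = sorted(t for t in record_times if t < deadline)
    if not usable:
        return []
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        t = rng.uniform(usable[0], deadline)
        n_visible = bisect.bisect_right(usable, t)
        samples.append((t, n_visible))
    return samples
```

Each sampled point would then be turned into a feature vector from the first `n_visible_records` records and passed to the classifier.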

decision_tree_testing.py
  * runs the learned classifier on the test patient timelines, re-evaluating patients every
    time a new piece of information becomes available. Records the maximum value of predicted risk
    as well as the first time the classifier goes above a specified alerting threshold
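
The replay described above might look roughly like this. It is a sketch under assumptions: `replay_timeline` is a hypothetical name, timestamps are plain numbers, and the risk scores are taken as already computed for each update.

```python
def replay_timeline(scored_events, deadline, threshold):
    """Walk one visit's scored updates in time order, stopping at the deadline.

    scored_events: iterable of (timestamp, risk) pairs sorted by timestamp.
    Returns (max_risk, first_alert_time, lead_time); the last two are None
    if the risk never exceeds the threshold before the deadline.
    """
    max_risk = None
    first_alert = None
    for timestamp, risk in scored_events:
        if timestamp >= deadline:
            break  # predictions at or after the deadline are too late
        if max_risk is None or risk > max_risk:
            max_risk = risk
        if first_alert is None and risk > threshold:
            first_alert = timestamp
    lead_time = deadline - first_alert if first_alert is not None else None
    return max_risk, first_alert, lead_time
```

The maximum risk feeds the threshold sweep, while the lead time shows how early a simulated alert would have fired at a given threshold.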

Other important files
--------------------
  Patient.py
    * Describes the Patient representation object
    * Each new data record updates the internal state of
      the Patient object
  
  generate_vectors.py
    * Generates all feature vectors for a single patient
      (i.e., one new feature vector every time a new piece
      of data becomes available)
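
The relationship between the patient state and the per-update feature vectors can be illustrated with a small sketch. `ExamplePatient`, `example_generate_vectors`, the record keys "name" and "value", and the fill value for missing features are all hypothetical; the real Patient class may track far more state.

```python
class ExamplePatient:
    """Toy patient state: keeps the most recent value of each measurement."""

    def __init__(self):
        self.latest = {}

    def update(self, record):
        # Each new record overwrites the previous value of the same name.
        self.latest[record["name"]] = record["value"]

    def to_vector(self, vocab, missing=0.0):
        # One position per vocabulary entry, in vocabulary order.
        return [self.latest.get(name, missing) for name in vocab]


def example_generate_vectors(timeline, vocab):
    """Return one (timestamp, feature_vector) pair per record in the timeline."""
    patient = ExamplePatient()
    vectors = []
    for timestamp, record in timeline:
        patient.update(record)
        vectors.append((timestamp, patient.to_vector(vocab)))
    return vectors
```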

  utils.py
    * some useful utility functions

  fieldReader
    * A CSV reader that extracts important fields and maps to standard names
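
A reader of this kind can be approximated with the standard `csv` module. The sketch below is an assumption-laden illustration: `read_standard_fields` and the `field_map` dictionary are hypothetical, and the actual format of FIELDS.txt is not shown here.

```python
import csv


def read_standard_fields(csv_file, field_map):
    """Yield one dict per CSV row, keeping only mapped columns.

    csv_file: an open text file (or any iterable of lines) with a header row.
    field_map: dict from source column name to standard name
    (a hypothetical stand-in for what FIELDS.txt provides).
    """
    reader = csv.DictReader(csv_file)
    for row in reader:
        # Columns not listed in field_map are dropped.
        yield {standard: row[source] for source, standard in field_map.items()}
```

Mapping every input file to the same standard names lets the downstream scripts ignore per-file column naming differences.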

Settings directory
------------------
The settings directory has the following files:

  FIELDS.txt
    * gives the FieldReader the required information to parse CSV files
  FILES_TO_READ.txt
    * a list of files that will be the input from which to build patient representations
  SEPSIS_CRITERIA.txt
    * not used
  SEPSIS_ICD9
    * ICD9 codes used to define the sepsis outcome