/ehr-ohdsi-processor

OHDSI data processor used to generate various features from EHRs (Electronic Health Records). It was developed and deployed for EHR Morbidity DREAM Challenge.

Primary LanguageScalaApache License 2.0Apache-2.0

EHR OHDSI Processor version License Build Status

  • Single pass EHR (Electronic Health Record) feature generation and extraction pipeline using OHDSI standardized tables/csvs and concepts.

  • The processor is implemented using Akka streaming technology, which contributes to memory efficient, fast, and scalable asynchronous processing. The core abstraction is a so-called flow: an in-and-out foldable stream attached to a csv source with prefiltered values based on matching date time intervals. Flows for different features are bundled where the input parsed values are broadcasted, zipped and unzipped accordingly.

  • Features are generated based on JSON-like settings in application.conf, which can be freely altered and rerun without touching or recompiling the code.

  • Supported features types are:

    • Counts - number of records per date interval.

    • Distinct Counts - number of distinct values within a given column (e.g., concept ids and care sites) per date interval.

    • Concept Category Counts - number of records whose concept ids for a given column belong to a given category per date interval.

    • Concept Category Exist Flags - binary flag indicating whether there exists at least one record with a concept id for a given column belonging to a given category per date interval.

    • Duration - duration from the first occurrence calculated per date interval.

    • Sums - sum of values for a given column and date interval; used purely for drug quantity.

    • Time-lag Features - records (for each table/csv file) are sorted by date, then time lags are calculated and compacted to mean, std, min, max, and dominant relative differential frequency (grouped to 5 bins: -2,-1,0,1,2) indicating the prevalent direction/acceleration of record dates.

    • Comorbidity scores - linear comorbidity measure calculated based on Elixhuser's categories per date interval (reported to have good predictability for short-term mortality). Two versions with different weights were used: AHRQ and van Walraven.

    • Dynamically Calculated Comorbidity Scores - instead of using fixed weights as above, weights are dynamically calculated based on differential relative ratios of the dead patients (withing 6 months) vs others. Two versions were implemented (see bellow): the first one uses the same categories as Elixhauser, the second one takes into account only serious conditions.

    • Non-Aggregate (Static) Features - these features are directly generated from person.csv and include gender (concepts binarized), age_at_last_visit, year_of_birth, and month_of_birth.

  • Additionally, the processor supports custom date intervals counted from the last visit (for each person), such as last 6 months, last 5-3 years, etc. For each such date interval a set of features are generated.

Prerequisite

  • JDK 1.8 or higher

Build

To create an executable jar with all dependencies run

sbt assembly

This will produce a file such as ehr-ohdsi-processor-assembly-0.4.1.jar

Usage

1. Feature Generation

  • basic feature generation
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -i=<input_folder> -o=<output_file_name>
  • or without an output file (features.csv in the input folder will be used)
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -i=<input_folder>
  • note the optional 'mode' option
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -mode=features -i=<input_folder> -o=<output_file_name>
  • features generation using custom features, concept categories, or date intervals passed via '-Dconfig.file'
java  -Xms10g -Xmx10g -Xss1M -Dconfig.file=<my_custom_application.conf> -jar ehr-ohdsi-processor-assembly-0.4.1.jar -i=<input_folder> -o=<output_file_name>
  • features generation with time-lag based features
java  -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -with_time_lags= -i=<input_folder> -o=<output_file_name>
  • features generation with time-lag based features and dynamic scores' weights export
java  -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -with_time_lags= -i=<input_folder> -o=<output_file_name> -o-dyn_score_weights=<weight_file_name>
  • features generation with time-lag based features and dynamic scores' weights import
java  -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -with_time_lags= -i=<input_folder> -o=<output_file_name> -i-dyn_score_weights=<weight_file_name>

2. Standardization

  • standardization with comma delimited input files (no spaces)
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -mode=std -i=<input_files> -o=<output_folder_name>
  • or without an output folder (the respective input folders will be used)
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -mode=std -i=<input_files>
  • standardization with additional stats output (the generated file is '-std.stats')
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -mode=std -ostats= -i=<input_files> -o=<output_folder_name>
  • standardization using explicitly passed stats as input (means + stds)
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -mode=std -istats=<input_stats_file> -i=<input_files> -o=<output_folder_name>
  • standardization including the time-lag based feautures
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -mode=std -with_time_lags= -i=<input_files> -o=<output_folder_name>

3. Changing the logging configuration

  • by passing a logback file
-Dlogback.configurationFile=<path_to_logback.xml>