Single-pass EHR (Electronic Health Record) feature generation and extraction pipeline using OHDSI standardized tables/CSVs and concepts.
The processor is implemented with Akka Streams, which enables memory-efficient, fast, and scalable asynchronous processing. The core abstraction is a flow: a foldable stream with one input and one output, attached to a CSV source whose values are prefiltered by matching date-time intervals. Flows for different features are bundled together; the parsed input values are broadcast to them, then zipped and unzipped as needed.
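As a rough illustration of the single-pass idea (plain Python, not the project's Akka code), the sketch below folds one stream of parsed rows into several feature accumulators at once, keeping only rows whose date falls inside a given interval. The column names `person_id`, `date`, and `concept_id` and the helper `fold_features` are assumptions made for this example.

```python
from datetime import date

def fold_features(rows, start, end):
    """Fold rows into per-person (count, distinct concept count) in one pass.

    rows: iterable of dicts with hypothetical keys person_id, date, concept_id.
    Only rows with start <= date <= end contribute.
    """
    counts = {}
    distinct = {}
    for row in rows:
        if not (start <= row["date"] <= end):
            continue  # prefilter by date interval
        pid = row["person_id"]
        # the same parsed row feeds every feature accumulator
        counts[pid] = counts.get(pid, 0) + 1
        distinct.setdefault(pid, set()).add(row["concept_id"])
    return {pid: (counts[pid], len(distinct[pid])) for pid in counts}

rows = [
    {"person_id": 1, "date": date(2020, 1, 10), "concept_id": 100},
    {"person_id": 1, "date": date(2020, 2, 1), "concept_id": 100},
    {"person_id": 1, "date": date(2019, 1, 1), "concept_id": 200},
    {"person_id": 2, "date": date(2020, 3, 5), "concept_id": 300},
]
features = fold_features(rows, date(2020, 1, 1), date(2020, 12, 31))
```

Each row is read once and updates every accumulator, so memory grows with the number of persons and distinct values rather than with the number of records.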
Features are generated from JSON-like settings in `application.conf`, which can be freely altered before rerunning the processor, without touching or recompiling the code.
Supported feature types are:
- Counts - number of records per date interval.
- Distinct Counts - number of distinct values within a given column (e.g., concept ids and care sites) per date interval.
- Concept Category Counts - number of records whose concept ids for a given column belong to a given category per date interval.
- Concept Category Exist Flags - binary flag indicating whether there exists at least one record with a concept id for a given column belonging to a given category per date interval.
- Duration - duration from the first occurrence calculated per date interval.
- Sums - sum of values for a given column and date interval; used purely for drug quantity.
- Time-lag Features - records (for each table/csv file) are sorted by date, then time lags are calculated and summarized as mean, std, min, max, and the dominant relative differential frequency (grouped into 5 bins: -2, -1, 0, 1, 2), which indicates the prevalent direction/acceleration of record dates.
- Comorbidity Scores - linear comorbidity measure calculated from Elixhauser's categories per date interval (reported to have good predictability for short-term mortality). Two versions with different weights are used: AHRQ and van Walraven.
- Dynamically Calculated Comorbidity Scores - instead of using fixed weights as above, weights are calculated dynamically from differential relative ratios of patients who died (within 6 months) vs. others. Two versions are implemented (see below): the first uses the same categories as Elixhauser, the second takes into account only serious conditions.
- Non-Aggregate (Static) Features - these features are generated directly from `person.csv` and include `gender` (concepts binarized), `age_at_last_visit`, `year_of_birth`, and `month_of_birth`.
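To make the time-lag features from the list above more concrete, here is a minimal Python sketch of the summary statistics only. It assumes lags are measured in whole days and that the standard deviation is the population one; neither is confirmed by the project. The dominant relative differential frequency is left out because its exact definition is not given here. The function name `time_lag_stats` is hypothetical.

```python
from datetime import date
from statistics import mean, pstdev

def time_lag_stats(dates):
    """Summarize gaps (in days) between consecutive record dates."""
    ordered = sorted(dates)  # records are sorted by date first
    lags = [(b - a).days for a, b in zip(ordered, ordered[1:])]
    if not lags:
        return None  # fewer than two records: no lag to summarize
    return {
        "mean": mean(lags),
        "std": pstdev(lags),  # assumption: population standard deviation
        "min": min(lags),
        "max": max(lags),
    }

stats = time_lag_stats([date(2020, 1, 11), date(2020, 1, 1), date(2020, 1, 31)])
```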
Additionally, the processor supports custom date intervals counted back from each person's last visit, such as the last 6 months or the last 5-3 years. A set of features is generated for each such date interval.
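A minimal Python sketch of how such an interval could be derived and applied, assuming intervals are expressed as day offsets before the last visit and that both bounds are inclusive (the project's actual configuration format and boundary handling may differ). The helper names `interval_before` and `count_in_interval` are hypothetical.

```python
from datetime import date, timedelta

def interval_before(last_visit, from_days, to_days):
    """Return (start, end) spanning from_days..to_days before last_visit."""
    start = last_visit - timedelta(days=from_days)
    end = last_visit - timedelta(days=to_days)
    return start, end

def count_in_interval(record_dates, last_visit, from_days, to_days):
    """Count records whose date lies inside the interval (bounds inclusive)."""
    start, end = interval_before(last_visit, from_days, to_days)
    return sum(1 for d in record_dates if start <= d <= end)

last_visit = date(2020, 6, 30)
record_dates = [date(2020, 6, 1), date(2020, 2, 15), date(2016, 7, 1), date(2014, 1, 1)]

# "last 6 months", approximated here as 180 days
last_6_months = count_in_interval(record_dates, last_visit, 180, 0)
# "last 5-3 years", approximated here as 5*365 to 3*365 days
years_5_to_3 = count_in_interval(record_dates, last_visit, 5 * 365, 3 * 365)
```

Because the intervals are anchored to each person's own last visit, the same configuration yields comparable features for patients whose records cover different calendar periods.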
- JDK 1.8 or higher
To create an executable jar with all dependencies, run `sbt assembly`. This will produce a file such as `ehr-ohdsi-processor-assembly-0.4.1.jar`.
- basic feature generation
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -i=<input_folder> -o=<output_file_name>
- or without an output file (features.csv in the input folder will be used)
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -i=<input_folder>
- note the optional 'mode' option
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -mode=features -i=<input_folder> -o=<output_file_name>
- feature generation using custom features, concept categories, or date intervals passed via '-Dconfig.file'
java -Xms10g -Xmx10g -Xss1M -Dconfig.file=<my_custom_application.conf> -jar ehr-ohdsi-processor-assembly-0.4.1.jar -i=<input_folder> -o=<output_file_name>
- feature generation with time-lag based features
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -with_time_lags= -i=<input_folder> -o=<output_file_name>
- feature generation with time-lag based features and export of the dynamic score weights
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -with_time_lags= -i=<input_folder> -o=<output_file_name> -o-dyn_score_weights=<weight_file_name>
- feature generation with time-lag based features and import of the dynamic score weights
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -with_time_lags= -i=<input_folder> -o=<output_file_name> -i-dyn_score_weights=<weight_file_name>
- standardization with comma-delimited input files (no spaces)
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -mode=std -i=<input_files> -o=<output_folder_name>
- or without an output folder (the respective input folders will be used)
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -mode=std -i=<input_files>
- standardization with additional stats output (the generated file is '-std.stats')
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -mode=std -ostats= -i=<input_files> -o=<output_folder_name>
- standardization using explicitly passed stats as input (means + stds)
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -mode=std -istats=<input_stats_file> -i=<input_files> -o=<output_folder_name>
- standardization including the time-lag based features
java -Xms10g -Xmx10g -Xss1M -jar ehr-ohdsi-processor-assembly-0.4.1.jar -mode=std -with_time_lags= -i=<input_files> -o=<output_folder_name>
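The standardization mode above works with per-feature means and standard deviations. As a hedged sketch of what that typically means, the Python below applies a z-score to each column. The population standard deviation and the mapping of constant columns to 0.0 are assumptions, not confirmed behavior of the tool, and the function names `column_stats` and `standardize` are hypothetical.

```python
def column_stats(rows):
    """Compute per-column means and population standard deviations."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [(sum((x - m) ** 2 for x in c) / n) ** 0.5 for c, m in zip(cols, means)]
    return means, stds

def standardize(rows, means, stds):
    """Z-score each value: (x - mean) / std; constant columns map to 0.0."""
    return [
        [(x - m) / s if s != 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

rows = [[1.0, 10.0], [3.0, 10.0]]
means, stds = column_stats(rows)
scaled = standardize(rows, means, stds)
```

Passing previously computed statistics, as with `-istats`, corresponds to calling `standardize` with the `means` and `stds` from another data set, for example scaling test data with training statistics.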
- custom logging by passing a logback file
-Dlogback.configurationFile=<path_to_logback.xml>