CORE-BEHRT aims to provide a robust framework for reproducing state-of-the-art results in EHR data analysis using a BEHRT-like model. Follow the steps below to reproduce the results.
Before you begin, ensure you have the necessary dependencies installed. This project may require:
- PyTorch
- transformers
- NumPy
- pandas
- scikit-learn
- tqdm
- matplotlib
- pyarrow (only if Parquet files are used)
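A quick way to confirm the environment before running any script is to check that each package can be located. The sketch below is not part of the repository; it only assumes the usual import names for the packages listed above (`torch` for PyTorch, `sklearn` for scikit-learn), and the helper name `find_missing` is made up for illustration.

```python
import importlib.util

# Import names for the packages listed above. Note that PyTorch is imported
# as `torch` and scikit-learn as `sklearn`.
REQUIRED = ["torch", "transformers", "numpy", "pandas", "sklearn", "tqdm", "matplotlib"]
OPTIONAL = ["pyarrow"]  # only needed when Parquet files are used


def find_missing(module_names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
missing_optional = find_missing(OPTIONAL)
print("Missing required packages:", missing or "none")
print("Missing optional packages:", missing_optional or "none")
```

The check only locates the packages without importing them, so it runs quickly and does not verify versions.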
To correctly prepare your data for processing, execute the scripts in the following order. Ensure your data adheres to the specified format before starting:
- Data Format
  - Required Data Format:
    - Patient Data: Ensure the file patient_format.csv contains columns for PID, DATE_OF_BIRTH, DATE_OF_DEATH, RACE, GENDER, and other relevant background features.
    - Event Data: The files concept.{code_type}.csv should include TIMESTAMP, PID, ADMISSION_ID, and CONCEPT.
  - Use the preprocessing tools available at ehr_preprocess to convert your raw data into the required format.
- Feature Creation and Tokenization
  - main_create_data: Stores the features as dictionaries whose keys are the different concept data streams (concept, segment, age, abspos, ...) and whose values are lists of lists holding the patient sequences, then tokenizes the features. Use the data_pretrain.yaml config.
- Model Pre-training
  - main_pretrain: Pre-trains a standard BEHRT model on the tokenized features.
- Data Preparation for Fine-tuning
  - main_create_outcomes: From the formatted data, creates a dictionary with the events of interest (abspos of the first occurrence). Example dictionary: {'PID': ['p1', 'p2', ...], 'EVENT1': [5423, None, ...], ...}
- Model Fine-tuning
  - main_finetune_cv: Performs 5-fold cross-validation, followed by evaluation on a holdout set.
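Two of the steps above lend themselves to a small illustration: checking the input headers before starting, and the shape of the outcomes dictionary. The sketch below is not the repository's implementation. The column names come from the format description above; the helper names (`missing_columns`, `first_occurrence_outcomes`), the in-memory sample files, and the assumption that abspos is a number that increases with time are all hypothetical.

```python
import csv
import io

# Column names are taken from the format description; the rest is illustrative.
PATIENT_COLUMNS = {"PID", "DATE_OF_BIRTH", "DATE_OF_DEATH", "RACE", "GENDER"}
EVENT_COLUMNS = {"TIMESTAMP", "PID", "ADMISSION_ID", "CONCEPT"}


def missing_columns(csv_file, required):
    """Return the required column names absent from the CSV header, sorted."""
    header = next(csv.reader(csv_file), [])
    return sorted(required - {name.strip() for name in header})


def first_occurrence_outcomes(events, outcome_codes, pids):
    """Build {'PID': [...], code: [first abspos or None, ...]}.

    `events` is an iterable of (pid, concept, abspos) tuples. abspos is
    assumed to be a number that increases with time.
    """
    first_seen = {}
    for pid, concept, abspos in events:
        if concept in outcome_codes:
            key = (pid, concept)
            if key not in first_seen or abspos < first_seen[key]:
                first_seen[key] = abspos
    outcomes = {"PID": list(pids)}
    for code in outcome_codes:
        outcomes[code] = [first_seen.get((pid, code)) for pid in pids]
    return outcomes


# Hypothetical in-memory stand-ins for patient_format.csv and a concept file.
patient_csv = io.StringIO(
    "PID,DATE_OF_BIRTH,DATE_OF_DEATH,RACE,GENDER\np1,1950-01-01,,A,F\n"
)
event_csv = io.StringIO("TIMESTAMP,PID,CONCEPT\n2020-01-01,p1,EVENT1\n")

patient_missing = missing_columns(patient_csv, PATIENT_COLUMNS)
event_missing = missing_columns(event_csv, EVENT_COLUMNS)

# p1 has EVENT1 twice; p2 never has it, so its entry is None.
events = [("p1", "EVENT1", 6000), ("p1", "EVENT1", 5423), ("p2", "OTHER", 100)]
outcomes = first_occurrence_outcomes(events, ["EVENT1"], ["p1", "p2"])
print(patient_missing, event_missing, outcomes)
```

With real data, the same header check could be applied to files opened with `open(path, newline="")` instead of the in-memory samples.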
To run the hierarchical version of the model, use the following scripts:
- setup_hierarchical: Utilizes features from main_create_data to construct a feature tree and generate hierarchical features.
- main_h_pretrain: Trains a BERT model with a hierarchical loss function.
To evaluate the model performance:
- main_finetune_cv_get_stats: Gets basic statistics on patient sequences in the train/validation and test sets.
- main_finetune_cv_test: Tests models trained in CV on a dataset of your choice.
- main_encode_censored_patients: Encodes sequences using a trained model, censored on a predetermined event.
- main_encode_concepts: Saves encodings of concepts.