PaniniQA

Repo for the TACL 2023 paper "PaniniQA: Enhancing Patient Education Through Interactive Question Answering"

1. Dataset

We open source two datasets:

Dataset 1 - 456 annotated discharge instructions from the MIMIC-III Clinical Database
Dataset 2 - 100 synthesized discharge instructions generated by pre-trained neural models

Detailed instructions on Dataset 1

The 456 discharge instructions in Dataset 1 are from the MIMIC-III Clinical Database, a large, freely available database of deidentified health-related data for patients who stayed in critical care units of the Beth Israel Deaconess Medical Center. We provide all the annotation files in the folder data/annotated_dataset/annotated_files/. There are two types of files in this folder:

Files ending with evt.csv record the key medical events in each discharge instruction.
Files ending with rel.txt record the key medical relations in each discharge instruction.

Each file is identified by a unique identifier of the form row_id-subject_id-hadm_id.
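To work with the annotation files programmatically, it can help to split the identifier into its three parts. The following is a minimal sketch, not part of the repo: parse_note_id and NoteId are hypothetical names, and the sketch assumes the identifier appears at the start of the file name as three hyphen-separated integers.

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional


class NoteId(NamedTuple):
    """Hypothetical container for the three parts of the identifier."""
    row_id: int
    subject_id: int
    hadm_id: int


# Assumption: the file name begins with row_id-subject_id-hadm_id,
# i.e. three integers separated by hyphens, followed by any suffix.
_ID_PATTERN = re.compile(r"^(\d+)-(\d+)-(\d+)")


def parse_note_id(filename: str) -> Optional[NoteId]:
    """Return the parsed identifier, or None if the name does not match."""
    match = _ID_PATTERN.match(Path(filename).name)
    if match is None:
        return None
    return NoteId(*(int(part) for part in match.groups()))
```

Because the pattern only inspects the leading identifier, the same function applies to both the evt.csv and the rel.txt files, which makes it easy to pair the two annotation files that belong to one discharge instruction.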

Due to the data security agreement, we cannot release the discharge instructions in this repo; you will need to obtain them from MIMIC-III yourself. To do so, please first obtain the credential from here. After acquiring the credential, please visit this link to download the file NOTEEVENTS.csv.gz.
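The download is gzip-compressed, while the extraction command takes a path to NOTEEVENTS.csv. Whether the script can read the compressed file directly is not stated here, so you may want to decompress it first. A minimal sketch using only the Python standard library (decompress_gzip is a hypothetical helper, not part of the repo):

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(src: str, dst: str) -> Path:
    """Stream-decompress the gzip file src into dst.

    The data is copied in chunks, so the whole file is never held in memory.
    """
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return Path(dst)


# Example call (paths are placeholders):
# decompress_gzip("PATH/TO/NOTEEVENTS.csv.gz", "PATH/TO/NOTEEVENTS.csv")
```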

Then run the following command to extract the clinical instructions from the downloaded file:

python scripts/data_process/extract_mimic.py \
  --input_file PATH/TO/NOTEEVENTS.csv \
  --anno_dir data/annotated_dataset/annotated_files/ \
  --output_dir data/annotated_dataset/raw_notes/

Once the command finishes, you should see 456 txt files (discharge instructions) in the folder data/annotated_dataset/raw_notes/.
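A quick way to confirm the extraction is to count the txt files in the output folder. This is a minimal sketch; count_txt_files is a hypothetical helper, not part of the repo.

```python
from pathlib import Path

EXPECTED_NOTES = 456  # number of discharge instructions in Dataset 1


def count_txt_files(note_dir: str) -> int:
    """Count the regular files ending in .txt directly inside note_dir."""
    return sum(1 for path in Path(note_dir).glob("*.txt") if path.is_file())


# Example check:
# found = count_txt_files("data/annotated_dataset/raw_notes/")
# print(f"{found} of {EXPECTED_NOTES} notes extracted")
```

A lower count would suggest that some annotation files had no matching note in the NOTEEVENTS file you supplied.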

Creating train/validation/test sets for Key Medical Event Identification

Creating train/validation/test sets for Key Medical Relation Identification

You may use the following command to create the train / validation / test set for medical relation classification:

CUDA_VISIBLE_DEVICES=0 python scripts/data_process/process_rel_cls.py \
  --anno_dir data/annotated_dataset/annotated_files/ \
  --note_dir data/annotated_dataset/raw_notes/ \
  --split_file data/annotated_dataset/split.json \
  --output_dir data/rel_cls/

This command will generate the datasets for relation classification in the directory data/rel_cls/.

Detailed instructions on Dataset 2

We provide 30 synthesized discharge instructions in data/synthesized_dataset/raw_notes/. These discharge instructions were generated using the models in this repo. We also provide the human-annotated cloze questions in data/synthesized_dataset/cloze/.

2. Identifying Key Medical Events

3. Identifying Key Medical Relations

To train and evaluate the medical relation classification model, run the following command:

# We use RoBERTa-large-PM-M3-Voc-hf from https://github.com/facebookresearch/bio-lm
CUDA_VISIBLE_DEVICES=0 python run_classification.py \
  --model_name_or_path path/to/pre-trained/model \
  --train_file data/rel_cls/train.json \
  --validation_file data/rel_cls/test.json \
  --max_length 512 \
  --per_device_train_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 5 \
  --output_dir path/to/output/directory \
  --with_tracking \
  --pos_weight 1.5
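How run_classification.py uses --pos_weight is not documented in this README. One common reading of such a flag is a multiplier on the loss of positive examples in a binary cross-entropy objective, which raises the cost of missed positives when they are rarer than negatives. The sketch below illustrates that reading only; weighted_bce_with_logits is a hypothetical function and may not match the script's actual loss.

```python
import math


def weighted_bce_with_logits(logit: float, label: int, pos_weight: float = 1.5) -> float:
    """Binary cross-entropy on a raw logit, with positives scaled by pos_weight.

    Assumption: this mirrors the usual positive-class weighting; it is an
    illustration, not the repo's implementation.
    """
    prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    positive_term = pos_weight * label * math.log(prob)
    negative_term = (1 - label) * math.log(1.0 - prob)
    return -(positive_term + negative_term)
```

With pos_weight set to 1.5, a positive example at a given logit costs 1.5 times what it would cost unweighted, while negatives are unaffected.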
