Hurtful Words: Quantifying Biases in Clinical Contextual Word Embeddings
Paper
If you use this code in your research, please cite the following publication:
Haoran Zhang, Amy X. Lu, Mohamed Abdalla, Matthew McDermott, and Marzyeh Ghassemi. 2020.
Hurtful words: quantifying biases in clinical contextual word embeddings.
In Proceedings of the ACM Conference on Health, Inference, and Learning (CHIL ’20).
Association for Computing Machinery, New York, NY, USA, 110–120.
A publicly available version of this paper is also on arXiv.
Pretrained Models
The pretrained BERT models used in our experiments are available to download here:
Step 0: Environment and Prerequisites
- Before starting, go to the MIMIC-benchmarks repository, and follow all of the steps in the Building a benchmark section.
- Run the following commands to clone this repo and create the Conda environment:
git clone https://github.com/MLforHealth/HurtfulWords.git
cd HurtfulWords/
conda create -y -n hurtfulwords python=3.7
conda activate hurtfulwords
pip install -r requirements.txt
Step 1: Data processing
Reads in the tables from MIMIC and pregenerates data for clinical BERT pretraining. Reads in the cohorts defined by MIMIC-benchmarks and creates tasks for finetuning on downstream targets.
- In bash_scripts/data_processing_pipeline.sh, update BASE_DIR, OUTPUT_DIR, SCIBERT_DIR, and MIMIC_BENCHMARK_DIR.
- In scripts/get_data.py, update the database connection credentials on line 13. If your MIMIC-III is not loaded into a database, you will have to update this script accordingly.
- Run bash_scripts/data_processing_pipeline.sh. This script will require at least 50 GB of RAM, 100 GB of disk space in OUTPUT_DIR, and will take several days to complete.
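Because the pipeline runs for days, it can be worth checking free disk space before launching it. The following is a minimal sketch, not part of this repository: the helper name has_enough_disk is hypothetical, and the 100 GB threshold simply mirrors the requirement stated above.

```python
import shutil

GB = 10 ** 9  # decimal gigabyte, matching the "100 GB" figure above


def has_enough_disk(path, required_gb=100):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * GB


if __name__ == "__main__":
    output_dir = "."  # placeholder: replace with the value of your OUTPUT_DIR
    if not has_enough_disk(output_dir):
        print("Warning: less than 100 GB free in", output_dir)
```

This only checks disk space; the 50 GB RAM requirement would need a separate check suited to your system.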
Step 2: Training Baseline Clinical BERT
Pretrains baseline clinical BERT (initialized from SciBERT) for 1 epoch on sequences of length 128, then 1 epoch on sequences of length 512.
- In bash_scripts/train_baseline_clinical_BERT.sh, update BASE_DIR, OUTPUT_DIR, and SCIBERT_DIR. These variables should have the same values as in step 1.
- Run bash_scripts/train_baseline_clinical_BERT.sh on a GPU cluster. The resultant model will be saved in ${OUTPUT_DIR}/models/baseline_clinical_BERT_1_epoch_512/.
Step 3: Training Adversarial Clinical BERT
Pretrains clinical BERT (initialized from SciBERT) with adversarial debiasing using gender as the protected attribute, for 1 epoch on sequences of length 128, then 1 epoch on sequences of length 512.
- In bash_scripts/train_adv_clinical_bert.sh, update BASE_DIR, OUTPUT_DIR, and SCIBERT_DIR. These variables should have the same values as in step 1.
- Run bash_scripts/train_adv_clinical_bert.sh gender on a GPU cluster. The resultant model will be saved in ${OUTPUT_DIR}/models/adv_clinical_BERT_gender_1_epoch_512/.
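Before moving on to Step 4, you may want to confirm that both pretraining runs wrote their output folders. The sketch below is an illustration only: the helper missing_model_dirs is hypothetical and not part of this repository, and it checks only that the two directories named in Steps 2 and 3 exist, not that the checkpoints inside them are complete.

```python
import os

# Directory names taken from the save locations given in Steps 2 and 3.
EXPECTED_MODEL_DIRS = (
    "baseline_clinical_BERT_1_epoch_512",
    "adv_clinical_BERT_gender_1_epoch_512",
)


def missing_model_dirs(output_dir):
    """Return the expected model folder names that do not exist under <output_dir>/models."""
    models_root = os.path.join(output_dir, "models")
    return [
        name
        for name in EXPECTED_MODEL_DIRS
        if not os.path.isdir(os.path.join(models_root, name))
    ]


if __name__ == "__main__":
    output_dir = "."  # placeholder: replace with the value of your OUTPUT_DIR
    for name in missing_model_dirs(output_dir):
        print("Missing model directory:", name)
```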
Step 4: Finetuning on Downstream Tasks
Generates static BERT representations for the downstream tasks created in Step 1. Trains various neural networks (grid searching over hyperparameters) on these tasks.
- In bash_scripts/pregen_embs.sh, update BASE_DIR and OUTPUT_DIR. Run this script on a GPU cluster.
- In bash_scripts/finetune_on_target.sh, update BASE_DIR and OUTPUT_DIR. This script will output a trained model for a particular (target, model) combination, in the ${OUTPUT_DIR}/models/finetuned/ folder. The Python script bash_scripts/run_clinical_targets.py will queue up the 114 total (target, model) experiments conducted, as Slurm jobs. This script will have to be modified accordingly for other systems.
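If you need to adapt the job queuing to your own system, the general pattern is to enumerate every (target, model) pair and build one submission command for each. The sketch below is a hedged illustration, not the contents of run_clinical_targets.py: the function build_sbatch_commands, the target and model names, and the order of arguments passed to finetune_on_target.sh are all assumptions.

```python
import itertools
import shlex


def build_sbatch_commands(targets, models, script="bash_scripts/finetune_on_target.sh"):
    """Return one `sbatch` command string per (target, model) pair.

    The argument order (target first, then model) is an assumption made for
    illustration; check the script itself for the arguments it expects.
    """
    commands = []
    for target, model in itertools.product(targets, models):
        args = ["sbatch", script, target, model]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


if __name__ == "__main__":
    # Hypothetical names, used only to show the shape of the output.
    example_targets = ["target_a", "target_b"]
    example_models = ["baseline_clinical_BERT_1_epoch_512"]
    for command in build_sbatch_commands(example_targets, example_models):
        print(command)
```

On a scheduler other than Slurm, only the command prefix should need to change; the enumeration of pairs stays the same.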
Step 5: Analyze Downstream Task Results
Evaluates test-set predictions of the trained models by generating various fairness metrics.
- In bash_scripts/analyze_results.sh, update BASE_DIR and OUTPUT_DIR. Run this script, which will output a .xlsx file containing fairness metrics to each of the finetuned model folders.
- The Jupyter Notebook notebooks/MergeResults.ipynb will read in each of the generated metrics files, which can then be viewed in the notebook.
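As an illustration of what a group fairness metric looks like, the sketch below computes a recall gap: the difference in recall (true positive rate) between two groups on binary predictions. This is an example under stated assumptions, not the code behind analyze_results.sh; the function names are hypothetical, and the exact set of metrics written to the .xlsx files is not listed here.

```python
def recall(y_true, y_pred):
    """Recall (true positive rate) for binary labels; NaN if there are no positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        return float("nan")
    return tp / (tp + fn)


def recall_gap(y_true, y_pred, groups, group_a, group_b):
    """Recall of group_a minus recall of group_b."""

    def subset(group):
        idx = [i for i, g in enumerate(groups) if g == group]
        return [y_true[i] for i in idx], [y_pred[i] for i in idx]

    return recall(*subset(group_a)) - recall(*subset(group_b))


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    y_true = [1, 1, 1, 1, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 1]
    groups = ["M", "M", "F", "F", "M", "F"]
    print(recall_gap(y_true, y_pred, groups, "M", "F"))
```

A gap of zero means the model identifies true positives equally often in both groups; the sign shows which group is favored.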
Step 6: Log Probability Bias Scores
Following procedures in Kurita et al., we calculate the 'log probability bias score' to evaluate biases in the BERT model. Template sentences should be in the example format provided by fill_in_blanks_examples/templates.txt. A CSV file denoting context key words and the context category should also be supplied (see fill_in_blanks_examples/attributes.csv).
This step can be done independently of steps 4 and 5.
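For readers unfamiliar with the score, here is a minimal sketch of the arithmetic as described by Kurita et al., assuming the masked-token probabilities have already been obtained from the model. For each target word, the probability of that word in the template (p_target) is normalized by its prior probability when the attribute is also masked (p_prior); the bias score is the difference of the resulting log ratios for two targets. The function names are hypothetical, and how the scripts in this repository compute the probabilities is not shown here.

```python
import math


def increased_log_prob(p_target, p_prior):
    """log(p_target / p_prior): how much the attribute context raises the target's probability."""
    return math.log(p_target / p_prior)


def log_probability_bias_score(p_target_a, p_prior_a, p_target_b, p_prior_b):
    """Difference between the increased log probabilities of target A and target B."""
    return increased_log_prob(p_target_a, p_prior_a) - increased_log_prob(
        p_target_b, p_prior_b
    )


if __name__ == "__main__":
    # Made-up probabilities for illustration only.
    print(log_probability_bias_score(0.2, 0.1, 0.1, 0.1))
```

A positive score indicates that the attribute context favors target A over target B relative to their priors.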
- In bash_scripts/log_probability.sh, update BASE_DIR, OUTPUT_DIR, and MODEL_NAME. Run this script.
- The statistical significance results can be found in ${OUTPUT_DIR}/${MODEL_NAME}_log_scores.tsv.
- The notebook notebooks/GetBasePrevs.ipynb computes the base prevalences for categories in the notes.
Step 7: Sentence Completion
scripts/predict_missing.py takes template sentences which contain _ for tokens to be predicted. Template sentences can be specified directly in the script.
This step can be done independently of steps 1-6.
- In scripts/predict_missing.py, update SCIBERT_DIR. Run this script in the Conda environment. The results will be printed to the screen.
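To make the template format concrete, the sketch below shows one plausible way a template with _ placeholders could be turned into input for a masked language model, by replacing each standalone _ with BERT's [MASK] token. The helper mask_template is hypothetical, and this is an assumption about the general approach rather than a description of what predict_missing.py does internally.

```python
def mask_template(template, blank="_", mask_token="[MASK]"):
    """Replace each standalone `blank` token with `mask_token`.

    Tokens are split on whitespace, so runs of spaces collapse to one space.
    """
    return " ".join(
        mask_token if token == blank else token for token in template.split()
    )


if __name__ == "__main__":
    # Made-up template for illustration only.
    print(mask_template("the patient is a _ year old _ ."))
```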