Thea Rolskov Sloth & Astrid Sletten Rybner
This repository contains code for reproducing our analysis of gender bias in Danish pre-trained word embeddings. The pipeline has two steps: 1. removing gender bias from the word embeddings with hard-debiasing (Bolukbasi et al., 2016), and 2. assessing bias in the word embeddings with the Word Embedding Association Test (WEAT; Caliskan et al., 2017).
The first part (removing bias) produces a debiased version of the input word embedding, which is saved to the `embeddings` folder.
The second part (assessing bias) produces WEAT scores for two of the gender biases from Caliskan et al. (2017): career-family and math-arts.
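For intuition, the WEAT effect size can be sketched as below. This is a hedged, self-contained illustration with random toy vectors, not the implementation used in `assess_bias`; the function names and the 3-d vectors are our own stand-ins.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def association(w, A, B):
    # s(w, A, B): mean similarity to attribute set A minus mean similarity to B
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # Difference of mean associations of target sets X and Y,
    # normalised by the standard deviation over all target words
    assoc = [association(w, A, B) for w in X + Y]
    mean_x = np.mean(assoc[:len(X)])
    mean_y = np.mean(assoc[len(X):])
    return (mean_x - mean_y) / np.std(assoc)

# Toy stand-ins: X = "career" words, Y = "family" words,
# A = male attribute words, B = female attribute words
rng = np.random.default_rng(0)
X = [rng.normal(size=3) for _ in range(4)]
Y = [rng.normal(size=3) for _ in range(4)]
A = [rng.normal(size=3) for _ in range(4)]
B = [rng.normal(size=3) for _ in range(4)]
print(weat_effect_size(X, Y, A, B))
```

With equal-sized target sets the effect size is bounded in [-2, 2]; values near 0 indicate little measured association.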
The repository contains one folder with the code for each of these two steps. The output from running each step is saved to the `output` folder.
| Folder | Description |
|---|---|
| `assess_bias` | scripts for assessing bias |
| `debias` | scripts for removing bias |
| `embeddings` | folder for original/debiased embeddings |
| `output` | output folder for WEAT scores and plots |
This repository contains an example run of debiasing the pre-trained word embedding CONLL-2017 from daNLP. To reproduce the analysis, you need to clone this repository and install the required packages with:
```
git clone https://github.com/DaDebias/cool_programmer_tshirts2.0
cd cool_programmer_tshirts2.0
pip install -r requirements.txt
```
You can then run the pipeline on the CONLL-17 embedding by following the steps below.
First, train a classifier that determines whether words in the embedding are gender-specific or gender-neutral.
```
cd cool_programmer_tshirts2.0/debias
python learn_gender_specific.py --embedding_filename 'conll17.da.wv' --model_alias 'conll17da'
```
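For intuition, the idea behind this step (following Bolukbasi et al., 2016, who train a linear classifier on seed lists of gender-specific and gender-neutral words) can be sketched as below. This toy example uses synthetic 2-d vectors and a logistic-regression stand-in, not the repository's actual script:

```python
import numpy as np

# Hypothetical training data: embedding vectors for seed words, with
# gender-specific words (e.g. "han", "mor") labelled 1 and
# gender-neutral words (e.g. "hus", "bil") labelled 0
rng = np.random.default_rng(1)
specific = rng.normal(loc=2.0, size=(20, 2))
neutral = rng.normal(loc=-2.0, size=(20, 2))
X = np.vstack([specific, neutral])
y = np.array([1] * 20 + [0] * 20)

# Logistic regression fitted by gradient descent; the original paper
# uses a linear SVM, but the principle is the same
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

# The fitted classifier is then applied to the full vocabulary to
# extend the seed lists; here we just check the training accuracy
pred = ((X @ w + b) > 0).astype(int)
print((pred == y).mean())
```

The words flagged as gender-neutral by such a classifier are the ones that the debiasing step is allowed to alter.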
Using this list, you can now debias the word embedding with:
```
python debias.py --embedding_filename 'conll17.da.wv' --debiased_filename 'debiased_model.bin' --model_alias 'conll17da'
```
This will produce a debiased version of the word embedding, which is saved in the `embeddings` folder.
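Conceptually, hard-debiasing neutralizes gender-neutral words by removing their component along a learned gender direction. The snippet below is a minimal sketch of that projection step, assuming a hypothetical 3-d embedding and a stand-in gender direction `g`; the actual scripts derive the direction from gendered word pairs and also equalize pairs such as "han"/"hun", which this sketch omits.

```python
import numpy as np

# Stand-in gender direction, e.g. the normalised difference of the
# vectors for "han" and "hun" (hypothetical here)
g = np.array([1.0, 0.0, 0.0])
g = g / np.linalg.norm(g)

def neutralize(w, g):
    # Remove the component of w along the gender direction, then re-normalise
    w_neutral = w - np.dot(w, g) * g
    return w_neutral / np.linalg.norm(w_neutral)

w = np.array([0.5, 0.3, 0.8])  # hypothetical vector for a gender-neutral word
w_debiased = neutralize(w, g)
print(np.dot(w_debiased, g))  # the gender component is now zero
```

After this step, a gender-neutral word is equidistant from both ends of the gender direction, which is what the WEAT assessment below measures.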
You can now assess bias in the original and debiased word embedding with:
```
cd cool_programmer_tshirts2.0/assess_bias
python main.py --embedding_filename 'conll17.da.wv' --debiased_filename 'debiased_model.bin' --model_alias 'conll17da'
```
Our analysis applied these two steps to the CONLL-17 model from daNLP.
If you wish to try the method on other embeddings, simply replace --embedding_filename and --model_alias in the commands above with any of these embedding names:
- 'conll17.da.wv'
- 'wiki.da.wv'
- 'cc.da.wv'
If you have downloaded a pre-trained word embedding as a txt file, you can run the pipeline on it by placing it in the `embeddings` folder, running the steps above, and replacing the --embedding_filename and --model_alias arguments with the name of the embedding.
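For reference, a word2vec-style txt embedding typically has one word per line followed by its vector components. A minimal, hypothetical loader is sketched below (the pipeline itself may rely on gensim instead); the file contents here are made up:

```python
import io
import numpy as np

def load_txt_embedding(fh):
    # Parse lines of the form "<word> <v1> <v2> ..." into a dict of vectors
    vectors = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines or a leading "<vocab_size> <dim>" header
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

# Toy two-word, 3-dimensional embedding standing in for a real txt file
sample = io.StringIO("hus 0.1 0.2 0.3\nbil 0.4 0.5 0.6\n")
emb = load_txt_embedding(sample)
print(sorted(emb), emb["hus"].shape)
```

Note that some word2vec text files start with a `<vocab_size> <dim>` header line, which the length check above happens to skip only when the header has fewer than two fields; a robust loader should detect and skip it explicitly.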
If you have any questions regarding the project itself or the code implementation, feel free to contact us via e-mail: Thea Rolskov Sloth & Astrid Sletten Rybner
We would like to give special thanks to the following projects for providing code:
- Bolukbasi et al. (2016), authors of the main code used for debiasing word embeddings.
- Caliskan et al. (2017), authors of the article behind the WEAT test.
- Millie Søndergaard, who created the Python implementation of the WEAT used in this analysis.
- daNLP for providing pre-trained word embeddings.
- Sofie Ditmer and Astrid Nørgaard for lending us a word embedding trained on the Danish Gigaword corpus.