Thea Rolskov Sloth & Astrid Sletten Rybner
This repository contains code for reproducing our analysis of gender bias in Danish pre-trained word embeddings. The pipeline has two steps: 1. removing gender bias from the word embeddings with hard-debiasing (Bolukbasi et al., 2016), and 2. assessing bias in the word embeddings with the Word Embedding Association Test (WEAT; Caliskan et al., 2017).
The first part (removing bias) produces a debiased version of the input word embedding, which is saved to the `embeddings` folder.
The second part (assessing bias) produces WEAT scores for two of the gender biases from Caliskan et al. (2017): career-family and math-arts.
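For intuition, the WEAT effect size can be sketched as below. This is a hedged, self-contained illustration with random toy vectors, not the implementation used in `assess_bias`; the function names and the 3-d vectors are our own stand-ins.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def association(w, A, B):
    # s(w, A, B): mean similarity to attribute set A minus mean similarity to B
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # Difference of mean associations of target sets X and Y,
    # normalised by the standard deviation over all target words
    assoc = [association(w, A, B) for w in X + Y]
    mean_x = np.mean(assoc[:len(X)])
    mean_y = np.mean(assoc[len(X):])
    return (mean_x - mean_y) / np.std(assoc)

# Toy stand-ins: X = "career" words, Y = "family" words,
# A = male attribute words, B = female attribute words
rng = np.random.default_rng(0)
X = [rng.normal(size=3) for _ in range(4)]
Y = [rng.normal(size=3) for _ in range(4)]
A = [rng.normal(size=3) for _ in range(4)]
B = [rng.normal(size=3) for _ in range(4)]
print(weat_effect_size(X, Y, A, B))
```

With equal-sized target sets the effect size is bounded in [-2, 2]; values near 0 indicate little measured association.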
The repository contains one folder with the code for each of these two steps. The output from running each step is saved to the `output` folder.
| Folder | Description |
|---|---|
| `assess_bias` | scripts for assessing bias |
| `debias` | scripts for removing bias |
| `embeddings` | folder for original/debiased embeddings |
| `output` | output folder for WEAT scores and plots |
This repository contains an example run of debiasing the pre-trained word embedding CONLL-2017 from daNLP. To reproduce the analysis, you need to clone this repository and install the required packages with:
```
git clone https://github.com/DaDebias/cool_programmer_tshirts2.0
cd cool_programmer_tshirts2.0
pip install -r requirements.txt
```
You can then run the pipeline on the CONLL-17 embedding by following the steps below.
First, train a classifier that determines whether words in the embedding are gender-specific or gender-neutral.
```
cd cool_programmer_tshirts2.0/debias
python learn_gender_specific.py --embedding_filename 'conll17.da.wv' --model_alias 'conll17da'
```
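For intuition, the idea behind this step (following Bolukbasi et al., 2016, who train a linear classifier on seed lists of gender-specific and gender-neutral words) can be sketched as below. This toy example uses synthetic 2-d vectors and a logistic-regression stand-in, not the repository's actual script:

```python
import numpy as np

# Hypothetical training data: embedding vectors for seed words, with
# gender-specific words (e.g. "han", "mor") labelled 1 and
# gender-neutral words (e.g. "hus", "bil") labelled 0
rng = np.random.default_rng(1)
specific = rng.normal(loc=2.0, size=(20, 2))
neutral = rng.normal(loc=-2.0, size=(20, 2))
X = np.vstack([specific, neutral])
y = np.array([1] * 20 + [0] * 20)

# Logistic regression fitted by gradient descent; the original paper
# uses a linear SVM, but the principle is the same
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

# The fitted classifier is then applied to the full vocabulary to
# extend the seed lists; here we just check the training accuracy
pred = ((X @ w + b) > 0).astype(int)
print((pred == y).mean())
```

The words flagged as gender-neutral by such a classifier are the ones that the debiasing step is allowed to alter.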
Using this list, you can now debias the word embedding with:
```
python debias.py --embedding_filename 'conll17.da.wv' --debiased_filename 'debiased_model.bin' --model_alias 'conll17da'
```
This will produce a debiased version of the word embedding, which is saved in the `embeddings` folder.
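Conceptually, hard-debiasing neutralizes gender-neutral words by removing their component along a learned gender direction. The snippet below is a minimal sketch of that projection step, assuming a hypothetical 3-d embedding and a stand-in gender direction `g`; the actual scripts derive the direction from gendered word pairs and also equalize pairs such as "han"/"hun", which this sketch omits.

```python
import numpy as np

# Stand-in gender direction, e.g. the normalised difference of the
# vectors for "han" and "hun" (hypothetical here)
g = np.array([1.0, 0.0, 0.0])
g = g / np.linalg.norm(g)

def neutralize(w, g):
    # Remove the component of w along the gender direction, then re-normalise
    w_neutral = w - np.dot(w, g) * g
    return w_neutral / np.linalg.norm(w_neutral)

w = np.array([0.5, 0.3, 0.8])  # hypothetical vector for a gender-neutral word
w_debiased = neutralize(w, g)
print(np.dot(w_debiased, g))  # the gender component is now zero
```

After this step, a gender-neutral word is equidistant from both ends of the gender direction, which is what the WEAT assessment below measures.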
You can now assess bias in the original and debiased word embedding with:
```
cd cool_programmer_tshirts2.0/assess_bias
python main.py --embedding_filename 'conll17.da.wv' --debiased_filename 'debiased_model.bin' --model_alias 'conll17da'
```
Our analysis applied these two steps to the CONLL-17 model from daNLP.
If you wish to try the method on other embeddings, simply replace --embedding_filename and --model_alias in the commands above with any of these embedding names:
- 'conll17.da.wv'
- 'wiki.da.wv'
- 'cc.da.wv'
If you have downloaded a pre-trained word embedding as a txt file, you can run the pipeline on it by placing it in the `embeddings` folder, running the steps above, and replacing the --embedding_filename and --model_alias arguments with the name of the embedding.
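For reference, a word2vec-style txt embedding typically has one word per line followed by its vector components. A minimal, hypothetical loader is sketched below (the pipeline itself may rely on gensim instead); the file contents here are made up:

```python
import io
import numpy as np

def load_txt_embedding(fh):
    # Parse lines of the form "<word> <v1> <v2> ..." into a dict of vectors
    vectors = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines or a leading "<vocab_size> <dim>" header
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

# Toy two-word, 3-dimensional embedding standing in for a real txt file
sample = io.StringIO("hus 0.1 0.2 0.3\nbil 0.4 0.5 0.6\n")
emb = load_txt_embedding(sample)
print(sorted(emb), emb["hus"].shape)
```

Note that some word2vec text files start with a `<vocab_size> <dim>` header line, which the length check above happens to skip only when the header has fewer than two fields; a robust loader should detect and skip it explicitly.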
If you have any questions regarding the project itself or the code implementation, feel free to contact us via e-mail: Thea Rolskov Sloth & Astrid Sletten Rybner
We would like to give special thanks to the following projects for providing code:
- Bolukbasi et al. (2016), authors of the main code used for debiasing word embeddings.
- Caliskan et al. (2017), authors of the article behind the WEAT test.
- Millie Søndergaard, who created the Python implementation of the WEAT used in this analysis.
- daNLP for providing pre-trained word embeddings.
- Sofie Ditmer and Astrid Nørgaard for lending us a word embedding trained on the Danish Gigaword corpus.