Code base used for NLP project 2020.
By Daniel de Vassimon Manela, Boris van Breugel, Tom Fisher, David Errington
`process_ontonotes.ipynb`: Loads the OntoNotes Release 5.0 data from GitHub and processes the raw data into a format suitable for modelling. The notebook also depends on data loaded from the UCLA NLP Group; both files are loaded within the notebook. Running the notebook from start to finish produces `original_data.csv` and `flipped_data.csv`, which are used to train a BERT masked language model.
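For illustration, here is a minimal sketch of the kind of rule-based gender flipping that could turn sentences from `original_data.csv` into `flipped_data.csv`. The word list and the `flip_gender` helper are hypothetical, not the notebook's actual code, and a real pipeline would need POS information to resolve ambiguous words such as "her" (him/his).

```python
# Hypothetical sketch of rule-based gender flipping; the notebook's actual
# word list and handling of ambiguous cases may differ.
import re

# Bidirectional swap list for gendered words (illustrative, not exhaustive).
# Note: "her" maps back to "his" here; disambiguating him/his needs POS tags.
PAIRS = [("he", "she"), ("him", "her"), ("his", "her"),
         ("himself", "herself"), ("man", "woman"), ("men", "women"),
         ("father", "mother"), ("son", "daughter")]
SWAP = {a: b for a, b in PAIRS}
SWAP.update({b: a for a, b in PAIRS})

def flip_gender(sentence: str) -> str:
    """Swap gendered words in both directions, preserving capitalisation."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        flipped = SWAP[word.lower()]
        return flipped.capitalize() if word[0].isupper() else flipped
    pattern = r"\b(" + "|".join(SWAP) + r")\b"
    return re.sub(pattern, repl, sentence, flags=re.IGNORECASE)

print(flip_gender("He told his father."))  # -> "She told her mother."
```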
`BERT_fine_tuning_full_ontonotes.ipynb`: Loads the output of `process_ontonotes.ipynb` and trains a BERT masked language model using the HuggingFace library, as described in Section 4.3 of our paper. Be sure to comment/uncomment the data-loading lines correctly: data-augmented models require loading both `original_data.csv` and `flipped_data.csv`; to train regular unaugmented fine-tuned models, load only `original_data.csv`.
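A minimal sketch of masked-language-model fine-tuning with HuggingFace `transformers` is shown below. The hyperparameters, output directory, and the `text` column name are placeholders, not the notebook's actual settings.

```python
# Sketch of MLM fine-tuning with HuggingFace transformers; hyperparameters
# and the "text" column name are placeholders, not the notebook's settings.
import pandas as pd
import torch
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Data-augmented run: concatenate original and flipped data.
# For an unaugmented run, load original_data.csv only.
df = pd.concat([pd.read_csv("original_data.csv"),
                pd.read_csv("flipped_data.csv")])
encodings = tokenizer(df["text"].tolist(), truncation=True,
                      padding="max_length", max_length=128)

class MLMDataset(torch.utils.data.Dataset):
    """Wraps tokenised sentences for the Trainer."""
    def __init__(self, enc):
        self.enc = enc
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.enc.items()}

# The collator applies standard BERT masking (15% of tokens) on the fly.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="finetuned_bert",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=MLMDataset(encodings)).train()
```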
`bias_analysis.ipynb`: Analyses stereotype and skew bias in out-of-the-box ELMo, BERT, DistilBERT and RoBERTa using the WinoBias dataset. Also includes Online Skewness Mitigation for different BERT variants, visualisations of professional embedding bias, and a comparison of bias in BERT against industry statistics.
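To illustrate the kind of measurement involved, the sketch below masks the pronoun in a WinoBias-style sentence and compares BERT's probabilities for "he" and "she". The sentence and the raw probability comparison are illustrative only; the paper's stereotype and skew metrics aggregate such probabilities over the full dataset.

```python
# Sketch: compare P(he) vs P(she) at a masked pronoun position with BERT.
# Illustrative only; the actual metrics aggregate over all of WinoBias.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

sentence = (f"The developer argued with the designer because "
            f"{tokenizer.mask_token} did not like the design.")
inputs = tokenizer(sentence, return_tensors="pt")
# Locate the [MASK] position in the tokenised input.
mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_idx]
probs = logits.softmax(dim=-1)

he_id = tokenizer.convert_tokens_to_ids("he")
she_id = tokenizer.convert_tokens_to_ids("she")
print(f"P(he)={probs[he_id]:.4f}  P(she)={probs[she_id]:.4f}")
```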
We use the Winogender Schemas (in particular, the `data/templates.tsv` file) to obtain more general benchmarks of algorithmic bias. The processed examples are stored in `processed_wino_data.txt`.
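The sketch below shows one way such templates could be expanded into concrete sentences. It assumes the standard `templates.tsv` layout (tab-separated columns `occupation`, `other-participant`, `answer`, `sentence`, with `$OCCUPATION`/`$PARTICIPANT`/`$NOM_PRONOUN`-style placeholders); the actual format written to `processed_wino_data.txt` may differ.

```python
# Hedged sketch of expanding Winogender templates into concrete sentences.
# Assumes the standard templates.tsv columns and $-style placeholders; the
# real format of processed_wino_data.txt may differ.
import csv

PRONOUNS = {  # nominative / possessive / accusative forms per gender
    "male": {"$NOM_PRONOUN": "he", "$POSS_PRONOUN": "his",
             "$ACC_PRONOUN": "him"},
    "female": {"$NOM_PRONOUN": "she", "$POSS_PRONOUN": "her",
               "$ACC_PRONOUN": "her"},
}

with open("data/templates.tsv") as f_in, \
     open("processed_wino_data.txt", "w") as f_out:
    reader = csv.DictReader(f_in, delimiter="\t")
    for row in reader:
        for gender, forms in PRONOUNS.items():
            sentence = row["sentence"]
            sentence = sentence.replace("$OCCUPATION", row["occupation"])
            sentence = sentence.replace("$PARTICIPANT",
                                        row["other-participant"])
            for placeholder, pronoun in forms.items():
                sentence = sentence.replace(placeholder, pronoun)
            f_out.write(f"{gender}\t{sentence}\n")
```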