Code base used for NLP project 2020.
By Daniel de Vassimon Manela, Boris van Breugel, Tom Fisher, David Errington
`process_ontonotes.ipynb`: Loads the OntoNotes Release 5.0 data from GitHub and processes the raw data into a format suitable for modelling. The notebook also depends on data loaded from the UCLA NLP Group; both files are loaded within the notebook. Running the notebook from start to finish produces `original_data.csv` and `flipped_data.csv`, which are used to train a BERT masked language model.
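For illustration, here is a minimal sketch of the kind of rule-based gender flipping that could turn sentences from `original_data.csv` into `flipped_data.csv`. The word list and the `flip_gender` helper are hypothetical, not the notebook's actual code, and a real pipeline would need POS information to resolve ambiguous words such as "her" (him/his).

```python
# Hypothetical sketch of rule-based gender flipping; the notebook's actual
# word list and handling of ambiguous cases may differ.
import re

# Bidirectional swap list for gendered words (illustrative, not exhaustive).
# Note: "her" maps back to "his" here; disambiguating him/his needs POS tags.
PAIRS = [("he", "she"), ("him", "her"), ("his", "her"),
         ("himself", "herself"), ("man", "woman"), ("men", "women"),
         ("father", "mother"), ("son", "daughter")]
SWAP = {a: b for a, b in PAIRS}
SWAP.update({b: a for a, b in PAIRS})

def flip_gender(sentence: str) -> str:
    """Swap gendered words in both directions, preserving capitalisation."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        flipped = SWAP[word.lower()]
        return flipped.capitalize() if word[0].isupper() else flipped
    pattern = r"\b(" + "|".join(SWAP) + r")\b"
    return re.sub(pattern, repl, sentence, flags=re.IGNORECASE)

print(flip_gender("He told his father."))  # -> "She told her mother."
```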
`BERT_fine_tuning_full_ontonotes.ipynb`: Loads the output of `process_ontonotes.ipynb` and trains a BERT masked language model using the HuggingFace library, as described in Section 4.3 of our paper. Be sure to comment/uncomment the data-loading lines correctly: data-augmented models require loading both `original_data.csv` and `flipped_data.csv`; to train regular unaugmented fine-tuned models, load only `original_data.csv`.
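A minimal sketch of masked-language-model fine-tuning with HuggingFace `transformers` is shown below. The hyperparameters, output directory, and the `text` column name are placeholders, not the notebook's actual settings.

```python
# Sketch of MLM fine-tuning with HuggingFace transformers; hyperparameters
# and the "text" column name are placeholders, not the notebook's settings.
import pandas as pd
import torch
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Data-augmented run: concatenate original and flipped data.
# For an unaugmented run, load original_data.csv only.
df = pd.concat([pd.read_csv("original_data.csv"),
                pd.read_csv("flipped_data.csv")])
encodings = tokenizer(df["text"].tolist(), truncation=True,
                      padding="max_length", max_length=128)

class MLMDataset(torch.utils.data.Dataset):
    """Wraps tokenised sentences for the Trainer."""
    def __init__(self, enc):
        self.enc = enc
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.enc.items()}

# The collator applies standard BERT masking (15% of tokens) on the fly.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="finetuned_bert",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=MLMDataset(encodings)).train()
```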
`bias_analysis.ipynb`: Analyses stereotype and skew bias in out-of-the-box ELMo, BERT, DistilBERT and RoBERTa using the WinoBias dataset. Also includes Online Skewness Mitigation for different BERT variants, visualisations of professional embedding bias, and a comparison of bias in BERT against industry statistics.
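To illustrate the kind of measurement involved, the sketch below masks the pronoun in a WinoBias-style sentence and compares BERT's probabilities for "he" and "she". The sentence and the raw probability comparison are illustrative only; the paper's stereotype and skew metrics aggregate such probabilities over the full dataset.

```python
# Sketch: compare P(he) vs P(she) at a masked pronoun position with BERT.
# Illustrative only; the actual metrics aggregate over all of WinoBias.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

sentence = (f"The developer argued with the designer because "
            f"{tokenizer.mask_token} did not like the design.")
inputs = tokenizer(sentence, return_tensors="pt")
# Locate the [MASK] position in the tokenised input.
mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_idx]
probs = logits.softmax(dim=-1)

he_id = tokenizer.convert_tokens_to_ids("he")
she_id = tokenizer.convert_tokens_to_ids("she")
print(f"P(he)={probs[he_id]:.4f}  P(she)={probs[she_id]:.4f}")
```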
We use the Winogender Schemas (in particular, the `data/templates.tsv` file) to obtain more general benchmarks of algorithmic bias. The processed examples are stored in `processed_wino_data.txt`.
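The sketch below shows one way such templates could be expanded into concrete sentences. It assumes the standard `templates.tsv` layout (tab-separated columns `occupation`, `other-participant`, `answer`, `sentence`, with `$OCCUPATION`/`$PARTICIPANT`/`$NOM_PRONOUN`-style placeholders); the actual format written to `processed_wino_data.txt` may differ.

```python
# Hedged sketch of expanding Winogender templates into concrete sentences.
# Assumes the standard templates.tsv columns and $-style placeholders; the
# real format of processed_wino_data.txt may differ.
import csv

PRONOUNS = {  # nominative / possessive / accusative forms per gender
    "male": {"$NOM_PRONOUN": "he", "$POSS_PRONOUN": "his",
             "$ACC_PRONOUN": "him"},
    "female": {"$NOM_PRONOUN": "she", "$POSS_PRONOUN": "her",
               "$ACC_PRONOUN": "her"},
}

with open("data/templates.tsv") as f_in, \
     open("processed_wino_data.txt", "w") as f_out:
    reader = csv.DictReader(f_in, delimiter="\t")
    for row in reader:
        for gender, forms in PRONOUNS.items():
            sentence = row["sentence"]
            sentence = sentence.replace("$OCCUPATION", row["occupation"])
            sentence = sentence.replace("$PARTICIPANT",
                                        row["other-participant"])
            for placeholder, pronoun in forms.items():
                sentence = sentence.replace(placeholder, pronoun)
            f_out.write(f"{gender}\t{sentence}\n")
```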