Code base used for NLP project 2020.
By Daniel de Vassimon Manela, Boris van Breugel, Tom Fisher, David Errington
process_ontonotes.ipynb: Loads the OntoNotes Release 5.0 data from GitHub and processes the raw data into a format suitable for modelling. The notebook also depends on data from the UCLA NLP Group. Both datasets are loaded within the notebook. Running the notebook from start to finish produces original_data.csv and flipped_data.csv, which are used to train a BERT Masked Language Model.

BERT_fine_tuning_full_ontonotes.ipynb: Loads the output of process_ontonotes.ipynb and trains a BERT Masked Language Model using the Hugging Face library. Be sure to comment or uncomment the appropriate data-loading lines. Data-augmented models require loading both original_data.csv and flipped_data.csv. To train a regular, unaugmented fine-tuned model, load only original_data.csv.
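
As an alternative to commenting and uncommenting the data-loading lines, the choice between the augmented and unaugmented setups could be driven by a single flag. The following is a minimal sketch, not the code used in the notebook: the function names `load_rows` and `load_training_rows` and the `augmented` parameter are hypothetical, and the sketch makes no assumption about the column names in the CSV files.

```python
import csv
from pathlib import Path


def load_rows(csv_path):
    """Read a CSV file into a list of dicts, one dict per row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def load_training_rows(data_dir=".", augmented=False):
    """Return the training rows for the chosen setup (hypothetical helper).

    augmented=False: only original_data.csv is loaded.
    augmented=True:  original_data.csv followed by flipped_data.csv.
    """
    data_dir = Path(data_dir)
    rows = load_rows(data_dir / "original_data.csv")
    if augmented:
        # Data-augmented models also need the flipped examples.
        rows += load_rows(data_dir / "flipped_data.csv")
    return rows
```

Calling `load_training_rows(augmented=True)` would then give the combined data for a data-augmented model, while the default call gives the data for a regular fine-tuned model.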