This repository contains the data and source code of the experiments reported in the LREC 2022 paper "Named Entity Recognition in Estonian 19th Century Parish Court Records".
The goal is to experiment with automatic named entity recognition on historical Estonian texts, namely 19th-century Estonian parish court records that have been manually annotated for named entities. In the experiments, we (re)train a traditional machine learning NER approach as a baseline, and fine-tune different BERT-based transfer learning models for NER.
The folder `data` contains 19th-century Estonian parish court records.
These materials originate from a crowdsourcing project of the National Archives of Estonia, and have been manually annotated with named entities in the project "Possibilities of automatic analysis of historical texts by the example of 19th-century Estonian communal court minutes". The project is funded by the national programme "Estonian Language and Culture in the Digital Age 2019-2027".
Python 3.7+ is required. For detailed package requirements, see the file `conda_environment.yml` (the conda environment that was used in the experiments).
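Assuming conda is available, the environment can be recreated with `conda env create -f conda_environment.yml`; the name of the environment to activate afterwards is the one given inside the yml file.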
- `00_convert_crossval_json_to_conll_train_dev_test.py` -- Converts gold standard NER annotations (in json files) from the format used in Kristjan Poska's experiments to CoNLL NER annotations (in IOB2 format) and splits the data into train/dev/test sets. See comments in the header of the script for details.
    - Note #1: you only need to run this script if you want to make a new (different) data split or if you want to change the tokenization. Otherwise, you can use the existing split from `data` (`train`, `dev` and `test`);
    - Note #2: the script also outputs statistics of the corpus. For statistics of the last run, see the comment at the end of the script.
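For orientation, such CoNLL files can be read with a few lines of Python. This is a minimal sketch: it assumes whitespace-separated columns with the token first and the IOB2 tag last, and blank lines between sentences, which may differ from the exact layout produced by the script.

```python
def read_conll_iob2(path):
    """Read a CoNLL-style file into a list of sentences,
    each sentence being a list of (token, tag) pairs."""
    sentences, current = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:          # blank line marks a sentence boundary
                if current:
                    sentences.append(current)
                current = []
            else:
                fields = line.split()
                current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences

# e.g. sentences = read_conll_iob2('data/train')  # file name is an assumption
```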
- `01a_estnltk_ner_retraining_best_model.py` -- Retrains the best model from Kristjan Poska's experiments on the new data split.
- `01b_eval_estnltk_ner_best_model_on_dev_test.py` -- Evaluates the previous model on the `dev` and `test` sets.
- `02a_estnltk_ner_retraining_default_model_baseline.py` -- Trains a NER model with EstNLTK's default NER settings on the new data split.
- `02b_eval_estnltk_ner_default_model_on_dev_test.py` -- Evaluates the previous model on the `dev` and `test` sets.
    - Note: initially, we wanted to use the best model from Kristjan Poska's experiments as the baseline. However, after retraining and evaluating that model on the new data split (steps `01a` and `01b`), its performance turned out to be lower than previously measured, and lower than the performance of the retrained EstNLTK default NER model (steps `02a` and `02b`). So, we chose the retrained default NER model (steps `02a` and `02b`) as the new baseline. A sketch of applying a retrained tagger follows this list.
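Once retrained, a model can presumably be applied with EstNLTK's `NerTagger`. A minimal sketch, assuming EstNLTK 1.6+ and a `model_dir` parameter pointing at the retrained model; the path and example sentence below are placeholders, see the `01a`/`02a` scripts for the authors' exact setup.

```python
from estnltk import Text
from estnltk.taggers import NerTagger

# Placeholder path to a retrained model (e.g. output of step 02a).
tagger = NerTagger(model_dir='retrain_estnltk_ner/model_02a')

text = Text('Jaan Tamm elas Tartu vallas.')
text.tag_layer(['morph_analysis'])   # NerTagger needs morphological analysis
tagger.tag(text)

for entity in text.ner:
    print(entity.text, entity.nertag)
```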
- `03_train_and_eval_bert_model.py` -- Fine-tunes and evaluates a BERT-based NER model. First, it performs a grid search to find the best hyperparameter configuration for training. Then it fine-tunes the model with the best configuration for 10 epochs, keeps and saves the best model (based on the F1 score on the `dev` set), and finally evaluates the best model on the `test` set.
    - Assumes that the corresponding models have already been downloaded and unpacked into the local directories `EstBERT`, `WikiBert-et` and `est-roberta`. You can download the models from the following urls:
    - The directory name of the model to be trained should be given as a command line argument of the script, e.g. `python 03_train_and_eval_bert_model.py EstBERT` (see the loading sketch below).
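For orientation, loading one of these local model directories with the Hugging Face `transformers` library might look as follows. The IOB2 label set below (PER/LOC/ORG) is an assumption; the script's actual tagset, grid search and training loop are in `03_train_and_eval_bert_model.py`.

```python
import sys
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed IOB2 label set; the script's actual tagset may differ.
labels = ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC', 'B-ORG', 'I-ORG']

model_dir = sys.argv[1]  # e.g. 'EstBERT', as on the script's command line
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForTokenClassification.from_pretrained(
    model_dir,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```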
Results on the `test` set:

| model       | precision | recall | F1-score |
|-------------|-----------|--------|----------|
| Baseline    | 91.57     | 88.18  | 89.84    |
| EstBERT     | 89.74     | 91.15  | 90.44    |
| WikiBERT-et | 91.29     | 91.98  | 91.63    |
| Est-RoBERTa | 92.97     | 94.24  | 93.60    |
- `logs` -- excerpts of training and evaluation log files with the final results.
- `results` -- detailed evaluation results in json format, using evaluation metrics from the nervaluate package (a nervaluate usage sketch follows this list).
- `retrain_estnltk_ner` -- retrained EstNLTK NerTagger models from steps `01a` and `02a`.
- `bert_models` -- fine-tuned BERT models from step `03`. Because BERT models are large, they are not distributed with this repository. However, the best model from our experiments is available from https://huggingface.co/tartuNLP/est-roberta-hist-ner (a loading sketch follows this list).
    - How to use the fine-tuned model for text annotation: see `using_bert_ner_tagger.ipynb`.
- `error_inspection` -- contains code for inspecting errors of the best model. The notebook `find_estroberta_ner_errors_on_test_corpus.ipynb` annotates the `test` set with the best model (fine-tuned `est-roberta`) and shows all the differences between the gold standard annotations and the automatically added annotations (a span comparison sketch follows this list). Annotation differences are augmented with their textual contexts to ease manual inspection.
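The nervaluate scores in `results` can be reproduced programmatically. A minimal sketch using nervaluate's list loader and assuming its classic two-value return of `evaluate()`; the tag sequences and tag set below are illustrative, not the repository's actual data.

```python
from nervaluate import Evaluator

# Gold and predicted annotations as IOB2 tag sequences, one list per
# sentence (toy data, not from the actual corpus).
true = [['B-PER', 'I-PER', 'O', 'B-LOC']]
pred = [['B-PER', 'I-PER', 'O', 'O']]

evaluator = Evaluator(true, pred, tags=['PER', 'LOC'], loader='list')
results, results_by_tag = evaluator.evaluate()
print(results['strict'])  # nervaluate also reports 'exact', 'partial', 'ent_type'
```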
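The published best model can be loaded directly from the Hugging Face hub. A minimal sketch using the generic `transformers` token classification pipeline; for the authors' intended usage, see `using_bert_ner_tagger.ipynb`.

```python
from transformers import pipeline

# Model id as published with this repository; 'aggregation_strategy'
# merges word pieces into entity-level spans.
ner = pipeline('ner', model='tartuNLP/est-roberta-hist-ner',
               aggregation_strategy='simple')
print(ner('Jaan Tamm tuli Laiuse kogukonnakohtu ette.'))  # made-up example sentence
```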
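At its core, the error inspection compares gold and predicted entity spans. A minimal sketch of such a comparison, assuming entities are represented as (start, end, label) triples; the notebook's actual representation and context extraction are richer.

```python
def annotation_diff(gold, predicted):
    """Return entities missed and spuriously added by the tagger,
    given two collections of (start, end, label) triples."""
    gold, predicted = set(gold), set(predicted)
    return {
        'missed': sorted(gold - predicted),     # in gold, but not predicted
        'spurious': sorted(predicted - gold),   # predicted, but not in gold
    }

# e.g. annotation_diff([(0, 9, 'PER'), (15, 21, 'LOC')], [(0, 9, 'PER')])
```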
If you use this dataset or any of the models in your work, please cite us as follows:
@InProceedings{orasmaa-EtAl:2022:LREC,
author = {Orasmaa, Siim and Muischnek, Kadri and Poska, Kristjan and Edela, Anna},
title = {Named Entity Recognition in Estonian 19th Century Parish Court Records},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {5304--5313},
url = {https://aclanthology.org/2022.lrec-1.568}
}