Fine-tuning RoBERTa for Multilingual Named-Entity Recognition
First, create a Python environment. We will use pyenv, but other options will likely work too.
Use the following commands to (1) install a specific Python version, (2) create a new virtual environment, (3) activate that environment, and (4) install the Python dependencies.
pyenv install -v 3.10.8
pyenv virtualenv 3.10.8 finetune-transformer
pyenv activate finetune-transformer
pip install -r requirements.txt
Run a notebook headless:
pyenv activate finetune-transformer
jupyter nbconvert --ExecutePreprocessor.timeout=-1 --to notebook --inplace --execute original.ipynb
Execute a Python file:
pyenv activate finetune-transformer
nohup python 05_Compile_Dataset.py &
Liu et al. presented an improved BERT variant named RoBERTa. To improve BERT, the authors conducted a series of experiments investigating the impact of training data and training parameters on BERT's downstream performance. They found that pretraining BERT with a larger batch size and longer input sequences increases downstream performance. Furthermore, during pretraining they dropped the next-sentence prediction task and masked input tokens dynamically during masked language modeling, whereas in the original work Devlin et al. masked tokens statically before training the model.

Before an input sequence is fed into the model, it needs to be tokenized. The original BERT uses a character-level Byte-Pair Encoding (BPE) vocabulary: the input sequence is split into mixed pieces representing whole words or only characters. Compared to a word-level-only approach, this enables the representation of a more diverse dataset, which is especially beneficial when training on a multilingual dataset. However, with BPE the vocabulary size grows quickly. Radford et al. proposed an even more universal and more efficient approach: by splitting the input sequence at the byte level instead of using Unicode characters as the smallest unit, a universal vocabulary of a modest 50k units can represent any input text, whereas Unicode-based approaches typically end up with vocabularies of 10k-100k subword units.
The overall improved performance makes RoBERTa, in many cases, the obvious choice over the original BERT models. Also, the universal input encoding makes RoBERTa more convenient to use in a multilingual setting.
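As a quick illustration of the two tokenization schemes described above, the sketch below tokenizes the same sentence with RoBERTa's byte-level BPE vocabulary and BERT's WordPiece-style subword vocabulary. It is not part of the notebooks in this repository, and the English checkpoints roberta-base and bert-base-cased are chosen only for illustration.

```python
# Illustrative comparison (not from this repository's notebooks):
# RoBERTa's byte-level BPE vs. BERT's WordPiece-style subword tokenization.
from transformers import AutoTokenizer

roberta_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

text = "Tokenization handles accents and emojis like café 😀."

print(roberta_tokenizer.tokenize(text))  # byte-level BPE pieces, no unknown tokens
print(bert_tokenizer.tokenize(text))     # WordPiece pieces, unseen characters become [UNK]
```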
See:
- Auto Tokenizer: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer
- Auto Model for Token Classification: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForTokenClassification
- RoBERTa on Huggingface: https://huggingface.co/docs/transformers/model_doc/roberta
- XLM-RoBERTA-large on Huggingface: https://huggingface.co/xlm-roberta-large
- BERT in Huggingface: https://huggingface.co/docs/transformers/model_doc/bert
- BERT multilingual on Huggingface: https://huggingface.co/bert-base-multilingual-cased
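Putting the links above together, a minimal loading sketch might look like the following. The label list mirrors the WikiANN IOB2 tags; the variable names are illustrative and not taken from the notebooks.

```python
# Sketch: load tokenizer and token-classification head via the Auto classes linked above.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```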
The complete WikiANN dataset includes training examples for 282 languages and was constructed from Wikipedia. Training examples are extracted in an automated manner by exploiting entity mentions in Wikipedia articles, which are often formatted as hyperlinks to the corresponding article. The provided NER tags are in the IOB2 format, and named entities are classified as location (LOC), person (PER), or organization (ORG).
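A minimal sketch for loading one language split of WikiANN through the Hugging Face datasets library is shown below; the config name "de" is just an example, any supported language code works.

```python
# Sketch: load one WikiANN language config and inspect its IOB2 label names.
from datasets import load_dataset

wikiann_de = load_dataset("wikiann", "de")
print(wikiann_de)  # train/validation/test splits
print(wikiann_de["train"].features["ner_tags"].feature.names)
# IOB2 label names, e.g. O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC
```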
See:
- Add special tokens: huggingface/transformers#5232, huggingface/tokenizers#247 (a label-alignment sketch follows this list)
- Pan, X., Zhang, B., May, J., Nothman, J., Knight, K., & Ji, H. (2017). Cross-lingual Name Tagging and Linking for 282 Languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1946–1958). Association for Computational Linguistics.
- Rahimi, A., Li, Y., & Cohn, T. (2019). Massively Multilingual Transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 151–164). Association for Computational Linguistics.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
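The label-alignment sketch referenced above: because the tokenizer adds special tokens and splits words into several pieces, the word-level IOB2 labels have to be expanded to token level, with special tokens and trailing subwords set to the ignore index -100. The function below is a common pattern under these assumptions, not the exact code used in the notebooks.

```python
# Sketch: align word-level NER labels with subword tokens and special tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    aligned_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_id = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:                # special tokens (<s>, </s>)
                label_ids.append(-100)
            elif word_id != previous_word_id:  # first subword keeps the word's tag
                label_ids.append(word_labels[word_id])
            else:                              # remaining subwords are ignored
                label_ids.append(-100)
            previous_word_id = word_id
        aligned_labels.append(label_ids)
    tokenized["labels"] = aligned_labels
    return tokenized
```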