Multilingual NLP

Introduction

This project fine-tunes a multilingual transformer ("xlm-roberta-base") for named entity recognition (NER) in two ways: first with plain full fine-tuning, then with a parameter-efficient fine-tuning (PEFT) variant, specifically BitFit. The goal is to compare the effectiveness of the two approaches.

Roadmap

Model implementation:
- "xlm-roberta-base" as encoder/body (via AutoModel.from_pretrained)
- The required preprocessing
- Custom classification head trained with cross-entropy loss
- Reference: Token classification guide by Hugging Face
Training with AdamW:
- Full fine-tuning or BitFit
- Learning rate ~ 2e-5
- Weight decay ~ 0.05
- Training sets:
  - CoNLL ~ 10 epochs
  - WikiAnn ~ 5 epochs
Evaluation:
- Micro F1 on the last checkpoint
- Evaluation set(s):
  - Languages that are part of MasakhaNER
Reference for using multiple sets: PyTorch Lightning - Managing Data
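
The BitFit option from the roadmap can be sketched roughly as follows. This is a minimal illustration and not the code from this repo. It assumes that BitFit here means "train only the bias terms plus the classification head", and that the head's parameter names start with "classifier". The function name apply_bitfit and the parameter head_prefix are hypothetical. Stand-in objects replace real tensors so the snippet runs without downloading the model.

```python
from types import SimpleNamespace


def apply_bitfit(named_parameters, head_prefix="classifier"):
    """Leave only bias terms and the task head trainable; freeze the rest.

    `named_parameters` is any iterable of (name, parameter) pairs, e.g. the
    result of `model.named_parameters()` on a torch module.
    `head_prefix` is an assumed name for the classification head.
    """
    trainable = []
    for name, param in named_parameters:
        is_bias = name == "bias" or name.endswith(".bias")
        is_head = name.startswith(head_prefix)
        param.requires_grad = is_bias or is_head
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-ins for real parameters (assumption: names follow the usual
# "<module path>.weight" / "<module path>.bias" pattern).
demo_params = [
    ("encoder.layer.0.attention.self.query.weight", SimpleNamespace(requires_grad=True)),
    ("encoder.layer.0.attention.self.query.bias", SimpleNamespace(requires_grad=True)),
    ("classifier.weight", SimpleNamespace(requires_grad=True)),
    ("classifier.bias", SimpleNamespace(requires_grad=True)),
]

trainable_names = apply_bitfit(demo_params)
print(trainable_names)

# With a real torch model, the remaining trainable parameters would then be
# handed to the optimizer, using the values from the roadmap:
#   params = [p for p in model.parameters() if p.requires_grad]
#   optimizer = torch.optim.AdamW(params, lr=2e-5, weight_decay=0.05)
```

For full fine-tuning, the freezing step is simply skipped and all parameters go to the optimizer.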

Usage

Run the NERRun.py script. You might need to log in to your wandb account beforehand (e.g. via the terminal). You might also need a predefined cache structure at a given location (see NERDataModule.py), defined via an .env file at the project root (i.e. CACHE_DIR=...).
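
As a rough illustration of how a CACHE_DIR entry in an .env file can be picked up, here is a small standard-library sketch. It is not how NERDataModule.py actually does it (check that file for the real logic). The helper names read_env_file and resolve_cache_dir are hypothetical.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from an .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values  # no .env present, nothing to read
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_cache_dir(env_path=".env"):
    """Prefer a real environment variable, fall back to the .env file."""
    return os.environ.get("CACHE_DIR") or read_env_file(env_path).get("CACHE_DIR")


# Returns None if CACHE_DIR is set neither in the environment nor in .env.
cache_dir = resolve_cache_dir()
print(cache_dir)
```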

Issues

The project evaluates with micro F1, which heavily favors the outside (O) tag, and that is not really desirable. Anyone using this repo should consider macro F1 instead (or exclude the O tag from evaluation). In my case this was not really necessary, as the direct comparison is sufficient for a first impression of the effectiveness.
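
To make the effect concrete, here is a small toy sketch (not the repo's evaluation code). It assumes token-level scoring and uses simplified labels without the B-/I- prefixes. The function name token_f1 and the data are made up for illustration.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def token_f1(gold, pred, average="micro", ignore=()):
    """Token-level micro or macro F1, optionally ignoring some labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    labels = [l for l in sorted(set(gold) | set(pred)) if l not in ignore]
    if not labels:
        return 0.0
    if average == "micro":
        # Pool the counts over all labels, then compute a single F1.
        return f1(sum(tp[l] for l in labels),
                  sum(fp[l] for l in labels),
                  sum(fn[l] for l in labels))
    # Macro: compute F1 per label, then take the unweighted mean.
    return sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)


# Toy data: mostly O tokens, and one of the two entities is missed.
gold = ["O"] * 8 + ["PER", "LOC"]
pred = ["O"] * 8 + ["O", "LOC"]

print(round(token_f1(gold, pred, "micro"), 3))
print(round(token_f1(gold, pred, "micro", ignore=("O",)), 3))
print(round(token_f1(gold, pred, "macro"), 3))
print(round(token_f1(gold, pred, "macro", ignore=("O",)), 3))
```

On this toy data, micro F1 is 0.9 with the O tag included but only about 0.667 once O is excluded, even though half of the entities were missed. Macro F1 comes out at about 0.647 with O and 0.5 without it.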

The plot creation is also done in a messy copy-paste fashion, don't mind it :P
