This GitHub repository contains the code for an NER model that identifies place names in Reddit comments. The model is hosted on the Hugging Face Model Hub, allowing for easy use in Python. Training is monitored using DagsHub and MLflow.
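For example, the model can be loaded straight from the Hub with the `transformers` pipeline API; a minimal sketch, where `"user/reddit-ner-place-names"` is a placeholder model ID to be replaced with the actual Hub repository name:

```python
from transformers import pipeline

# "user/reddit-ner-place-names" is a placeholder model ID;
# substitute the actual repository name from the Hugging Face Hub.
ner = pipeline(
    "token-classification",
    model="user/reddit-ner-place-names",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

print(ner("I moved from Manchester to a small town outside Leeds last year."))
```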
To retrain the model locally using the WNUT_17 corpus:

```bash
python -m src.train --dataset "wnut_17"
```
Alternatively, train the model using the CoNLL 2003, CoNLLpp, or OntoNotes 5 corpora:

```bash
python -m src.train --dataset "conll2003"  # or "conllpp", "tner/ontonotes5"
```
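The `--dataset` flag presumably maps to a Hugging Face `datasets` hub ID; a minimal sketch of what that loading step looks like, assuming the training script resolves the flag with `datasets.load_dataset` (column names vary slightly between corpora):

```python
from datasets import load_dataset

# Any of the supported IDs can be passed here: "wnut_17",
# "conll2003", "conllpp", or "tner/ontonotes5".
dataset = load_dataset("wnut_17")

example = dataset["train"][0]
print(example["tokens"])    # token list for one sentence
print(example["ner_tags"])  # integer-encoded NER labels
```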
Note that `dvc repro` reproducibly builds this model and uploads it to Hugging Face, if I build future versions.
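In practice that is a single command from the repository root, with DVC installed:

```bash
# rebuild any stages whose dependencies have changed
dvc repro

# or reproduce a single stage by name
dvc repro train
```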
The `src` directory is laid out as follows:

```
src
├── common
│   └── utils.py          # utility functions
├── pl_data
│   ├── conll_dataset.py  # reader for CoNLL format
│   ├── datamodule.py     # generic datamodule
│   ├── jsonl_dataset.py  # reader for doccano JSONL format
│   └── test_dataset.py   # reader for testing dataset
├── pl_metric
│   └── seqeval_f1.py     # F1 metric
├── pl_module
│   └── ger_model.py      # model implementation
└── train.py              # training script
```
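The training script presumably wires these pieces together with PyTorch Lightning; a minimal sketch of that flow, assuming hypothetical class names and constructor arguments (`DataModule`, `GERModel`):

```python
import pytorch_lightning as pl

# Hypothetical imports and constructor arguments, inferred from the
# layout above; see src/train.py for the actual interfaces.
from src.pl_data.datamodule import DataModule
from src.pl_module.ger_model import GERModel

datamodule = DataModule(dataset="wnut_17")  # assumed signature
model = GERModel()

trainer = pl.Trainer(max_epochs=5)
trainer.fit(model, datamodule=datamodule)
```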
The DVC pipeline (`dvc.yaml`) defines a `train` stage and a frozen `upload` stage:

```yaml
stages:
  train:
    cmd: python -m src.train
    deps:
      - data/doccano_annotated.jsonl
      - src/train.py
    outs:
      - logs
    frozen: false
  upload:
    cmd: python -m src.train --upload=true
    deps:
      - data/doccano_annotated.jsonl
      - src/train.py
    frozen: true
```
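Because the `upload` stage is frozen, `dvc repro` skips it by default; it has to be unfrozen explicitly before a new model version is pushed:

```bash
# allow the upload stage to run, then reproduce it
dvc unfreeze upload
dvc repro upload

# re-freeze so routine repros don't trigger a re-upload
dvc freeze upload
```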
The resulting pipeline DAG:

```mermaid
flowchart TD
    node1["train"]
    node2["upload"]
```