Create training files for fine-tuning Hugging Face models for use with CKIP Transformers. This is part of an effort to create Hongkongese models using the same method.
Hugging Face provides examples for token classification. CKIP Transformers uses BI encoding to indicate word segmentation: `B` marks the first character of a word and `I` marks each subsequent character. For the sentence 點解 啊 ?, the corresponding line in the training file looks like `{"words": ["點", "解", "啊", "?"], "ner": ["B", "I", "B", "B"]}`
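As an illustration, here is a minimal sketch of that conversion; `to_bi_json` is a hypothetical helper, not part of this repo or of CKIP Transformers.

```python
import json

def to_bi_json(words):
    """Turn a list of segmented words into one BI-encoded JSON line.

    The first character of each word is tagged "B"; the rest are "I".
    """
    chars, tags = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            tags.append("B" if i == 0 else "I")
    return json.dumps({"words": chars, "ner": tags}, ensure_ascii=False)

print(to_bi_json(["點解", "啊", "?"]))
# {"words": ["點", "解", "啊", "?"], "ner": ["B", "I", "B", "B"]}
```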
Fine-tuning with this training data creates a model that can be loaded and used by CKIP Transformers (for non-BERT models, some code changes are needed to use different tokenizers).
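Loading a fine-tuned BERT-based model might look like the sketch below. This assumes the `model_name` parameter of `CkipWordSegmenter` accepts a local path as described in the CKIP Transformers documentation; `path/to/finetuned_model` is a placeholder.

```python
from ckip_transformers.nlp import CkipWordSegmenter

# Point model_name at the fine-tuned model directory
# (or a Hugging Face Hub model name).
ws_driver = CkipWordSegmenter(model_name="path/to/finetuned_model")

# The driver takes a list of raw sentences and returns a list of word lists.
words = ws_driver(["點解啊?"])
print(words)  # e.g. [["點解", "啊", "?"]]
```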
- Install PyCantonese to use the HKCanCor dataset (see the first sketch after this list)
- Download training data from the Second International Chinese Word Segmentation Bakeoff and place `cityu_training.utf8` and/or `as_training.utf8` in `/data`
- Run `finetune_hkcancor.py` and `finetune_cityu.py` (uncomment some lines if `as_training.utf8` is used). `finetune_hkcancor.json` and `finetune_cityu.json` will be created
- Merge and shuffle the created files if needed (see the second sketch after this list)
- Install Hugging Face Transformers from source
- Go to `transformers/examples/pytorch/token-classification/`
- Run fine-tuning with the training file:
python run_ner.py --model_name_or_path toastynews/electra-hongkongese-base-discriminator --train_file finetune_hkcancor.json --output_dir tn_electra_base_hkcancor --do_train
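As referenced in the first step, here is a sketch of pulling pre-segmented utterances out of HKCanCor with PyCantonese. It assumes PyCantonese's pylangacq-style reader API (`hkcancor()` and `words(by_utterances=True)`); feeding each utterance through a BI conversion like the one above yields the training lines.

```python
import pycantonese

# Load the HKCanCor corpus that ships with PyCantonese.
corpus = pycantonese.hkcancor()

# Each utterance is already segmented into words, which is
# exactly the input the BI conversion needs.
for words in corpus.words(by_utterances=True)[:3]:
    print(words)
```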
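And a sketch of the optional merge-and-shuffle step; the output filename `finetune_merged.json` is a placeholder.

```python
import random

# Collect the lines of both generated training files.
lines = []
for path in ["finetune_hkcancor.json", "finetune_cityu.json"]:
    with open(path, encoding="utf-8") as f:
        lines.extend(f.readlines())

# Shuffle so the two sources are interleaved, then write one file.
random.shuffle(lines)
with open("finetune_merged.json", "w", encoding="utf-8") as f:
    f.writelines(lines)
```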
The following software versions were used:
- pycantonese 3.4.0