Create training files for fine-tuning Hugging Face models for use with CKIP Transformers. This is part of an effort to create Hongkongese models using the same method.
Hugging Face provides examples for token classification. CKIP Transformers uses BI encoding to indicate word segmentation: `B` marks the first character of a word and `I` marks each subsequent character. For the sentence 點解 啊 ?, the corresponding line in the training file looks like `{"words": ["點", "解", "啊", "?"], "ner": ["B", "I", "B", "B"]}`
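As an illustration, here is a minimal sketch of that conversion; `to_bi_json` is a hypothetical helper, not part of this repo or of CKIP Transformers.

```python
import json

def to_bi_json(words):
    """Turn a list of segmented words into one BI-encoded JSON line.

    The first character of each word is tagged "B"; the rest are "I".
    """
    chars, tags = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            tags.append("B" if i == 0 else "I")
    return json.dumps({"words": chars, "ner": tags}, ensure_ascii=False)

print(to_bi_json(["點解", "啊", "?"]))
# {"words": ["點", "解", "啊", "?"], "ner": ["B", "I", "B", "B"]}
```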
Fine-tuning with this training data creates a model that can be loaded and used by CKIP Transformers (for non-BERT models, some code changes are needed to use different tokenizers).
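Loading a fine-tuned BERT-based model might look like the sketch below. This assumes the `model_name` parameter of `CkipWordSegmenter` accepts a local path as described in the CKIP Transformers documentation; `path/to/finetuned_model` is a placeholder.

```python
from ckip_transformers.nlp import CkipWordSegmenter

# Point model_name at the fine-tuned model directory
# (or a Hugging Face Hub model name).
ws_driver = CkipWordSegmenter(model_name="path/to/finetuned_model")

# The driver takes a list of raw sentences and returns a list of word lists.
words = ws_driver(["點解啊?"])
print(words)  # e.g. [["點解", "啊", "?"]]
```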
- Install PyCantonese to use the HKCanCor dataset (see the first sketch after this list)
- Download training data from the Second International Chinese Word Segmentation Bakeoff and place `cityu_training.utf8` and/or `as_training.utf8` in `/data`
- Run `finetune_hkcancor.py` and `finetune_cityu.py` (uncomment some lines if `as_training.utf8` is used). `finetune_hkcancor.json` and `finetune_cityu.json` will be created
- Merge and shuffle the created files if needed (see the second sketch after this list)
- Install Hugging Face Transformers from source
- Go to `transformers/examples/pytorch/token-classification/`
- Run fine-tuning with the training file:
python run_ner.py --model_name_or_path toastynews/electra-hongkongese-base-discriminator --train_file finetune_hkcancor.json --output_dir tn_electra_base_hkcancor --do_train
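As referenced in the first step, here is a sketch of pulling pre-segmented utterances out of HKCanCor with PyCantonese. It assumes PyCantonese's pylangacq-style reader API (`hkcancor()` and `words(by_utterances=True)`); feeding each utterance through a BI conversion like the one above yields the training lines.

```python
import pycantonese

# Load the HKCanCor corpus that ships with PyCantonese.
corpus = pycantonese.hkcancor()

# Each utterance is already segmented into words, which is
# exactly the input the BI conversion needs.
for words in corpus.words(by_utterances=True)[:3]:
    print(words)
```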
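And a sketch of the optional merge-and-shuffle step; the output filename `finetune_merged.json` is a placeholder.

```python
import random

# Collect the lines of both generated training files.
lines = []
for path in ["finetune_hkcancor.json", "finetune_cityu.json"]:
    with open(path, encoding="utf-8") as f:
        lines.extend(f.readlines())

# Shuffle so the two sources are interleaved, then write one file.
random.shuffle(lines)
with open("finetune_merged.json", "w", encoding="utf-8") as f:
    f.writelines(lines)
```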
The following software versions were used:
- pycantonese 3.4.0