CKIP Transformers is a set of models for Chinese NLP created by CKIP Lab. This repo contains information about a compatible word segmentation model for Hongkongese/Cantonese. It can be loaded and used directly by CKIP Transformers. The only difference between the original repo and this repo is this README file.
There are two versions of this model:
- HK - trained on only Hong Kong text (HKCanCor and CityU)
- HKT - trained on a combination of Hong Kong and Taiwan text (HKCanCor, CityU and AS)
The HK version is slightly better for Hongkongese, while the HKT version is slightly worse for Hongkongese but much better for Standard Chinese text. Base and Small sizes are provided for each version. The difference in performance is minor, so it is recommended to use the Small size unless you find something wrong with it. Unlike CKIP Transformers, the underlying model is ELECTRA rather than BERT; see electra-hongkongese. It uses a custom vocabulary that is better suited for Hong Kong text.
NLP Task Models
- ELECTRA HK Small — Word Segmentation:
toastynews/electra-hongkongese-small-hk-ws
- ELECTRA HKT Small — Word Segmentation:
toastynews/electra-hongkongese-small-hkt-ws
- ELECTRA HK Base — Word Segmentation:
toastynews/electra-hongkongese-base-hk-ws
- ELECTRA HKT Base — Word Segmentation:
toastynews/electra-hongkongese-base-hkt-ws
You may use our models directly through HuggingFace's transformers library.
pip install -U transformers
See the CKIP Transformers repo for complete instructions. The only difference is that these models do not use the bert-base-chinese tokenizer; simply load the tokenizer from the model with AutoTokenizer.
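For example, a minimal sketch of loading one of these models directly with transformers (assuming the WS checkpoint exposes a token-classification head, as the CKIP Transformers WS models do):

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer bundled with the model instead of bert-base-chinese
tokenizer = AutoTokenizer.from_pretrained("toastynews/electra-hongkongese-base-hkt-ws")
model = AutoModelForTokenClassification.from_pretrained("toastynews/electra-hongkongese-base-hkt-ws")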
Instructions for reproducing this model and the code to generate training files are in finetune-ckip-transformers.
The WS task is trained on the following datasets:
- HKCanCor - Hong Kong Cantonese Corpus from Nanyang Technological University.
- CityU - Hong Kong news text from City University of Hong Kong. Only the training set is used.
- AS - Variety of Taiwan Chinese text from Academia Sinica. Only the training set is used.
The package also provides the following NLP tool.
- (WS) Word Segmentation
pip install -U ckip-transformers
Requirements:
- Python 3.6+
- PyTorch 1.5+
- HuggingFace Transformers 3.5+
See the CKIP Transformers repo for full instructions. The model can be specified via model_name and will be downloaded automatically.
The following are the abridged instructions on how to use this model.
from ckip_transformers.nlp import CkipWordSegmenter
# Initialize driver
ws_driver = CkipWordSegmenter(model_name="toastynews/electra-hongkongese-base-hkt-ws")
# Input text
text = [
"威院ICU顧問醫生Tom Buckley作供時批評",
"兒子生性病母倍感安慰,獅子山下體現香港精神",
]
# Run pipeline
ws = ws_driver(text)
print(' '.join(ws[0]))
print(' '.join(ws[1]))
威院 ICU 顧問 醫生 Tom Buckley 作供 時 批評
兒子 生 性 病母 倍感 安慰 , 獅子山 下 體現 香港 精神
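The driver also accepts optional arguments for hardware and batching; a minimal sketch, assuming the device and batch_size parameters behave as in CKIP Transformers:

# Run on GPU 0 (device=-1 selects CPU) and segment in smaller batches
ws_driver = CkipWordSegmenter(model_name="toastynews/electra-hongkongese-base-hkt-ws", device=0)
ws = ws_driver(text, batch_size=128)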
The following is a performance comparison between these models and the original CKIP models:
- UD yue_hk - the yue_hk dataset from Universal Dependencies.
- UD zh_hk - the zh_hk dataset from Universal Dependencies.
- HKCanCor - the same HKCanCor data that these models were trained on. It is only reported for completeness.
- CityU - the test set from the same CityU corpus.
- AS - the test set from the same AS corpus.
| Tool | UD yue_hk | UD zh_hk | HKCanCor | CityU | AS |
|---|---|---|---|---|---|
| CKIP BERT Base | 89.41% | 92.70% | 83.81% | 91.95% | 98.06% |
| TN ELECTRA HK Base | 94.62% | 93.30% | 98.95% | 98.06% | 92.25% |
| TN ELECTRA HKT Base | 94.04% | 93.27% | 98.75% | 97.66% | 96.52% |
| CKIP BERT Tiny | 85.02% | 92.07% | 78.18% | 89.93% | 97.87% |
| TN ELECTRA HK Small | 94.68% | 92.77% | 97.69% | 97.50% | 91.87% |
| TN ELECTRA HKT Small | 93.89% | 93.14% | 98.07% | 97.12% | 96.44% |
Copyright (c) 2021 CKIP Lab under the GPL-3.0 License.