CKIP Transformers is a set of models for Chinese NLP created by CKIP Lab. This repo contains information about a compatible word segmentation model for Hongkongese/Cantonese. It can be loaded and used directly by CKIP Transformers. The only difference between the original repo and this repo is this README file.
There are two versions of this model:
- HK - trained on only Hong Kong text (HKCanCor and CityU)
- HKT - trained on a combination of Hong Kong and Taiwan text (HKCanCor, CityU and AS)
The HK version is slightly better for Hongkongese, while the HKT version is slightly worse for Hongkongese but much better for Standard Chinese text. Base and Small sizes are provided for each version. The difference in performance is minor, so it is recommended to use the Small size unless you find something wrong with it. Unlike CKIP Transformers, the underlying model is ELECTRA rather than BERT; see electra-hongkongese. It uses a custom vocabulary that is better suited for Hong Kong text.
NLP Task Models
- ELECTRA HK Small — Word Segmentation:
toastynews/electra-hongkongese-small-hk-ws
- ELECTRA HKT Small — Word Segmentation:
toastynews/electra-hongkongese-small-hkt-ws
- ELECTRA HK Base — Word Segmentation:
toastynews/electra-hongkongese-base-hk-ws
- ELECTRA HKT Base — Word Segmentation:
toastynews/electra-hongkongese-base-hkt-ws
You may use our models directly through HuggingFace's transformers library.
pip install -U transformers
See the CKIP Transformers repo for complete instructions. The only difference is that these models do not use the bert-base-chinese tokenizer; simply load the tokenizer from the model with AutoTokenizer.
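For example, a minimal sketch of loading one of these models directly with transformers (assuming the WS checkpoint exposes a token-classification head, as the CKIP Transformers WS models do):

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer bundled with the model instead of bert-base-chinese
tokenizer = AutoTokenizer.from_pretrained("toastynews/electra-hongkongese-base-hkt-ws")
model = AutoModelForTokenClassification.from_pretrained("toastynews/electra-hongkongese-base-hkt-ws")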
Instructions for reproducing this model and the code to generate training files are in finetune-ckip-transformers.
The WS task is trained on the following datasets:
- HKCanCor - Hong Kong Cantonese Corpus from Nanyang Technological University.
- CityU - Hong Kong news text from City University of Hong Kong. Only the training set is used.
- AS - Variety of Taiwan Chinese text from Academia Sinica. Only the training set is used.
The package also provides the following NLP tool.
- (WS) Word Segmentation
pip install -U ckip-transformers
Requirements:
- Python 3.6+
- PyTorch 1.5+
- HuggingFace Transformers 3.5+
See the CKIP Transformers repo for full instructions. The model can be specified via model_name and will be downloaded automatically.
The following are the abridged instructions on how to use this model.
from ckip_transformers.nlp import CkipWordSegmenter
# Initialize driver
ws_driver = CkipWordSegmenter(model_name="toastynews/electra-hongkongese-base-hkt-ws")
# Input text
text = [
"威院ICU顧問醫生Tom Buckley作供時批評",
"兒子生性病母倍感安慰,獅子山下體現香港精神",
]
# Run pipeline
ws = ws_driver(text)
print(' '.join(ws[0]))
print(' '.join(ws[1]))
威院 ICU 顧問 醫生 Tom Buckley 作供 時 批評
兒子 生 性 病母 倍感 安慰 , 獅子山 下 體現 香港 精神
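The driver also accepts optional arguments for hardware and batching; a minimal sketch, assuming the device and batch_size parameters behave as in CKIP Transformers:

# Run on GPU 0 (device=-1 selects CPU) and segment in smaller batches
ws_driver = CkipWordSegmenter(model_name="toastynews/electra-hongkongese-base-hkt-ws", device=0)
ws = ws_driver(text, batch_size=128)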
The following is a performance comparison between these models and the original CKIP models:
- UD yue_hk - the yue_hk dataset from Universal Dependencies.
- UD zh_hk - the zh_hk dataset from Universal Dependencies.
- HKCanCor - the same HKCanCor data that these models were trained on. It is only reported for completeness.
- CityU - the test set from the same CityU corpus.
- AS - the test set from the same AS corpus.
| Tool | UD yue_hk | UD zh_hk | HKCanCor | CityU | AS |
|---|---|---|---|---|---|
| CKIP BERT Base | 89.41% | 92.70% | 83.81% | 91.95% | 98.06% |
| TN ELECTRA HK Base | 94.62% | 93.30% | 98.95% | 98.06% | 92.25% |
| TN ELECTRA HKT Base | 94.04% | 93.27% | 98.75% | 97.66% | 96.52% |
| CKIP BERT Tiny | 85.02% | 92.07% | 78.18% | 89.93% | 97.87% |
| TN ELECTRA HK Small | 94.68% | 92.77% | 97.69% | 97.50% | 91.87% |
| TN ELECTRA HKT Small | 93.89% | 93.14% | 98.07% | 97.12% | 96.44% |
Copyright (c) 2021 CKIP Lab under the GPL-3.0 License.