
CKIP Transformers HK

CKIP Transformers is a set of models for Chinese NLP created by CKIP Lab. This repo describes a compatible word segmentation model for Hongkongese/Cantonese that can be loaded and used directly by CKIP Transformers. The only difference between the original repo and this repo is this README file.

Models

There are two versions of this model:

  • HK - trained on only Hong Kong text (HKCanCor and CityU)
  • HKT - trained on a combination of Hong Kong and Taiwan text (HKCanCor, CityU and AS)

The HK version is slightly better for Hongkongese, while the HKT version is slightly worse for Hongkongese but much better for Standard Chinese text. Base and Small sizes are provided for each version. The difference in performance is minor, so it is recommended to use the Small size unless you find something wrong with it. Unlike CKIP Transformers, the base model is ELECTRA rather than BERT; see electra-hongkongese. It uses a custom vocabulary that is better suited to Hong Kong text.

NLP Task Models
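
The word segmenter comes in two sizes (Base and Small) for each of the HK and HKT versions. Assuming the naming pattern of the model used in the examples below, the four variants are:

  • toastynews/electra-hongkongese-base-hk-ws
  • toastynews/electra-hongkongese-base-hkt-ws
  • toastynews/electra-hongkongese-small-hk-ws
  • toastynews/electra-hongkongese-small-hkt-ws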

Model Usage

You may use our models directly through HuggingFace's transformers library.

pip install -U transformers

See the CKIP Transformers repo for complete instructions. The only difference is that these models do not use the bert-base-chinese tokenizer; load the tokenizer from the model itself with AutoTokenizer.
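
As a minimal sketch, the model and its tokenizer can also be loaded directly with transformers (assuming the HKT Base model and that, as in CKIP Transformers, the WS model carries a token classification head):

# Hypothetical direct loading; the CkipWordSegmenter driver below handles this for you.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "toastynews/electra-hongkongese-base-hkt-ws"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # not bert-base-chinese
model = AutoModelForTokenClassification.from_pretrained(model_name)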

Model Fine-Tuning

Instructions for reproducing this model and code to generate the training files are in finetune-ckip-transformers.

Training Corpus

The WS task is trained on the following datasets:

  • HKCanCor - Hong Kong Cantonese Corpus from Nanyang Technological University.
  • CityU - Hong Kong news text from City University of Hong Kong. Only the training set is used.
  • AS - Variety of Taiwan Chinese text from Academia Sinica. Only the training set is used.

NLP Tools

The package also provides the following NLP tools.

  • (WS) Word Segmentation

Installation

pip install -U ckip-transformers

Requirements are listed in the CKIP Transformers repo.

NLP Tools Usage (abridged)

See the CKIP Transformers repo for full instructions. The model is specified with model_name and is downloaded automatically. The following are abridged instructions for using this model.

1. Import module

from ckip_transformers.nlp import CkipWordSegmenter

2. Load models

We currently only support the word segmenter from the NLP tools.
# Initialize drivers
ws_driver  = CkipWordSegmenter(model_name="toastynews/electra-hongkongese-base-hkt-ws")
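
To run on GPU, pass the device argument of the standard ckip-transformers drivers (a sketch; -1, the default, selects CPU):
# Initialize the driver on the first GPU
ws_driver = CkipWordSegmenter(
    model_name="toastynews/electra-hongkongese-base-hkt-ws",
    device=0,
)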

3. Run pipeline

The input for word segmentation must be a list of sentences.
# Input text
text = [
   "威院ICU顧問醫生Tom Buckley作供時批評",
   "兒子生性病母倍感安慰,獅子山下體現香港精神",
]

# Run pipeline
ws  = ws_driver(text)
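
Inference options such as batch size and maximum sequence length can be set at call time (a sketch assuming the standard ckip-transformers call options):
# Smaller batches use less memory; max_length bounds each input chunk
ws = ws_driver(text, batch_size=128, max_length=256)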

4. Show results

print(' '.join(ws[0]))
print(' '.join(ws[1]))
威院 ICU 顧問 醫生 Tom Buckley 作供 時 批評
兒子 生 性 病母 倍感 安慰 , 獅子山 下 體現 香港 精神
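
To keep each result aligned with its input, the two lists can simply be zipped:
# ws[i] is the token list for text[i]
for sentence, tokens in zip(text, ws):
    print(sentence)
    print(' '.join(tokens))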

NLP Tools Performance

The following is a performance comparison between these models and the original CKIP models, evaluated on these datasets:

  • UD yue_hk - the yue_hk dataset from Universal Dependencies.
  • UD zh_hk - the zh_hk dataset from Universal Dependencies.
  • HKCanCor - the same HKCanCor data that this model was trained on. It is only reported for completeness.
  • CityU - the test set from the same CityU corpus.
  • AS - the test set from the same AS corpus.

Word Segmentation Performance (F1)

Tool                  UD yue_hk  UD zh_hk  HKCanCor  CityU   AS
CKIP BERT Base        89.41%     92.70%    83.81%    91.95%  98.06%
TN ELECTRA HK Base    94.62%     93.30%    98.95%    98.06%  92.25%
TN ELECTRA HKT Base   94.04%     93.27%    98.75%    97.66%  96.52%
CKIP BERT Tiny        85.02%     92.07%    78.18%    89.93%  97.87%
TN ELECTRA HK Small   94.68%     92.77%    97.69%    97.50%  91.87%
TN ELECTRA HKT Small  93.89%     93.14%    98.07%    97.12%  96.44%
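
For reference, the following is a minimal sketch (not the official evaluation script) of how word segmentation F1 is typically computed: each word is treated as a character span, and predicted spans are compared against the gold standard.

def spans(tokens):
    """Convert a token list to a set of (start, end) character spans."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out

def ws_f1(gold, pred):
    """Micro-averaged word segmentation F1 over a corpus."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gs, ps = spans(g), spans(p)
        tp += len(gs & ps)
        fp += len(ps - gs)
        fn += len(gs - ps)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(ws_f1([["威院", "ICU", "顧問"]], [["威院", "ICU顧問"]]))  # 0.4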

License

GPL-3.0

Copyright (c) 2021 CKIP Lab under the GPL-3.0 License.