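This repo contains benchmark code and scores for Cantonese word segmentation and part-of-speech tagging. Scores are the metrics returned by the spaCy Scorer: pos_acc (part-of-speech tagging accuracy) and token_f, token_p and token_r (tokenization F-score, precision and recall).

The benchmark covers the following datasets: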
- UD Cantonese HK
- UD Chinese HK
- Hong Kong Cantonese Corpus as provided by PyCantonese

The following tools are compared:
- spaCy (zh_core_web_sm) - pkuseg trained on OntoNotes
- spaCy (zh_core_web_trf) - pkuseg trained on OntoNotes
- pkuseg
- cantoseg - jieba with normalized Cantonese dictionary
- jieba (big)
- PyCantonese
- CKIP Transformers

Results on UD Cantonese HK:

tool | pos_acc | token_f | token_p | token_r |
---|---|---|---|---|
spaCy sm | 0.60 | 0.72 | 0.74 | 0.69 |
spaCy trf | 0.71 | 0.72 | 0.74 | 0.69 |
pkuseg | 0.61 | 0.83 | 0.84 | 0.82 |
cantoseg | 0.37 | 0.86 | 0.87 | 0.85 |
jieba | 0.40 | 0.82 | 0.81 | 0.84 |
PyCantonese | 0.74 | 0.86 | 0.87 | 0.85 |
CKIP | 0.77 | 0.89 | 0.89 | 0.90 |

Results on UD Chinese HK:

tool | pos_acc | token_f | token_p | token_r |
---|---|---|---|---|
spaCy sm | 0.69 | 0.82 | 0.83 | 0.81 |
spaCy trf | 0.80 | 0.82 | 0.83 | 0.81 |
pkuseg | 0.71 | 0.92 | 0.93 | 0.90 |
cantoseg | 0.49 | 0.84 | 0.86 | 0.81 |
jieba | 0.47 | 0.84 | 0.87 | 0.82 |
PyCantonese | 0.65 | 0.84 | 0.85 | 0.83 |
CKIP | 0.81 | 0.93 | 0.93 | 0.92 |

Results on the Hong Kong Cantonese Corpus. PyCantonese was trained on this corpus, so this is not a fair test for it.

tool | pos_acc | token_f | token_p | token_r |
---|---|---|---|---|
spaCy sm | 0.51 | 0.64 | 0.68 | 0.60 |
spaCy trf | 0.61 | 0.64 | 0.68 | 0.60 |
pkuseg | 0.39 | 0.76 | 0.78 | 0.74 |
cantoseg | 0.38 | 0.90 | 0.93 | 0.87 |
jieba | 0.36 | 0.80 | 0.79 | 0.81 |
PyCantonese | 0.91 | 0.90 | 0.93 | 0.87 |
CKIP | 0.64 | 0.84 | 0.83 | 0.85 |
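
To make the token_p, token_r and token_f columns concrete, here is a minimal sketch of span-based tokenization scoring. It assumes a predicted word counts as correct only when its start and end character offsets both match a gold word, and that both segmentations cover the same text. The helper names `to_spans` and `token_prf` are hypothetical and are not part of this repo or of spaCy.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Turn a segmentation into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def token_prf(pred: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) for one segmented sentence."""
    pred_spans = to_spans(pred)
    gold_spans = to_spans(gold)
    true_pos = len(pred_spans & gold_spans)  # words with identical boundaries
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# The prediction splits the first gold word in two, so 2 of its 4 words
# match the gold segmentation, and 2 of the 3 gold words are recovered.
p, r, f = token_prf(pred=["我", "哋", "去", "食飯"], gold=["我哋", "去", "食飯"])
print(f"token_p={p:.2f} token_r={r:.2f} token_f={f:.2f}")
```

Over-segmentation lowers precision more than recall, while under-segmentation does the opposite, which is why the two columns can differ noticeably for the same tool.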

Download the UD datasets, convert them with spaCy's convert command using the default options, and place the resulting files inside ./data.
The jieba tests expect the dictionaries to be inside ./jieba.
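
As a rough illustration of how a segmenter could be scored against the converted data, the sketch below wraps any `str -> list of words` function and passes its output to the spaCy Scorer. It is not the repo's actual benchmark script: `score_segmenter` and the toy segmenter are hypothetical, and it assumes the segmenter preserves every character of a text that contains no whitespace.

```python
from typing import Callable, Dict, Iterable, List

from spacy.scorer import Scorer
from spacy.tokens import Doc
from spacy.training import Example
from spacy.vocab import Vocab


def score_segmenter(
    gold_docs: Iterable[Doc], segment: Callable[[str], List[str]]
) -> Dict[str, float]:
    """Score a word segmenter against gold-standard Docs."""
    examples = []
    for gold in gold_docs:
        words = segment(gold.text)
        # No whitespace between words, so the predicted text matches the gold text.
        predicted = Doc(gold.vocab, words=words, spaces=[False] * len(words))
        examples.append(Example(predicted, gold))
    return Scorer.score_tokenization(examples)


# Toy check with an in-memory gold Doc instead of files from ./data.
vocab = Vocab()
gold = Doc(vocab, words=["我哋", "去", "食飯"], spaces=[False, False, False])
# Hypothetical segmenter: splits the first three characters, keeps the rest whole.
scores = score_segmenter([gold], lambda text: list(text[:3]) + [text[3:]])
print(scores["token_p"], scores["token_r"], scores["token_f"])
```

With real data, the gold Docs would come from the converted files, for example via `DocBin().from_disk(path).get_docs(vocab)`, where `path` points at one of the files placed in ./data.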
The following software versions were used to produce the numbers above.
- spacy 3.3.1
- pkuseg library built from the Jun 7, 2022 commit, with the default_v2 dictionary
- cantoseg dictionary built from the Aug 22, 2020 commit
- jieba 0.42.1
- pycantonese 3.4.0
- ckip-transformers 0.3.2