
Benchmark for Cantonese word segmentation and POS tagging

Primary language: Python. License: MIT.

cantonese-nlp-benchmark

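This repo contains benchmark code and scores for Cantonese word segmentation and part-of-speech tagging. The scores are the metrics returned by the spaCy Scorer: pos_acc (part-of-speech tagging accuracy) and token_f, token_p and token_r (tokenization F-score, precision and recall).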
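To make the tokenization metrics concrete, here is a minimal pure-Python sketch of span-based precision, recall and F-score for a single sentence. It is an illustration under assumptions, not the spaCy Scorer itself: the helper names `char_spans` and `token_prf` and the example sentence are made up for this sketch, and the real scorer accumulates counts over a whole corpus rather than one sentence.

```python
def char_spans(tokens):
    """Map a list of tokens to the set of (start, end) character offsets."""
    spans, start = set(), 0
    for tok in tokens:
        end = start + len(tok)
        spans.add((start, end))
        start = end
    return spans


def token_prf(gold_tokens, pred_tokens):
    """Return (precision, recall, F-score) for one segmented sentence.

    A predicted token counts as correct only if a gold token covers
    exactly the same character span.
    """
    gold, pred = char_spans(gold_tokens), char_spans(pred_tokens)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical example: the segmenter wrongly splits the first word.
gold = ["我哋", "去", "食飯"]
pred = ["我", "哋", "去", "食飯"]
p, r, f = token_prf(gold, pred)
print(f"token_p={p:.2f} token_r={r:.2f} token_f={f:.2f}")
# prints: token_p=0.50 token_r=0.67 token_f=0.57
```

Two of the four predicted tokens match a gold span, so precision is 2/4, and two of the three gold tokens are recovered, so recall is 2/3.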

Benchmark Datasets

NLP Tools Compared

Scores

UD Cantonese HK

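| tool | pos_acc | token_f | token_p | token_r |
| --- | --- | --- | --- | --- |
| spaCy sm | 0.60 | 0.72 | 0.74 | 0.69 |
| spaCy trf | 0.71 | 0.72 | 0.74 | 0.69 |
| pkuseg | 0.61 | 0.83 | 0.84 | 0.82 |
| cantoseg | 0.37 | 0.86 | 0.87 | 0.85 |
| jieba | 0.40 | 0.82 | 0.81 | 0.84 |
| PyCantonese | 0.74 | 0.86 | 0.87 | 0.85 |
| CKIP | 0.77 | 0.89 | 0.89 | 0.90 |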

UD Chinese HK

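| tool | pos_acc | token_f | token_p | token_r |
| --- | --- | --- | --- | --- |
| spaCy sm | 0.69 | 0.82 | 0.83 | 0.81 |
| spaCy trf | 0.80 | 0.82 | 0.83 | 0.81 |
| pkuseg | 0.71 | 0.92 | 0.93 | 0.90 |
| cantoseg | 0.49 | 0.84 | 0.86 | 0.81 |
| jieba | 0.47 | 0.84 | 0.87 | 0.82 |
| PyCantonese | 0.65 | 0.84 | 0.85 | 0.83 |
| CKIP | 0.81 | 0.93 | 0.93 | 0.92 |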

Hong Kong Cantonese Corpus

PyCantonese was trained on this corpus, so its scores here are not a fair comparison with the other tools.

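| tool | pos_acc | token_f | token_p | token_r |
| --- | --- | --- | --- | --- |
| spaCy sm | 0.51 | 0.64 | 0.68 | 0.60 |
| spaCy trf | 0.61 | 0.64 | 0.68 | 0.60 |
| pkuseg | 0.39 | 0.76 | 0.78 | 0.74 |
| cantoseg | 0.38 | 0.90 | 0.93 | 0.87 |
| jieba | 0.36 | 0.80 | 0.79 | 0.81 |
| PyCantonese | 0.91 | 0.90 | 0.93 | 0.87 |
| CKIP | 0.64 | 0.84 | 0.83 | 0.85 |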

Reproduce

Download the UD datasets, convert them with the spacy convert command using its default options, and place the resulting files inside ./data.

The jieba tests assume that the jieba dictionaries are inside ./jieba.
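As a rough sketch of what scoring a segmenter against the converted data might look like, the snippet below wraps a segmentation function and passes the results to the spaCy Scorer. It is not the benchmark code from this repo: the function names, the character-level baseline and the file name `./data/test.spacy` are assumptions made for illustration, and the segmenter is assumed to return tokens that concatenate back to the original text.

```python
from spacy.scorer import Scorer
from spacy.tokens import Doc, DocBin
from spacy.training import Example
from spacy.vocab import Vocab


def load_gold_docs(path, vocab):
    """Read gold-standard Docs from a .spacy file produced by `spacy convert`."""
    return list(DocBin().from_disk(path).get_docs(vocab))


def score_segmenter(gold_docs, segment):
    """Score a segmenter (a function from str to list of str) on gold Docs.

    Assumes the returned tokens concatenate back to the gold text.
    """
    examples = []
    for gold in gold_docs:
        words = segment(gold.text)
        # Build the predicted Doc with no whitespace between tokens.
        pred = Doc(gold.vocab, words=words, spaces=[False] * len(words))
        examples.append(Example(pred, gold))
    # Returns a dict with token_acc, token_p, token_r and token_f.
    return Scorer.score_tokenization(examples)


def char_segmenter(text):
    """Hypothetical baseline: treat every character as its own token."""
    return list(text)


# Usage with a hypothetical file name:
# gold_docs = load_gold_docs("./data/test.spacy", Vocab())
# print(score_segmenter(gold_docs, char_segmenter))
```

Any of the compared tools could be plugged in by wrapping its segmentation call in a function with the same shape as `char_segmenter`.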

Versions

The following software versions were used to produce the numbers above.