ckip-transformers: A Python repository from Shih-Yu-Yeh

CKIP Transformers

This project provides Traditional Chinese transformers models (including ALBERT, BERT, and GPT2) and NLP tools (including word segmentation, part-of-speech tagging, and named entity recognition).

這個專案提供了繁體中文的 transformers 模型（包含 ALBERT、BERT、GPT2）及自然語言處理工具（包含斷詞、詞性標記、實體辨識）。

Contributors

Mu Yang at CKIP (Author & Maintainer).
Wei-Yun Ma at CKIP (Maintainer).

Related Packages

CkipTagger: An alternative Chinese NLP library based on BiLSTM.
CKIP CoreNLP Toolkit: A Chinese NLP library with more NLP tasks and utilities.

Models

You may also use our pretrained models directly with the HuggingFace transformers library: https://huggingface.co/ckiplab/.

您可於 https://huggingface.co/ckiplab/ 下載預訓練的模型。

Language Models
- ALBERT Tiny: ckiplab/albert-tiny-chinese
- ALBERT Base: ckiplab/albert-base-chinese
- BERT Tiny: ckiplab/bert-tiny-chinese
- BERT Base: ckiplab/bert-base-chinese
- GPT2 Base: ckiplab/gpt2-base-chinese

NLP Task Models
- ALBERT Tiny — Word Segmentation: ckiplab/albert-tiny-chinese-ws
- ALBERT Tiny — Part-of-Speech Tagging: ckiplab/albert-tiny-chinese-pos
- ALBERT Tiny — Named-Entity Recognition: ckiplab/albert-tiny-chinese-ner
- ALBERT Base — Word Segmentation: ckiplab/albert-base-chinese-ws
- ALBERT Base — Part-of-Speech Tagging: ckiplab/albert-base-chinese-pos
- ALBERT Base — Named-Entity Recognition: ckiplab/albert-base-chinese-ner
- BERT Tiny — Word Segmentation: ckiplab/bert-tiny-chinese-ws
- BERT Tiny — Part-of-Speech Tagging: ckiplab/bert-tiny-chinese-pos
- BERT Tiny — Named-Entity Recognition: ckiplab/bert-tiny-chinese-ner
- BERT Base — Word Segmentation: ckiplab/bert-base-chinese-ws
- BERT Base — Part-of-Speech Tagging: ckiplab/bert-base-chinese-pos
- BERT Base — Named-Entity Recognition: ckiplab/bert-base-chinese-ner

Model Usage

You may use our models directly through HuggingFace's transformers library.

您可直接透過 HuggingFace's transformers 套件使用我們的模型。

pip install -U transformers

Please use BertTokenizerFast as the tokenizer, and replace ckiplab/albert-tiny-chinese and ckiplab/albert-tiny-chinese-ws with any model you need in the following example.

請使用內建的 BertTokenizerFast，並將以下範例中的 ckiplab/albert-tiny-chinese 與 ckiplab/albert-tiny-chinese-ws 替換成任何您要使用的模型名稱。

from transformers import (
   BertTokenizerFast,
   AutoModelForMaskedLM,
   AutoModelForCausalLM,
   AutoModelForTokenClassification,
)

# masked language model (ALBERT, BERT)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese') # or other models above

# causal language model (GPT2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese') # or other models above

# nlp task model
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws') # or other models above

Model Fine-Tuning

To fine-tune our models on your own datasets, please refer to the following examples from HuggingFace's transformers.

您可參考以下的範例去微調我們的模型於您自己的資料集。

Remember to set --tokenizer_name bert-base-chinese so that the Chinese tokenizer is used.

記得設置 --tokenizer_name bert-base-chinese 以正確的使用中文的 tokenizer。

# Replace the model name with any of the language models above.
python run_mlm.py \
   --model_name_or_path ckiplab/albert-tiny-chinese \
   --tokenizer_name bert-base-chinese \
   ...

# Replace the model name with any of the NLP task models above.
python run_ner.py \
   --model_name_or_path ckiplab/albert-tiny-chinese-ws \
   --tokenizer_name bert-base-chinese \
   ...

Model Performance

The following is a performance comparison between our models and other models.

All tasks are evaluated on a Traditional Chinese test set.

以下是我們的模型與其他的模型之性能比較。

各個任務皆測試於繁體中文的測試集。

Model	#Parameters	Perplexity†	WS (F1)‡	POS (ACC)‡	NER (F1)‡
ckiplab/albert-tiny-chinese	4M	4.80	96.66%	94.48%	71.17%
ckiplab/albert-base-chinese	11M	2.65	97.33%	95.30%	79.47%
ckiplab/bert-tiny-chinese	12M	8.07	96.98%	95.11%	74.21%
ckiplab/bert-base-chinese	102M	1.88	97.60%	95.67%	81.18%
ckiplab/gpt2-base-chinese	102M	8.36	--	--	--

voidful/albert_chinese_tiny	4M	74.93	--	--	--
voidful/albert_chinese_base	11M	22.34	--	--	--
bert-base-chinese	102M	2.53	--	--	--

† Perplexity; the smaller the better.

† 混淆度；數字越小越好。

‡ WS: word segmentation; POS: part-of-speech; NER: named-entity recognition; the larger the better.

‡ WS: 斷詞；POS: 詞性標記；NER: 實體辨識；數字越大越好。
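For readers unfamiliar with the perplexity column, the following is a minimal sketch of the standard definition: the exponential of the average negative log-likelihood per token. The function name and the toy probabilities are illustrative assumptions; the README does not state how the reported figures were computed (for masked language models, for instance, a pseudo-perplexity over masked positions is a common choice), so this is not necessarily the authors' evaluation procedure.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative natural-log probability per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical example: a model that assigns probability 0.5 to each of
# four tokens is, on average, as uncertain as a fair coin flip per token.
toy_log_probs = [math.log(0.5)] * 4
toy_ppl = perplexity(toy_log_probs)
print(f"perplexity = {toy_ppl:.2f}")
```

A perplexity close to 1 means the model assigns nearly all probability mass to the correct tokens, which is why smaller values are better.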

Training Corpus

The language models are trained on the ZhWiki and CNA datasets; the WS and POS tasks are trained on the ASBC dataset; the NER tasks are trained on the OntoNotes dataset.

以上的語言模型訓練於 ZhWiki 與 CNA 資料集上；斷詞（WS）與詞性標記（POS）任務模型訓練於 ASBC 資料集上；實體辨識（NER）任務模型訓練於 OntoNotes 資料集上。

ZhWiki: https://dumps.wikimedia.org/zhwiki/

Chinese Wikipedia text (20200801 dump), translated to Traditional using OpenCC.

中文維基的文章（20200801 版本），利用 OpenCC 翻譯成繁體中文。
CNA: https://catalog.ldc.upenn.edu/LDC2011T13

Chinese Gigaword Fifth Edition, CNA (Central News Agency) part.

中文 Gigaword 第五版，CNA（中央社）的部分。
ASBC: http://asbc.iis.sinica.edu.tw

Academia Sinica Balanced Corpus of Modern Chinese release 4.0.

中央研究院漢語平衡語料庫第四版。
OntoNotes: https://catalog.ldc.upenn.edu/LDC2013T19

OntoNotes release 5.0, Chinese part, translated to Traditional using OpenCC.

OntoNotes 第五版，中文部分，利用 OpenCC 翻譯成繁體中文。

Here is a summary of each corpus.

以下是各個資料集的一覽表。

Dataset	#Documents	#Lines	#Characters	Line Type
CNA	2,559,520	13,532,445	1,219,029,974	Paragraph
ZhWiki	1,106,783	5,918,975	495,446,829	Paragraph
ASBC	19,247	1,395,949	17,572,374	Clause
OntoNotes	1,911	48,067	1,568,491	Sentence

Here is the dataset split used for language models.

以下是用於訓練語言模型的資料集切割。

CNA+ZhWiki	#Documents	#Lines	#Characters
Train	3,606,303	18,986,238	4,347,517,682
Dev	30,000	148,077	32,888,978
Test	30,000	151,241	35,216,818

Here is the dataset split used for word segmentation and part-of-speech tagging models.

以下是用於訓練斷詞及詞性標記模型的資料集切割。

ASBC	#Documents	#Lines	#Words	#Characters
Train	15,247	1,183,260	9,480,899	14,724,250
Dev	2,000	52,677	448,964	741,323
Test	2,000	160,012	1,315,129	2,106,799

Here is the dataset split used for named entity recognition models.

以下是用於訓練實體辨識模型的資料集切割。

OntoNotes	#Documents	#Lines	#Characters	#Named-Entities
Train	1,511	43,362	1,367,658	68,947
Dev	200	2,304	93,535	7,186
Test	200	2,401	107,298	6,977

NLP Tools

The package also provides the following NLP tools.

我們的套件也提供了以下的自然語言處理工具。

(WS) Word Segmentation 斷詞
(POS) Part-of-Speech Tagging 詞性標記
(NER) Named Entity Recognition 實體辨識

Installation

pip install -U ckip-transformers

Requirements:

NLP Tools Usage

See here for API details.

詳細的 API 請參見此處。

The complete script for this example is available at https://github.com/ckiplab/ckip-transformers/blob/master/example/example.py.

以下的範例的完整檔案可參見 https://github.com/ckiplab/ckip-transformers/blob/master/example/example.py 。

1. Import module

from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker

2. Load models

We provide several pretrained models for the NLP tools.

我們提供了一些適用於自然語言工具的預訓練的模型。

# Initialize drivers
ws_driver  = CkipWordSegmenter(model="bert-base")
pos_driver = CkipPosTagger(model="bert-base")
ner_driver = CkipNerChunker(model="bert-base")

One may also load their own checkpoints using our drivers.

也可以運用我們的工具於自己訓練的模型上。

# Initialize drivers with custom checkpoints
ws_driver  = CkipWordSegmenter(model_name="path_to_your_model")
pos_driver = CkipPosTagger(model_name="path_to_your_model")
ner_driver = CkipNerChunker(model_name="path_to_your_model")

To use a GPU, specify the device ID when initializing the drivers. Set it to -1 (the default) to disable the GPU.

可於宣告斷詞等工具時指定 device 以使用 GPU，設為 -1 （預設值）代表不使用 GPU。

# Use CPU
ws_driver = CkipWordSegmenter(device=-1)

# Use GPU:0
ws_driver = CkipWordSegmenter(device=0)

3. Run pipeline

The input for word segmentation and named-entity recognition must be a list of sentences.

The input for part-of-speech tagging must be a list of lists of words (the output of word segmentation).

斷詞與實體辨識的輸入必須是 list of sentences。

詞性標記的輸入必須是 list of list of words。

# Input text
text = [
   "傅達仁今將執行安樂死，卻突然爆出自己20年前遭緯來體育台封殺，他不懂自己哪裡得罪到電視台。",
   "美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會，預料她將會很順利通過參議院支持，成為該國有史以來第一位的華裔女性內閣成員。",
   "空白 也是可以的～",
]

# Run pipeline
ws  = ws_driver(text)
pos = pos_driver(ws)
ner = ner_driver(text)

The POS driver automatically splits each sentence internally at the characters '，,。：:；;！!？?' before running the model. (The output pieces are concatenated back into one sentence.) You may set delim_set to any characters you want.

You may set use_delim=False to disable this feature, or set use_delim=True in WS and NER driver to enable this feature.

詞性標記工具會自動用 '，,。：:；;！!？?' 等字元在執行模型前切割句子（輸出的句子會自動接回）。可設定 delim_set 參數使用別的字元做切割。

另外可指定 use_delim=False 以停用此功能，或於斷詞、實體辨識時指定 use_delim=True 以啟用此功能。

# Enable sentence segmentation
ws  = ws_driver(text, use_delim=True)
ner = ner_driver(text, use_delim=True)

# Disable sentence segmentation
pos = pos_driver(ws, use_delim=False)

# Use new line characters and tabs for sentence segmentation
pos = pos_driver(ws, delim_set='\n\t')
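To make the delimiter behaviour described above more concrete, here is a rough sketch of splitting a segmented sentence at delimiter characters and joining the pieces back afterwards. It is only an illustration under assumptions: the helper names `split_on_delims` and `join_chunks` are hypothetical, keeping each delimiter at the end of the preceding chunk is a guess, and the package's internal implementation may differ.

```python
from itertools import chain

# Default delimiter characters quoted in the text above.
DEFAULT_DELIMS = "，,。：:；;！!？?"


def split_on_delims(words, delim_set=DEFAULT_DELIMS):
    """Split a list of words into chunks, closing a chunk after each delimiter."""
    delims = set(delim_set)
    chunks, current = [], []
    for word in words:
        current.append(word)
        if word in delims:
            chunks.append(current)
            current = []
    if current:  # trailing words with no closing delimiter
        chunks.append(current)
    return chunks


def join_chunks(chunks):
    """Concatenate per-chunk results back into one sentence-level list."""
    return list(chain.from_iterable(chunks))


words = ["他", "不", "懂", "，", "也", "不", "問", "。"]
chunks = split_on_delims(words)
restored = join_chunks(chunks)
print(chunks)
print(restored == words)
```

Shorter chunks keep each model input well under the maximum sequence length, and because the chunks are joined back in order, the caller still receives one result per input sentence.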

You may specify batch_size and max_length to make better use of your machine's resources.

您亦可設置 batch_size 與 max_length 以更完美的利用您的機器資源。

# Sets the batch size and maximum sentence length
ws = ws_driver(text, batch_size=256, max_length=128)

4. Show results

# Pack word segmentation and part-of-speech results
def pack_ws_pos_sentence(sentence_ws, sentence_pos):
   assert len(sentence_ws) == len(sentence_pos)
   res = []
   for word_ws, word_pos in zip(sentence_ws, sentence_pos):
      res.append(f"{word_ws}({word_pos})")
   return "\u3000".join(res)

# Show results
for sentence, sentence_ws, sentence_pos, sentence_ner in zip(text, ws, pos, ner):
   print(sentence)
   print(pack_ws_pos_sentence(sentence_ws, sentence_pos))
   for entity in sentence_ner:
      print(entity)
   print()

傅達仁今將執行安樂死，卻突然爆出自己20年前遭緯來體育台封殺，他不懂自己哪裡得罪到電視台。
傅達仁(Nb) 今(Nd) 將(D) 執行(VC) 安樂死(Na) ，(COMMACATEGORY) 卻(D) 突然(D) 爆出(VJ) 自己(Nh) 20(Neu) 年(Nd) 前(Ng) 遭(P) 緯來(Nb) 體育台(Na) 封殺(VC) ，(COMMACATEGORY) 他(Nh) 不(D) 懂(VK) 自己(Nh) 哪裡(Ncd) 得罪到(VC) 電視台(Nc) 。(PERIODCATEGORY)
NerToken(word='傅達仁', ner='PERSON', idx=(0, 3))
NerToken(word='20年', ner='DATE', idx=(18, 21))
NerToken(word='緯來體育台', ner='ORG', idx=(23, 28))

美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會，預料她將會很順利通過參議院支持，成為該國有史以來第一位的華裔女性內閣成員。
美國(Nc) 參議院(Nc) 針對(P) 今天(Nd) 總統(Na) 布什(Nb) 所(D) 提名(VC) 的(DE) 勞工部長(Na) 趙小蘭(Nb) 展開(VC) 認可(VC) 聽證會(Na) ，(COMMACATEGORY) 預料(VE) 她(Nh) 將(D) 會(D) 很(Dfa) 順利(VH) 通過(VC) 參議院(Nc) 支持(VC) ，(COMMACATEGORY) 成為(VG) 該(Nes) 國(Nc) 有史以來(D) 第一(Neu) 位(Nf) 的(DE) 華裔(Na) 女性(Na) 內閣(Na) 成員(Na) 。(PERIODCATEGORY)
NerToken(word='美國參議院', ner='ORG', idx=(0, 5))
NerToken(word='今天', ner='LOC', idx=(7, 9))
NerToken(word='布什', ner='PERSON', idx=(11, 13))
NerToken(word='勞工部長', ner='ORG', idx=(17, 21))
NerToken(word='趙小蘭', ner='PERSON', idx=(21, 24))
NerToken(word='認可聽證會', ner='EVENT', idx=(26, 31))
NerToken(word='參議院', ner='ORG', idx=(42, 45))
NerToken(word='第一', ner='ORDINAL', idx=(56, 58))
NerToken(word='華裔', ner='NORP', idx=(60, 62))

空白 也是可以的～
空白(VH)  (WHITESPACE) 也(D) 是(SHI) 可以(VH) 的(T) ～(FW)
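The printed NerToken entries carry a word, an entity type, and an idx pair. Judging from the output above, idx looks like a half-open character range into the input sentence. The sketch below checks that reading and groups entities by type. The `NerToken` class here is a local stand-in that mirrors the printed fields rather than an import from the package, and `group_by_type` is a hypothetical helper.

```python
from typing import NamedTuple, Tuple


class NerToken(NamedTuple):
    """Local stand-in mirroring the fields shown in the printed output."""
    word: str
    ner: str
    idx: Tuple[int, int]


sentence = "傅達仁今將執行安樂死，卻突然爆出自己20年前遭緯來體育台封殺，他不懂自己哪裡得罪到電視台。"
entities = [
    NerToken(word="傅達仁", ner="PERSON", idx=(0, 3)),
    NerToken(word="20年", ner="DATE", idx=(18, 21)),
    NerToken(word="緯來體育台", ner="ORG", idx=(23, 28)),
]


def group_by_type(sentence, entities):
    """Group entity words by type, checking idx against the sentence text."""
    grouped = {}
    for entity in entities:
        start, end = entity.idx
        # Assumption: idx is a (start, end) character range, end exclusive.
        if sentence[start:end] != entity.word:
            raise ValueError(f"idx {entity.idx} does not match {entity.word!r}")
        grouped.setdefault(entity.ner, []).append(entity.word)
    return grouped


by_type = group_by_type(sentence, entities)
print(by_type)
```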

NLP Tools Performance

The following is a performance comparison between our tool and other tools.

以下是我們的工具與其他的工具之性能比較。

CKIP Transformers vs. Monpa & Jieba

Tool	WS (F1)	POS (Acc)	WS+POS (F1)	NER (F1)
CKIP BERT Base	97.60%	95.67%	94.19%	81.18%
CKIP ALBERT Base	97.33%	95.30%	93.52%	79.47%
CKIP BERT Tiny	96.98%	95.08%	93.13%	74.20%
CKIP ALBERT Tiny	96.66%	94.48%	92.25%	71.17%

Monpa†	92.58%	--	83.88%	--
Jieba	81.18%	--	--	--

† Monpa provides only 3 types of tags in NER.

† Monpa 的實體辨識僅提供三種標記而已。
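As a reading aid for the WS (F1) and WS+POS (F1) columns, the sketch below shows the common span-based way of scoring segmentation: a predicted word counts as correct only if its character span matches a gold span exactly and, for WS+POS, if its tag matches too. The README does not describe the scoring script actually used, so treat this as an assumed definition. The helper names and the deliberately wrong prediction are hypothetical.

```python
def to_spans(words, tags=None):
    """Convert a segmented sentence into a set of character spans.

    Each span is (start, end), or (start, end, tag) when tags are given.
    """
    spans, start = set(), 0
    for i, word in enumerate(words):
        end = start + len(word)
        spans.add((start, end) if tags is None else (start, end, tags[i]))
        start = end
    return spans


def f1_score(gold, pred):
    """Harmonic mean of precision and recall over exactly matching spans."""
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold_words = ["傅達仁", "今", "將", "執行", "安樂死"]
gold_tags = ["Nb", "Nd", "D", "VC", "Na"]

# Hypothetical prediction with one over-split word and one wrong tag.
pred_words = ["傅達仁", "今", "將", "執行", "安樂", "死"]
pred_tags = ["Nb", "Nd", "D", "VA", "Na", "Na"]

ws_f1 = f1_score(to_spans(gold_words), to_spans(pred_words))
ws_pos_f1 = f1_score(to_spans(gold_words, gold_tags), to_spans(pred_words, pred_tags))
print(f"WS F1 = {ws_f1:.4f}, WS+POS F1 = {ws_pos_f1:.4f}")
```

Under this definition the joint WS+POS score can never exceed the WS score, which matches the ordering of the two columns in the table.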

CKIP Transformers vs. CkipTagger

The following results were obtained on a different dataset.†

以下實驗在另一個資料集測試。†

Tool	WS (F1)	POS (Acc)	WS+POS (F1)	NER (F1)
CKIP BERT Base	97.84%	96.46%	94.91%	79.20%
CkipTagger	97.33%	97.20%	94.75%	77.87%

† Here we retrained and tested our BERT model on the same dataset as CkipTagger.

† 我們重新訓練／測試我們的 BERT 模型於跟 CkipTagger 相同的資料集。
