Also: Chinese README (中文 README)
Peng-Hsuan Li@CKIP (author/maintainer)
Wei-Yun Ma@CKIP (maintainer)
This open-source library implements neural CKIP-style Chinese NLP tools.
- (WS) word segmentation
- (POS) part-of-speech tagging
- (NER) named entity recognition
Related demo sites
- Performance improvements
- Do not auto delete/change/add characters
- Support indefinitely long sentences
- Support user-defined recommended-word list and must-word list
ASBC 4.0 Test Split (50,000 sentences)

| Tool | (WS) prec | (WS) rec | (WS) f1 | (POS) acc |
|---|---|---|---|---|
| CkipTagger | 97.49% | 97.17% | 97.33% | 94.59% |
| CKIPWS (classic) | 95.85% | 95.96% | 95.91% | 90.62% |
| Jieba-zh_TW | 90.51% | 89.10% | 89.80% | -- |
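The (WS) f1 column can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. The sketch below is only a sanity check on the published figures; the helper name `f1_score` is illustrative and not part of CkipTagger. Because the table reports rounded precision and recall, the recomputed values should be expected to agree with the reported ones only to within about 0.01.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "CkipTagger": (97.49, 97.17, 97.33),
    "CKIPWS (classic)": (95.85, 95.96, 95.91),
    "Jieba-zh_TW": (90.51, 89.10, 89.80),
}

for tool, (precision, recall, reported) in rows.items():
    recomputed = f1_score(precision, recall)
    print(f"{tool}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```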
pip install -U ckiptagger[tf,gdown]
CkipTagger is a Python library hosted on PyPI. Requirements:
- python>=3.6
- tensorflow>=1.13.1 / tensorflow-gpu>=1.13.1 (one of them)
- gdown (optional, for downloading model files from google drive)
(Minimum installation) Use this if you have already set up tensorflow and would like to download the model files yourself.
pip install -U ckiptagger
(Complete installation) Use this if you have just set up a clean virtual environment and want everything, including GPU support.
pip install -U ckiptagger[tfgpu,gdown]
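After installing, it can help to confirm that the listed requirements are importable. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, not something CkipTagger provides, and it checks only that the packages can be found, not that tensorflow meets the >=1.13.1 version bound.

```python
import importlib.util
import sys


def check_environment():
    """Report which of the listed requirements can be found in this environment."""
    return {
        "python>=3.6": sys.version_info >= (3, 6),
        # tensorflow and tensorflow-gpu both install a module named "tensorflow".
        "tensorflow": importlib.util.find_spec("tensorflow") is not None,
        "gdown (optional)": importlib.util.find_spec("gdown") is not None,
        "ckiptagger": importlib.util.find_spec("ckiptagger") is not None,
    }


for name, ok in check_environment().items():
    print(f"{name}: {'found' if ok else 'missing'}")
```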
Complete demo script: The following sections assume:
from ckiptagger import data_utils, construct_dictionary, WS, POS, NER
The model files are available on several mirror sites.
You can download and extract them to the desired path using one of the included API functions.
# Downloads to ./ (2GB) and extracts to ./data/
# data_utils.download_data_url("./") # iis-ckip
data_utils.download_data_gdown("./") # gdrive-ckip
- ./data/model_ner/pos_list.txt -> POS tag list, see Wiki / Technical Report no. 93-05
- ./data/model_ner/label_list.txt -> Entity type list, see Wiki / OntoNotes Release 5.0
- ./data/embedding_* -> character/word embeddings, see Wiki
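If you downloaded and extracted the model files manually, a quick check that the documented paths exist can catch an incomplete extraction early. This is a minimal sketch that looks only for the paths named above; `missing_model_files` is a hypothetical helper, and the real data directory contains more files than it checks.

```python
import glob
import os


def missing_model_files(data_dir="./data"):
    """Return the documented paths (or patterns) that are absent from data_dir."""
    expected = [
        os.path.join(data_dir, "model_ner", "pos_list.txt"),
        os.path.join(data_dir, "model_ner", "label_list.txt"),
    ]
    missing = [path for path in expected if not os.path.isfile(path)]
    # The embedding files are documented only as the pattern embedding_*.
    pattern = os.path.join(data_dir, "embedding_*")
    if not glob.glob(pattern):
        missing.append(pattern)
    return missing


print(missing_model_files("./data"))
```

An empty list means every documented path was found.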
# To use GPU:
# 1. Install tensorflow-gpu (see Installation)
# 2. Set CUDA_VISIBLE_DEVICES environment variable, e.g. os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# 3. Set disable_cuda=False, e.g. ws = WS("./data", disable_cuda=False)
# To use CPU:
ws = WS("./data")
pos = POS("./data")
ner = NER("./data")
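The GPU steps above can be folded into a small helper so the same script runs on either device. This is a sketch under the assumption that `CUDA_VISIBLE_DEVICES` must be set before the models are constructed; `cuda_kwargs` is a hypothetical name, and only the `disable_cuda` argument comes from the text above.

```python
import os


def cuda_kwargs(gpu_id=None):
    """Return keyword arguments for WS/POS/NER, selecting a GPU when gpu_id is given."""
    if gpu_id is None:
        # CPU: pass nothing extra, matching the plain WS("./data") call.
        return {}
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return {"disable_cuda": False}


# Usage, assuming ckiptagger and the model files are available:
# from ckiptagger import WS
# ws = WS("./data", **cuda_kwargs(0))  # GPU 0
# ws = WS("./data", **cuda_kwargs())   # CPU
```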
You can supply words that WS should give special consideration to, along with their relative weights.
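word_to_weight = {
    "土地公": 1,
    "土地婆": 1,
    "公有": 2,
    "": 1,  # empty word: absent from the printed result
    "來亂的": "啦",  # non-numeric weight: absent from the printed result
    "緯來體育台": 1,
}
dictionary = construct_dictionary(word_to_weight)
print(dictionary)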
[(2, {'公有': 2.0}), (3, {'土地公': 1.0, '土地婆': 1.0}), (5, {'緯來體育台': 1.0})]
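Judging from the printed result, `construct_dictionary` appears to group valid words by length, convert the weights to floats, and skip entries such as the empty string or a non-numeric weight. The following is a rough pure-Python sketch of that observed behaviour, not the library's implementation; the function name `group_words_by_length` is hypothetical.

```python
def group_words_by_length(word_to_weight):
    """Group words by character length, keeping only non-empty words with numeric weights."""
    length_to_words = {}
    for word, weight in word_to_weight.items():
        if not word:
            continue  # skip the empty string
        try:
            weight = float(weight)
        except (TypeError, ValueError):
            continue  # skip weights that are not numbers
        length_to_words.setdefault(len(word), {})[word] = weight
    # Lengths are unique keys, so sorting compares only the lengths.
    return sorted(length_to_words.items())


example = {"土地公": 1, "土地婆": 1, "公有": 2, "": 1, "來亂的": "啦", "緯來體育台": 1}
print(group_words_by_length(example))
```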
sentence_list = [
    "… 你確定嗎… 不要再騙了……",
]
word_sentence_list = ws(
    sentence_list,
    # sentence_segmentation = True, # To consider delimiters
    # segment_delimiter_set = {",", "。", ":", "?", "!", ";"}, # This is the default set of delimiters
    # recommend_dictionary = dictionary1, # words in this dictionary are encouraged
    # coerce_dictionary = dictionary2, # words in this dictionary are forced
)
pos_sentence_list = pos(word_sentence_list)
entity_sentence_list = ner(word_sentence_list, pos_sentence_list)
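The three calls form a fixed chain: WS output feeds POS, and both feed NER. A minimal sketch of wrapping that chain in one function follows; `run_pipeline` is a hypothetical helper, and it assumes only the call shapes shown above, forwarding any extra keyword arguments (such as `recommend_dictionary`) to the WS call.

```python
def run_pipeline(sentence_list, ws, pos, ner, **ws_options):
    """Chain WS -> POS -> NER and return one (words, tags, entities) tuple per sentence."""
    word_sentence_list = ws(sentence_list, **ws_options)
    pos_sentence_list = pos(word_sentence_list)
    entity_sentence_list = ner(word_sentence_list, pos_sentence_list)
    return list(zip(word_sentence_list, pos_sentence_list, entity_sentence_list))


# Usage, assuming ws, pos, ner, and dictionary were created as in the earlier steps:
# results = run_pipeline(sentence_list, ws, pos, ner, recommend_dictionary=dictionary)
# for words, tags, entities in results:
#     print(words, tags, sorted(entities))
```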
del ws
del pos
del ner
def print_word_pos_sentence(word_sentence, pos_sentence):
    assert len(word_sentence) == len(pos_sentence)
    for word, pos in zip(word_sentence, pos_sentence):
        print(f"{word}({pos})", end="\u3000")
    print()  # end the line after the last word

for i, sentence in enumerate(sentence_list):
    print()
    print(f"'{sentence}'")
    print_word_pos_sentence(word_sentence_list[i], pos_sentence_list[i])
    for entity in sorted(entity_sentence_list[i]):
        print(entity)
傅達仁(Nb) 今(Nd) 將(D) 執行(VC) 安樂死(Na) ,(COMMACATEGORY) 卻(D) 突然(D) 爆出(VJ) 自己(Nh) 20(Neu) 年(Nf) 前(Ng) 遭(P) 緯來(Nb) 體育台(Na) 封殺(VC) ,(COMMACATEGORY) 他(Nh) 不(D) 懂(VK) 自己(Nh) 哪裡(Ncd) 得罪到(VJ) 電視台(Nc) 。(PERIODCATEGORY)
(0, 3, 'PERSON', '傅達仁')
(18, 22, 'DATE', '20年前')
(23, 28, 'ORG', '緯來體育台')
美國(Nc) 參議院(Nc) 針對(P) 今天(Nd) 總統(Na) 布什(Nb) 所(D) 提名(VC) 的(DE) 勞工部長(Na) 趙小蘭(Nb) 展開(VC) 認可(VC) 聽證會(Na) ,(COMMACATEGORY) 預料(VE) 她(Nh) 將(D) 會(D) 很(Dfa) 順利(VH) 通過(VC) 參議院(Nc) 支持(VC) ,(COMMACATEGORY) 成為(VG) 該(Nes) 國(Nc) 有史以來(D) 第一(Neu) 位(Nf) 的(DE) 華裔(Na) 女性(Na) 內閣(Na) 成員(Na) 。(PERIODCATEGORY)
(0, 2, 'GPE', '美國')
(2, 5, 'ORG', '參議院')
(7, 9, 'DATE', '今天')
(11, 13, 'PERSON', '布什')
(17, 21, 'ORG', '勞工部長')
(21, 24, 'PERSON', '趙小蘭')
(42, 45, 'ORG', '參議院')
(56, 58, 'ORDINAL', '第一')
(60, 62, 'NORP', '華裔')
(0, 3, 'PERSON', '土地公')
'… 你確定嗎… 不要再騙了……'
最多(VH) 容納(VJ) 59,000(Neu) 個(Nf) 人(Na) ,(COMMACATEGORY) 或(Caa) 5.9萬(Neu) 人(Na) ,(COMMACATEGORY) 再(D) 多(D) 就(D) 不行(VH) 了(T) .(PERIODCATEGORY) 這(Nep) 是(SHI) 環評(Na) 的(DE) 結論(Na) .(PERIODCATEGORY)
(4, 10, 'CARDINAL', '59,000')
(14, 18, 'CARDINAL', '5.9萬')
科長(Na) 說(VE) :1,(Neu) 坪數(Na) 對(P) 人數(Na) 為(VG) 1:3(Neu) 。(PERIODCATEGORY) 2(Neu) ,(COMMACATEGORY) 可以(D) 再(D) 增加(VHC) 。(PERIODCATEGORY)
(4, 6, 'CARDINAL', '1,')
(12, 13, 'CARDINAL', '1')
(14, 15, 'CARDINAL', '3')
(16, 17, 'CARDINAL', '2')
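Each entity tuple in the output above appears to be (start, end, type, text), where start and end are character offsets into the original sentence and end is exclusive; for example, (18, 22, 'DATE', '20年前') covers the words 20, 年, and 前. Under that reading, an entity can be mapped back to the segmented words it overlaps. This is a minimal sketch; `char_span_to_word_indices` is a hypothetical helper, not part of CkipTagger.

```python
def char_span_to_word_indices(word_sentence, start, end):
    """Return indices of the words that overlap the character span [start, end)."""
    indices = []
    offset = 0
    for i, word in enumerate(word_sentence):
        word_start, word_end = offset, offset + len(word)
        if word_start < end and word_end > start:
            indices.append(i)
        offset = word_end
    return indices


# First words of the first example sentence, as segmented in the output above.
words = ["傅達仁", "今", "將", "執行", "安樂死", ",", "卻", "突然", "爆出", "自己",
         "20", "年", "前", "遭", "緯來", "體育台", "封殺"]

for start, end, label, text in [(0, 3, "PERSON", "傅達仁"), (18, 22, "DATE", "20年前")]:
    covered = [words[i] for i in char_span_to_word_indices(words, start, end)]
    print(label, text, covered)
```

Note that an entity such as 緯來體育台 can span several segmented words, so the mapping is one-to-many.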
Please see:
Peng-Hsuan Li, Tsu-Jui Fu, and Wei-Yun Ma. 2020. Why Attention? Analyze BiLSTM Deficiency and Its Remedies in the Case of NER. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI/arXiv).
Copyright (c) 2019 CKIP Lab.
This Work is licensed under the GNU General Public License v3.0 without any warranties. The full license text is available in the file named COPYING-GPL-3.0. Any person obtaining a copy of this Work and associated documentation files is granted the rights to use, copy, modify, merge, publish, and distribute the Work for any purpose. However, if any work is based upon this Work and hence constitutes a Derivative Work, the GPL-3.0 license requires distributions of the Work and the Derivative Work to remain under the same license or a similar license with the Source Code provision obligation.
For commercial license without the Source Code conveying liability, please contact <ckiptagger_cm [at]>
For other questions, please contact <ckiptagger [at]>