For CCL-2021 Shared Task: 2021-04-01
Official Website: 2021-05-01
Test Data Release: 2021-06-27
Words are the fundamental unit in natural language processing (NLP), and in many languages they are easy to obtain because natural delimiters separate them. The situation is entirely different for Chinese: Chinese sentences consist of continuous characters with no natural delimiters. Chinese word segmentation (CWS), which splits Chinese sentences into independent words, has therefore become the first (pre-processing) step of Chinese NLP tasks, and it significantly affects the quality of downstream Chinese NLP tasks.
Unlike the popular CWS benchmark datasets, we extract texts from downstream NLP tasks associated with Chinese. The corpus is open-domain and closer to real-world scenarios. We also propose a relatively fine-grained criterion suited to downstream NLP tasks. Corpus consistency is an unavoidable problem for every segmentation benchmark because of the high linguistic complexity of languages, so we propose Consistency Checking to improve the quality of 'Nihao'.
To demonstrate the quality of the 'Nihao' corpus, we adopt two strong existing methods. We use the work of (Huang et al., 2020) to validate the consistency of the corpus (note: we use K-fold cross-validation, training the model on the test data itself), and we use the work of (Sun and Deng, 2018) as the unsupervised baseline.
The supervised baseline:
K-fold cross-validation (K=5) | P | R | F |
---|---|---|---|
train-1,2,3,4 / test-5 | 0.9855 | 0.9852 | 0.9854 |
train-1,2,3,5 / test-4 | 0.9854 | 0.9875 | 0.9865 |
train-1,2,4,5 / test-3 | 0.9847 | 0.9863 | 0.9855 |
train-1,3,4,5 / test-2 | 0.9850 | 0.9856 | 0.9853 |
train-2,3,4,5 / test-1 | 0.9845 | 0.9865 | 0.9855 |
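The split scheme in the table (train on four parts, test on the remaining one) can be sketched as follows. This is a minimal illustration, not the organizers' script: the function names `make_folds` and `cross_validation_splits` are hypothetical, and the contiguous, nearly equal partitioning is an assumption, since the page does not say how the five parts were formed.

```python
from typing import Iterator, List, Sequence, Tuple


def make_folds(sentences: Sequence[str], k: int = 5) -> List[List[str]]:
    """Partition sentences into k contiguous, nearly equal folds (assumed scheme)."""
    base, extra = divmod(len(sentences), k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional sentence.
        size = base + (1 if i < extra else 0)
        folds.append(list(sentences[start:start + size]))
        start += size
    return folds


def cross_validation_splits(
    sentences: Sequence[str], k: int = 5
) -> Iterator[Tuple[int, List[str], List[str]]]:
    """Yield (held_out_fold_number, train_sentences, test_sentences), 1-indexed."""
    folds = make_folds(sentences, k)
    # Iterate from the last fold to the first, mirroring the row order of the table.
    for held_out in range(k - 1, -1, -1):
        train = [s for i, fold in enumerate(folds) if i != held_out for s in fold]
        yield held_out + 1, train, folds[held_out]


if __name__ == "__main__":
    data = [f"sentence-{i}" for i in range(12)]  # placeholder data
    for fold_id, train, test in cross_validation_splits(data):
        print(f"test-{fold_id}: {len(train)} train / {len(test)} test sentences")
```

Each sentence appears in exactly one test fold, so every part of the data is scored by a model that was not trained on it.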
The unsupervised baseline:
Method | P | R | F |
---|---|---|---|
(Sun and Deng, 2018) | 0.8064 | 0.7847 | 0.7954 |
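For reference, P, R, and F in CWS evaluation are conventionally computed over words: a predicted word counts as correct only if both of its boundaries match the gold segmentation. The sketch below assumes that convention (the page does not state the exact scoring script), and the helper names `to_spans` and `prf` are hypothetical.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F1 over parallel lists of segmented sentences."""
    correct = gold_total = pred_total = 0
    for gold_words, pred_words in zip(gold, pred):
        gold_spans, pred_spans = to_spans(gold_words), to_spans(pred_words)
        correct += len(gold_spans & pred_spans)  # words with both boundaries right
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [["我", "喜欢", "自然", "语言", "处理"]]
    pred = [["我", "喜欢", "自然语言", "处理"]]
    p, r, f = prf(gold, pred)
    print(f"P={p:.4f} R={r:.4f} F={f:.4f}")
```

In this toy example the prediction merges "自然" and "语言" into one word, so three of its four words match the five gold words.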
We provide several external resources for a better understanding of 'Nihao' and unsupervised segmentation methods.
- The segmentation criterion, with the dictionary of Chinese affixes. https://github.com/koukaiu/nihao/res/criterion.
- The dictionary of commonly used long words, modified from (Cai and Zhao, 2016). https://github.com/koukaiu/nihao/res/idioms.
External resources (Download Link)
- DUTNLP Lab: the Natural Language Processing Lab at Dalian University of Technology
Leader: Degen Huang
Members: Kaiyu Huang, Wei Liu, Hao Yu
- Contact us.
If you have questions, suggestions, or bug reports, please email us (unsupervisedCWS@163.com).
- Kaiyu Huang, Degen Huang, Zhuang Liu and Fengran Mo. 2020. A Joint Multiple Criteria Model in Transfer Learning for Cross-domain Chinese Word Segmentation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3841–3847, Online. Association for Computational Linguistics.
- Zhiqing Sun and Zhi-Hong Deng. 2018. Unsupervised Neural Word Segmentation for Chinese via Segmental Language Modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4915–4920, Brussels, Belgium. Association for Computational Linguistics.
- Deng Cai and Hai Zhao. 2016. Neural Word Segmentation Learning for Chinese. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 409–420, Berlin, Germany. Association for Computational Linguistics.
Copyright © 2021 DUTNLP Lab. All rights reserved.