Chinese-NLP-Corpus

A collection of Chinese NLP corpora

Open Domain

Open-domain corpora, covering law, social media, and comments

Name	Description	Link
ZhuXian(诛仙)	POS and word segmentation annotations for the novel Zhu Xian (《诛仙》)	zhuxian
CNLC	Data from the National Language Commission (国家语言委员会), split train:dev:test = 8:1:1	CNLC

* The URLs in the table are out of date; the data can be found via the following reference.
Reference: https://github.com/hankcs/multi-criteria-cws/tree/master/data
the details of the corpus
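
The CNLC entry above mentions a train:dev:test ratio of 8:1:1. As a minimal sketch of what such a ratio means in practice, the snippet below shuffles a list of examples and cuts it into three parts. The function name `split_8_1_1` and the `seed` parameter are hypothetical, and this is not a description of how the CNLC split itself was produced.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle items reproducibly and split them 8:1:1 into train/dev/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    n = len(shuffled)
    n_train = n * 8 // 10
    n_dev = n // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder goes to test
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = split_8_1_1(range(100))
    print(len(train), len(dev), len(test))
```

When the number of examples is not a multiple of 10, the leftover examples end up in the test part; adjust this if a different rounding is preferred.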

Name	Description	Link	notes
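CAIL2018	Data from the 2018 CAIL ("法研杯") Legal AI Challenge (tasks: charge prediction, law article recommendation, prison term prediction). The dataset contains 2.68 million criminal-law documents, covering 183 charges and 202 law articles; prison terms include 0-25 years, life imprisonment, and the death penalty.	CAIL2018	competition website, GitHub
CSL - Classification	Paper abstracts selected from natural-science journals in the Chinese Scientific Literature dataset (CSL), classified by discipline according to the National Natural Science Foundation of China.	CSL - Classification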

Name	Description	Link	notes
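ChnSentiCorp_htl_all	More than 7,000 hotel reviews: over 5,000 positive and over 2,000 negative	ChnSentiCorp_htl_all
waimai_10k	User reviews collected from a food-delivery platform: 4,000 positive and about 8,000 negative	waimai_10k
online_shopping_10_cats	10 categories (books, tablets, mobile phones, fruit, shampoo, water heaters, Mengniu, clothing, computers, hotels), more than 60,000 reviews in total, with about 30,000 positive and 30,000 negative	online_shopping_10_cats
weibo_senti_100k	More than 100,000 sentiment-labeled Sina Weibo posts, with about 50,000 positive and 50,000 negative	weibo_senti_100k	Reference page. This dataset contains many emoji, and results may depend on them; it is best to remove emoji before training.
simplifyweibo_4_moods	More than 360,000 sentiment-labeled Sina Weibo posts with 4 emotions: about 200,000 joy, and about 50,000 each of anger, disgust, and low mood	simplifyweibo_4_moods
dmsc_v2	28 movies, more than 700,000 users, more than 2 million ratings/reviews	dmsc_v2
yf_dianping	240,000 restaurants, 540,000 users, 4.4 million reviews/ratings	yf_dianping
yf_amazon	520,000 products, more than 1,100 categories, 1.42 million users, 7.2 million reviews/ratings	yf_amazon
ez_douban	More than 50,000 movies (over 30,000 with titles, over 20,000 without), 28,000 users, 2.8 million ratings	ez_douban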
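
The note on weibo_senti_100k suggests removing emoji before training. The snippet below is one possible sketch of that preprocessing step, assuming "emoji" means Unicode pictographs in a few common blocks. The names `EMOJI_PATTERN` and `strip_emoji` are hypothetical, and the code point ranges are illustrative rather than an exhaustive emoji definition (for example, flag sequences are not covered).

```python
import re

# Assumption: these ranges approximate "emoji"; they are not a complete list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16
    "\u200D"                 # zero-width joiner
    "]+"
)


def strip_emoji(text: str) -> str:
    """Remove characters matched by EMOJI_PATTERN and trim outer whitespace."""
    return EMOJI_PATTERN.sub("", text).strip()


if __name__ == "__main__":
    print(strip_emoji("今天很开心😀👍"))
```

Whether removing emoji helps is task-dependent; comparing a model trained with and without this step is a reasonable way to check the note's claim on your own setup.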

Description	Link	notes
Chinese NLP Corpus	https://github.com/SophonPlus/ChineseNlpCorpus
awesome-chinese-nlp/Corpus 中文语料	https://github.com/crownpku/Awesome-Chinese-NLP#corpus-中文语料
Large Scale Chinese Corpus for NLP	https://github.com/brightmart/nlp_chinese_corpus
Chinese Natural Language Processing Datasets (中文自然语言处理数据集)	https://github.com/InsaneLife/ChineseNLPCorpus
funNLP	https://github.com/fighting41love/funNLP

Corpora for the Chinese medical domain, covering medical terminology, QA, and clinical NER

Name	Description	Link	notes
ChineseBLUE	The Chinese Biomedical Language Understanding Evaluation benchmark by Alibaba	ChineseBLUE	Conceptualized Representation Learning for Chinese Biomedical Text Mining

Name	Description	Link	notes
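AMTTL	Word segmentation dataset for medical language. The source appears to be medical forums, so the data leans toward open-domain language and differs from the wording found in formal medical text.	AMTTL	Adaptive Multi-Task Transfer Learning for Chinese Word Segmentation in Medical Text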

Name	Description	Link	notes
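CNMER	Chinese medical entity recognition dataset; entities include body parts, symptoms and signs, examinations, diseases, and treatments.	CNMER	Probably the CCKS2017 data.
CNMER	Recognition of 6 named entity types: diseases and diagnoses, anatomical sites, imaging examinations, laboratory tests, surgeries, and drugs	CCKS2018 data
CNMER	Chinese medical named entity recognition	CCKS2019 data	Shared via OpenKG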

Name	Description	Link	notes
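cMedQA	Data from an online medical forum, containing 54,000 questions and about 100,000 corresponding answers.	cMedQA	Chinese Medical Question Answer Matching Using End-to-End Character-Level Multi-Scale CNNs
cMedQA2	Extended version of cMedQA, containing about 100,000 medical questions and about 200,000 corresponding answers.	cMedQA2	Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection
webMedQA	Another online medical QA dataset, containing 60,000 questions and 310,000 answers, with question categories included.	webMedQA	Applying deep matching networks to Chinese medical question answering: A study and a dataset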

Name	Description	Link
medical-books	Open-source medical books in LaTeX	medical-books
awesome_Chinese_medical_NLP	A curated list of public resources for Chinese medical NLP	awesome_Chinese_medical_NLP
Chinese_medical_NLP	Evaluation datasets, papers, and related resources for medical NLP (mainly Chinese).	Chinese_medical_NLP