corpus

There are 848 repositories under the corpus topic.

brightmart/nlp_chinese_corpus
Large Scale Chinese Corpus for NLP
9.2k 285 441.5k
dariusk/corpora
A collection of small corpuses of interesting data for the creation of bots and similar stuff.
Language: JavaScript
4.9k 180 251.3k
CLUEbenchmark/CLUEDatasetSearch
Search all Chinese NLP datasets, with commonly used English NLP datasets included
Language: Python
3.9k 61 12598
CLUEbenchmark/CLUE
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Language: Python
3.9k 89 99540
wainshine/Chinese-Names-Corpus
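Chinese personal names corpus. Name generator. Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, English names. Useful for Chinese word segmentation and person-name entity recognition.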
3.9k 106 27976
adbar/trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Language: Python
3k 28 321228
endymecy/awesome-deeplearning-resources
Deep learning and deep reinforcement learning research papers and some code
2.8k 221 4664
lucasjinreal/weibo_terminater
Final Weibo Crawler: scrape anything from Weibo (comments, Weibo contents, followers, anything). The Terminator
Language: Python
2.3k 168 55457
candlewill/Dialog_Corpus
Datasets for training Chinese and English dialogue (chatbot) systems
Language: Python
2k 84 2501
fendouai/Awesome-Chatbot
Awesome Chatbot Projects, Corpus, Papers, Tutorials. Chinese Chatbot =>:
Language: Python
2k 104 1407
gunthercox/chatterbot-corpus
A multilingual dialog corpus
Language: Python
1.3k 69 811.2k
wainshine/Company-Names-Corpus
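Company names corpus. Organization names corpus. Company short names, abbreviations, brand terms, enterprise names. Useful for Chinese word segmentation and organization-name entity recognition.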
1.2k 48 5374
chatopera/insuranceqa-corpus-zh
:helicopter: Insurance industry corpus, chatbot
Language: Python
999 56 25340
NiuTrans/Classical-Modern
A very comprehensive parallel corpus of Classical Chinese (ancient texts) and modern Chinese
Language: Python
925 12 14200
CLUEbenchmark/CLUECorpus2020
Large-scale Pre-training Corpus for Chinese: 100G of Chinese pre-training data
889 21 1280
OYE93/Chinese-NLP-Corpus
Collections of Chinese NLP corpus
Language: Python
852 15 2207
tensorlayer/seq2seq-chatbot
Chatbot in 200 lines of code using TensorLayer
Language: Python
835 41 40316
quanteda/quanteda
An R package for the Quantitative Analysis of Textual Data
Language: R
829 53 1.3k185
CLUEbenchmark/CLUEPretrainedModels
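A collection of high-quality Chinese pre-trained models: state-of-the-art large models, fastest small models, and specialized similarity models
Language: Python
792 19 1796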
soskek/bookcorpus
Crawl BookCorpus
Language: Python
782 17 15109
PlexPt/chatgpt-corpus
ChatGPT Chinese corpus: dialogue corpus, fiction corpus, and customer-service corpus for training large models
776 7 4131
mhbashari/awesome-persian-nlp-ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
702 46 10113
CBLUEbenchmark/CBLUE
CBLUE, a Chinese medical information processing benchmark: A Chinese Biomedical Language Understanding Evaluation Benchmark
Language: Python
679 18 10120
BLKSerene/Wordless
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Language: Python
673 29 2388
ko-nlp/Korpora
Korean corpus repository
Language: Python
659 26 10178
nonamestreet/weixin_public_corpus
WeChat official accounts corpus
564 34 7165
chatopera/efaqa-corpus-zh
❤️ Emotional First Aid Dataset: psychological counseling Q&A and chatbot corpus
Language: Python
562 15 080
crownpku/Small-Chinese-Corpus
Some useful small Chinese corpus datasets
526 33 2161
louisowen6/NLP_bahasa_resources
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
445 8 1119
several27/FakeNewsCorpus
A dataset of millions of news articles scraped from a curated list of data sources.
377 16 1796
GAIR-NLP/MathPile
Generative AI for Math: MathPile
Language: JavaScript
349 8 418
guhhhhaa/4675-scifi
Chinese NLP corpus of Chinese science fiction: about 4675 Chinese science fiction novels. Chinese science fiction NLP corpus, text corpus, and text database.
327 8 058
ko-ichi-h/khcoder
KH Coder: for Quantitative Content Analysis or Text Mining
Language: Perl
306 21 56295
mesolitica/malaysian-dataset
We gather Malaysian dataset! https://malaysian-dataset.readthedocs.io/
Language: Jupyter Notebook
288 19 320103
grammarly/ua-gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Language: Macaulay2
255 13 521
lil-lab/nlvr
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Language: HTML
251 8 959
