corpus
There are 848 repositories under the corpus topic.
brightmart/nlp_chinese_corpus
Large Scale Chinese Corpus for NLP
dariusk/corpora
A collection of small corpuses of interesting data for the creation of bots and similar stuff.
CLUEbenchmark/CLUEDatasetSearch
Search all Chinese NLP datasets, with commonly used English NLP datasets included
CLUEbenchmark/CLUE
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
wainshine/Chinese-Names-Corpus
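Chinese personal names corpus and name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and person-name entity recognition.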
adbar/trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
endymecy/awesome-deeplearning-resources
Deep learning and deep reinforcement learning research papers and some code
lucasjinreal/weibo_terminater
The final Weibo crawler: scrape anything from Weibo, including comments, post contents, and followers. The Terminator
candlewill/Dialog_Corpus
Datasets for training Chinese and English chatbot (dialogue) systems
fendouai/Awesome-Chatbot
Awesome Chatbot Projects, Corpus, Papers, Tutorials. Chinese Chatbot =>:
gunthercox/chatterbot-corpus
A multilingual dialog corpus
wainshine/Company-Names-Corpus
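Company names corpus and organization names corpus. Covers company short names, abbreviations, brand terms, and enterprise names. Useful for Chinese word segmentation and organization-name entity recognition.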
chatopera/insuranceqa-corpus-zh
Insurance industry corpus for chatbots
NiuTrans/Classical-Modern
A very comprehensive parallel corpus of Classical Chinese (ancient texts) and modern Chinese
CLUEbenchmark/CLUECorpus2020
Large-scale Pre-training Corpus for Chinese (100G)
OYE93/Chinese-NLP-Corpus
Collections of Chinese NLP corpora
tensorlayer/seq2seq-chatbot
Chatbot in 200 lines of code using TensorLayer
quanteda/quanteda
An R package for the Quantitative Analysis of Textual Data
CLUEbenchmark/CLUEPretrainedModels
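A collection of high-quality Chinese pre-trained models: state-of-the-art large models, the fastest small models, and models specialized for similarity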
soskek/bookcorpus
Crawl BookCorpus
PlexPt/chatgpt-corpus
ChatGPT Chinese corpus: dialogue, fiction, and customer-service corpora for training large models
mhbashari/awesome-persian-nlp-ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
CBLUEbenchmark/CBLUE
CBLUE, a Chinese medical information processing benchmark: A Chinese Biomedical Language Understanding Evaluation Benchmark
BLKSerene/Wordless
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
ko-nlp/Korpora
Korean corpus repository
chatopera/efaqa-corpus-zh
❤️ Emotional First Aid Dataset: a psychological counseling Q&A and chatbot corpus
crownpku/Small-Chinese-Corpus
Some useful small Chinese corpus datasets
louisowen6/NLP_bahasa_resources
A Curated List of Datasets and Usable Library Resources for NLP in Bahasa Indonesia
several27/FakeNewsCorpus
A dataset of millions of news articles scraped from a curated list of data sources.
GAIR-NLP/MathPile
Generative AI for Math: MathPile
guhhhhaa/4675-scifi
A Chinese NLP corpus of Chinese science fiction: a text corpus of about 4675 Chinese science fiction novels
ko-ichi-h/khcoder
KH Coder: for Quantitative Content Analysis or Text Mining
mesolitica/malaysian-dataset
We gather Malaysian datasets! https://malaysian-dataset.readthedocs.io/
grammarly/ua-gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
lil-lab/nlvr
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.