corpus-data
There are 163 repositories under the corpus-data topic.
esbatmop/MNBVC
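MNBVC (Massive Never-ending BT Vast Chinese corpus): a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, script lines, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.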
PlexPt/chatgpt-corpus
ChatGPT Chinese corpus: dialogue, novel, and customer-service corpora for training large models
shijiebei2009/CEC-Corpus
:books: Chinese Emergency Corpus (CEC), Semantic Intelligence Laboratory, Shanghai University
sheepzh/poetry
The most complete corpus of modern Chinese-language poetry on Earth: 3k+ poets, 80K+ poems, 15M+ characters
gkiril/oie-resources
A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.
guhhhhaa/4675-scifi
Chinese NLP corpus of Chinese science fiction: about 4675 Chinese science fiction novels. Chinese science fiction NLP corpus, text corpus, and text database; science fiction corpus
grammarly/ua-gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
guhhhhaa/wula-scifi
Chinese NLP corpus of Chinese science fiction: archive of the Ark Plan of the Ula (乌拉) science fiction website. Chinese science fiction NLP corpus, text corpus, and text database; science fiction corpus
NathanDuran/Switchboard-Corpus
Utilities for Processing the Switchboard Dialogue Act Corpus
dataset-vn/DANeS
DANeS is an open-source e-newspaper dataset produced in collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
zonghui0228/BioMedical-NLP-corpus
Biomedical NLP Corpus or Datasets.
1837669410/bilibili_comment_crawl
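Crawls the comments under bilibili videos. Latest release! ⚠ This code is for learning purposes only; the author accepts no responsibility for any other use.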
hailiang-wang/egret-wenda-corpus
A Public Corpus for Machine Learning
johentsch/ms3
A parser for annotated MuseScore 3 files.
shijiebei2009/CEEC-Corpus
:books: Chinese Environment Emergency Corpus (CEEC), Semantic Intelligence Laboratory, Shanghai University
KehaoWu/Jinyong-Corpus
Dictionary of Jin Yong's 15 novels
uma-pi1/OPIEC
Reading the data from OPIEC - an Open Information Extraction corpus
CanCLID/canto-filter
Cantonese text filter for screening Cantonese corpus data
jaaack-wang/ccnc
CCNC: A Comprehensive Chinese Name Corpus, a large-scale corpus of Chinese names (3.65M name samples).
NathanDuran/MRDA-Corpus
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
PolMine/GermaParlTEI
GermaParl: Corpus of Plenary Protocols of the German Bundestag (TEI Format)
yc9701/pansori-tedxkr-corpus
Korean ASR Corpus generated from TEDx talks
ziegler-ingo/CRAFT
Code, datasets, and checkpoints for the paper "CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation"
fbougares/TSAC
Tunisian Sentiment Analysis Corpus.
spianmo/MultiClassify_LSTM_ForChinese
Uses a Bi-LSTM neural network to classify Chinese text sentiment into eight categories (like, disgust, happiness, sadness, anger, surprise, fear, none)
PakUrdu-Research-Center/awesome-urdu
Repository dedicated to a collection of resources and supporting material for Urdu language processing tasks
undertheseanlp/corpus.viwiki
Vietnamese Wikipedia Corpus
luciamariaalvarezcrespo/GalMisoCorpus2023
:bookmark_tabs: Galician corpus for misogyny detection
maxent-ai/Datasets
Datasets with text data for use in NLP, text analysis, information extraction, and ML research.
wiragotama/TIARA-annotationTool
An Interactive Tool for Annotating Discourse Structure and Text Improvement
PyThaiNLP/thaigov-v2-corpus
Thai news dataset from the Thai government website.
UIUCLearningLanguageLab/AOCHILDES
Python API for loading language data from the American-English CHILDES database
filipefilardi/text-mining
Generic corpus-cleaning script made with the tm package
MarsPanther/crawl-for-parallel-corpora
Simple bs4-based web crawler for gathering a corpus for statistical machine translation