corpus-data

There are 168 repositories under the corpus-data topic.

esbatmop/MNBVC
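MNBVC (Massive Never-ending BT Vast Chinese corpus): an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, including even "Martian script" (火星文) text. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.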
4k 70 61281
PlexPt/chatgpt-corpus
ChatGPT Chinese corpus: dialogue corpus, novel corpus, and customer-service corpus for training large models
913 7 4145
shijiebei2009/CEC-Corpus
:books: Chinese Emergency Corpus (CEC), Semantic Intelligence Laboratory, Shanghai University
716 30 7169
sheepzh/poetry
The most complete corpus of modern Chinese-language poetry on Earth: 3k+ poets, 80K+ poems, 15M+ characters
Language: Python 704 9 683
gkiril/oie-resources
A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.
499 24 059
guhhhhaa/4675-scifi
Chinese NLP corpus of Chinese science fiction: about 4675 Chinese science fiction novels. A Chinese science fiction NLP corpus, text corpus, and text database.
422 7 065
grammarly/ua-gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Language: Macaulay2 263 12 622
guhhhhaa/wula-scifi
Chinese NLP corpus of Chinese science fiction: archive of the Ark Plan of the Ula (乌拉) Science Fiction Website. A Chinese science fiction NLP corpus, text corpus, and text database.
122 2 026
NathanDuran/Switchboard-Corpus
Utilities for Processing the Switchboard Dialogue Act Corpus
Language: Python 70 0 214
aplmikex/deduplication_mnbvc
Text deduplication
Language: Python 69 2 212
dataset-vn/DANeS
DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
Language: Python 66 3 014
clarin-eric/ParlaMint
ParlaMint: Comparable Parliamentary Corpora
Language: XSLT 64 25 51253
LemonAttn/bilibili_comment_crawl
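Crawls the comments under bilibili videos. Newly released! ⚠ This code is for learning purposes only; the author accepts no responsibility for any other use.
Language: Python 62 0 62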
zonghui0228/BioMedical-NLP-corpus
Biomedical NLP Corpus or Datasets.
61 1 06
johentsch/ms3
A parser for annotated MuseScore 3 files.
Language: Python 49 1 763
shijiebei2009/CEEC-Corpus
:books: Chinese Environment Emergency Corpus (CEEC), Semantic Intelligence Laboratory, Shanghai University
46 2 115
hailiang-wang/egret-wenda-corpus
A Public Corpus for Machine Learning
Language: JavaScript 44 7 018
KehaoWu/Jinyong-Corpus
Dictionary for Jin Yong's 15 novels
43 2 016
CanCLID/canto-filter
Cantonese text filter (粵文語料篩選器)
Language: Python 41 8 14
uma-pi1/OPIEC
Reading the data from OPIEC - an Open Information Extraction corpus
Language: Java 38 4 26
jaaack-wang/ccnc
CCNC: A Comprehensive Chinese Name Corpus (3.65M name samples).
Language: Jupyter Notebook 37 1 010
PolMine/GermaParlTEI
GermaParl: Corpus of Plenary Protocols of the German Bundestag (TEI Format)
33 5 19
NathanDuran/MRDA-Corpus
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
Language: Python 32 0 17
ziegler-ingo/CRAFT
Code, datasets, and checkpoints for the paper "CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation"
Language: Python 28 2 010
fbougares/TSAC
Tunisian Sentiment Analysis Corpus.
27 2 011
yc9701/pansori-tedxkr-corpus
Korean ASR Corpus generated from TEDx talks
27 2 04
spianmo/MultiClassify_LSTM_ForChinese
Use Bi-LSTM neural network to classify Chinese text sentiment, including eight categories (like, disgust, happiness, sadness, anger, surprise, fear, none)
Language: Python 26 2 05
sagesolar/Corpus-of-Taylor-Swift
This is a dataset consisting of all song lyric words found on all of Taylor Swift's studio albums (up to and including TTPD), as well as a selection of other songs written by her.
24 1 03
undertheseanlp/corpus.viwiki
Vietnamese Wikipedia Corpus
Language: Python 20 1 08
PakUrdu-Research-Center/awesome-urdu
Repository dedicated to a collection of resources and supporting material for Urdu language processing tasks
19 4 08
PyThaiNLP/thaigov-v2-corpus
Thai News Dataset from Thai government website.
Language: Jupyter Notebook 18 2 01
luciamariaalvarezcrespo/GalMisoCorpus2023
:bookmark_tabs: Galician corpus for misogyny detection
Language: Python 17 2 00
maxent-ai/Datasets
Datasets with text data for use in NLP, text analysis, information extraction, and ML research.
Language: Jupyter Notebook 16 4 03
wiragotama/TIARA-annotationTool
An Interactive Tool for Annotating Discourse Structure and Text Improvement
Language: JavaScript 16 1 03
dohliam/hawaiian-corpus
Data from a corpus of written Hawaiian
15 3 10
UIUCLearningLanguageLab/AOCHILDES
Python API for loading language data from the American-English CHILDES database
Language: Python 15 1 13
