corpora

There are 157 repositories under the corpora topic.

nltk/nltk_data
NLTK Data
Language: Python 1.5k 44 1361.1k
juand-r/entity-recognition-datasets
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
Language: Python 1.5k 41 13247
piskvorky/gensim-data
Data repository for pretrained NLP models and NLP corpora.
Language: Python 994 39 43135
nonamestreet/weixin_public_corpus
WeChat public account corpus
574 35 7166
AI4Bharat/indicnlp_catalog
A collaborative catalog of NLP resources for Indic languages
565 34 21981
natasha/corus
Links to Russian corpora + Python functions for loading and parsing
Language: Jupyter Notebook 287 18 7821
PlanTL-GOB-ES/lm-spanish
Official source for Spanish language models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).
Language: Python 253 28 521
OpenCorpora/opencorpora
A web-based engine for creating and annotating textual corpora
Language: PHP 241 28 87223
ko-nlp/Open-korean-corpora
Open Korean NLP Dataset Curation for the Users All Around the Globe
142 10 511
zliucr/CrossNER
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)
Language: Python 126 4 1026
jfainberg/self_dialogue_corpus
The Self-dialogue Corpus - a collection of self-dialogues across music, movies and sports
Language: Python 106 12 125
josecannete/spanish-corpora
Unannotated Spanish 3 Billion Words Corpora
Language: Python 93 4 110
saidziani/Arabic-News-Article-Classification
Automatic document categorization assigns a category to a text based on the information it contains. This project follows several supervised machine learning approaches.
Language: Python 91 6 724
CanCLID/awesome-cantonese-nlp
A curated list of resources dedicated to Natural Language Processing (NLP) of Cantonese | 粵語 NLP
85 7 04
kgjerde/corporaexplorer
An R package for dynamic exploration of text collections
Language: R 64 7 294
czcorpus/kontext
An advanced, extensible web front-end for the Manatee-open corpus search engine
Language: TypeScript 63 11 1.3k22
jacklanda/CCAE
The Official Repository for 👉 CCAE: A Corpus of Chinese-based Asian Englishes @ NLPCC 2023
Language: Python 59 1 03
hu-ner/huner
Named Entity Recognition for biomedical entities
Language: Python 47 9 2211
M4t1ss/parallel-corpora-tools
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
Language: PHP 41 5 517
uma-pi1/OPIEC
Reading the data from OPIEC - an Open Information Extraction corpus
Language: Java 36 5 26
JuliaText/CorpusLoaders.jl
A variety of loaders for various NLP corpora.
Language: Julia 32 6 2213
kili-technology/awesome-datasets
A comprehensive list of annotated training datasets classified by use case.
31 3 06
PlanTL-GOB-ES/lm-biomedical-clinical-es
Official source for Spanish pretrained biomedical and clinical language models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).
Language: Python 26 6 42
CyberZHG/wiki-dump-reader
Extract corpora from Wikipedia dumps
Language: Python 25 4 27
texttechnologylab/GerParCor
German Parliamentary Corpus (GerParCor)
Language: Java 23 3 27
digitallinguistics/data-format
The Data Format for Digital Linguistics (DaFoDiL)
Language: JavaScript 22 5 3120
dkalpakchi/awesome-swedish-nlp
A curated list of resources for natural language processing (NLP) in Swedish
22 3 02
Esukhia/Corpora
Repository for Tibetan corpora
Language: Python 21 9 22
EdwardSeley/lyrics-corpora
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts
Language: Python 19 4 11
gambolputty/textstelle
Textstelle is a collection of corpora for the creation of bots and other things that generate text 🤖
19 3 03
dterg/biomedical_corpora
Table compiling the biomedical corpora available for named entity recognition (some are also suitable for association detection). The first version was published as part of the paper: Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152 . If you would like to add other (or your own) corpora, please submit a pull request and I'll happily approve it.
18 6 04
WladimirSidorenko/PotTS
The Potsdam Twitter Sentiment Corpus
Language: Python 17 7 34
esantus/EVALution
Dataset containing Semantic Relations and Metadata, for Training and Evaluating Distributional Semantic Models in English and Mandarin Chinese
16 5 06
NetherlandsForensicInstitute/demeuk
Demeuk is a simple tool to clean up corpora (like dictionaries) or any dataset containing plain text strings.
Language: Python 16 21 154
korenyoni/opus-api
OPUS (opus.nlpl.eu) Python3 API
Language: Python 14 2 55
filipefilardi/text-mining
Generic corpus-cleaning script built with the tm package
Language: R 13 3 00
