corpora
There are 157 repositories under the corpora topic.
juand-r/entity-recognition-datasets
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
nltk/nltk_data
NLTK Data
piskvorky/gensim-data
Data repository for pretrained NLP models and NLP corpora.
AI4Bharat/indicnlp_catalog
A collaborative catalog of NLP resources for Indic languages
natasha/corus
Links to Russian corpora + Python functions for loading and parsing
PlanTL-GOB-ES/lm-spanish
Official source for Spanish language models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).
OpenCorpora/opencorpora
A web-based engine for creating and annotating textual corpora
ko-nlp/Open-korean-corpora
Open Korean NLP Dataset Curation for the Users All Around the Globe
zliucr/CrossNER
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)
jfainberg/self_dialogue_corpus
The Self-dialogue Corpus - a collection of self-dialogues across music, movies and sports
josecannete/spanish-corpora
Unannotated Spanish 3 Billion Words Corpora
saidziani/Arabic-News-Article-Classification
Automatic categorization of documents consists of assigning a category to a text based on the information it contains. We follow several supervised machine learning approaches.
CanCLID/awesome-cantonese-nlp
A curated list of resources dedicated to Natural Language Processing (NLP) of Cantonese | 粵語 NLP
kgjerde/corporaexplorer
An R package for dynamic exploration of text collections
czcorpus/kontext
An advanced, extensible web front-end for the Manatee-open corpus search engine
jacklanda/CCAE
The Official Repository for 👉 CCAE: A Corpus of Chinese-based Asian Englishes @ NLPCC 2023
hu-ner/huner
Named Entity Recognition for biomedical entities
M4t1ss/parallel-corpora-tools
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
uma-pi1/OPIEC
Reading the data from OPIEC - an Open Information Extraction corpus
JuliaText/CorpusLoaders.jl
A variety of loaders for various NLP corpora.
kili-technology/awesome-datasets
A comprehensive list of annotated training datasets classified by use case.
PlanTL-GOB-ES/lm-biomedical-clinical-es
Official source for Spanish pretrained biomedical and clinical language models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).
CyberZHG/wiki-dump-reader
Extract corpora from Wikipedia dumps
texttechnologylab/GerParCor
German Parliamentary Corpus (GerParCor)
digitallinguistics/data-format
The Data Format for Digital Linguistics (DaFoDiL)
dkalpakchi/awesome-swedish-nlp
A curated list of resources for natural language processing (NLP) in Swedish
Esukhia/Corpora
Repository for Tibetan corpora
gambolputty/textstelle
Textstelle is a collection of corpora for the creation of bots and other things that generate text 🤖
dterg/biomedical_corpora
Table compiling the list of biomedically-related corpora available for named entity recognition (some are also suitable for association detection). The first version was published as part of the paper: Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152. If you would like to add other (or your own) corpora, please submit a pull request and I'll happily approve it.
EdwardSeley/lyrics-corpora
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and Billboard charts
WladimirSidorenko/PotTS
The Potsdam Twitter Sentiment Corpus
esantus/EVALution
Dataset containing Semantic Relations and Metadata, for Training and Evaluating Distributional Semantic Models in English and Mandarin Chinese
NetherlandsForensicInstitute/demeuk
Demeuk is a simple tool to clean up corpora (like dictionaries) or any dataset containing plain text strings.
korenyoni/opus-api
OPUS (opus.nlpl.eu) Python3 API
filipefilardi/text-mining
Generic corpus-cleaning script made with the tm package