corpus-linguistics
There are 326 repositories under the corpus-linguistics topic.
BLKSerene/Wordless
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
JiashuWu/Books
My book list
louisowen6/NLP_bahasa_resources
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
adbar/German-NLP
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
kmkurn/id-nlp-resource
A list of Indonesian NLP resources.
OpenCorpora/opencorpora
A web-based engine for creating and annotating textual corpora
kirralabs/indonesian-NLP-resources
Data resources for Indonesian-language NLP
oroszgy/awesome-hungarian-nlp
A curated list of NLP resources for Hungarian
google/corpuscrawler
Crawler for linguistic corpora
oscar-project/ungoliant
The pipeline for the OSCAR corpus
scriptin/kanji-frequency
Kanji usage frequency data collected from various sources
OliverHellwig/sanskrit
Data for the quantitative study of (Vedic) Sanskrit
oscar-project/goclassy
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
islamAndAi/QURAN-NLP
Quran, Hadith, Translations, Tafaseer, Corpus Linguistics. Everything for NLP
czcorpus/kontext
An advanced, extensible web front-end for the Manatee-open corpus search engine
natasha/nerus
Large silver-standard Russian corpus with NER, morphology and syntax markup
JonathanReeve/corpus-db
A textual corpus database for the digital humanities.
lennes/spect
SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/
LanguageMachines/PICCL
A set of workflows for corpus building through OCR, post-correction and normalisation
STRZGR/Natural-Language-Processing-with-Python-Analyzing-Text-with-the-Natural-Language-Toolkit
My solutions to selected exercises to "Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit" by Steven Bird, Ewan Klein, and Edward Loper.
kbatsuren/CogNet
CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates
MarsPanther/Amharic-English-Machine-Translation-Corpus
Amharic English Machine Translation Corpus prepared through website crawling and custom preprocessing.
jaaack-wang/Chinese-Synonyms
A large, high-quality corpus of Chinese synonyms.
johnwdubois/rezonator
Rezonator: Dynamics of human engagement
interrogator/conll-df
CONLL-U to Pandas DataFrame
praaline/Praaline
Praaline is an open-source system to manage, annotate, visualise and analyse spoken language corpora
digitallinguistics/data-format
The Data Format for Digital Linguistics (DaFoDiL)
notesjor/corpusexplorer2.0
Corpus linguistics has never been so easy...
mshakirDr/MFTE
MFTE (Multi Feature Tagger of English) Python is a Python version of Le Foll's MFTE, which was written in Perl. It is extended to include semantic tags from Biber (2006) and Biber et al. (1999), along with other specific tags.
KMCS-NII/AASC
AASC: ACL Anthology Sentence Corpus
PyThaiNLP/thai-law
Thai Law Dataset (Act of Parliament)
timarkh/tsakorpus
Yet another search platform for linguistic corpora.
undertheseanlp/corpus.viwiki
Vietnamese Wikipedia Corpus
dterg/biomedical_corpora
Table compiling the list of biomedical corpora available for named entity recognition (some are also suitable for association detection). The first version was published as part of the paper: Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152 . If you would like to add other (or your) corpora, please submit a pull request and I'll happily approve it.
EdwardSeley/lyrics-corpora
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts
cisnlp/GlotCC
GlotCC Dataset and Pipeline -- NeurIPS 2024