corpus-linguistics

There are 326 repositories under the corpus-linguistics topic.

BLKSerene/Wordless
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Language: Python 705 27 2992
JiashuWu/Books
My book list
547 15 0354
louisowen6/NLP_bahasa_resources
A Curated List of Datasets and Usable Library Resources for NLP in Bahasa Indonesia
498 8 1132
adbar/German-NLP
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
456 45 567
kmkurn/id-nlp-resource
A list of Indonesian NLP resources.
278 15 148
OpenCorpora/opencorpora
A web-based engine for creating and annotating textual corpora
Language: PHP 241 28 87223
kirralabs/indonesian-NLP-resources
Data resources for Indonesian NLP
230 10 049
oroszgy/awesome-hungarian-nlp
A curated list of NLP resources for Hungarian
229 21 1418
google/corpuscrawler
Crawler for linguistic corpora
Language: Python 197 21 3155
oscar-project/ungoliant
The pipeline for the OSCAR corpus
Language: Rust 163 2 4314
scriptin/kanji-frequency
Kanji usage frequency data collected from various sources
Language: Astro 134 5 319
OliverHellwig/sanskrit
Data for the quantitative study of (Vedic) Sanskrit
Language: Python 113 22 1144
oscar-project/goclassy
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
Language: Go 86 9 26
islamAndAi/QURAN-NLP
Quran, Hadith, Translations, Tafaseer, Corpus Linguistics. Everything for NLP
Language: Jupyter Notebook 68 4 713
czcorpus/kontext
An advanced, extensible web front-end for the Manatee-open corpus search engine
Language: TypeScript 63 11 1.3k22
natasha/nerus
Large silver-standard Russian corpus with NER, morphology, and syntax markup
Language: Python 63 7 410
JonathanReeve/corpus-db
A textual corpus database for the digital humanities.
Language: Jupyter Notebook 60 8 329
lennes/spect
SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/
Language: HTML 57 5 211
LanguageMachines/PICCL
A set of workflows for corpus building through OCR, post-correction and normalisation
Language: Python 48 8 646
STRZGR/Natural-Language-Processing-with-Python-Analyzing-Text-with-the-Natural-Language-Toolkit
My solutions to selected exercises from "Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit" by Steven Bird, Ewan Klein, and Edward Loper.
Language: Jupyter Notebook 48 4 035
kbatsuren/CogNet
CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates
45 8 210
MarsPanther/Amharic-English-Machine-Translation-Corpus
Amharic-English machine translation corpus prepared through website crawling and custom preprocessing.
Language: Python 40 6 025
jaaack-wang/Chinese-Synonyms
A large, high-quality corpus of Chinese synonyms.
Language: Jupyter Notebook 39 1 04
johnwdubois/rezonator
Rezonator: Dynamics of human engagement
Language: Yacc 33 4 1.5k2
interrogator/conll-df
CONLL-U to Pandas DataFrame
Language: Python 31 1 29
praaline/Praaline
Praaline is an open-source system to manage, annotate, visualise and analyse spoken language corpora
Language: C 27 2 15
digitallinguistics/data-format
The Data Format for Digital Linguistics (DaFoDiL)
Language: JavaScript 22 5 3120
notesjor/corpusexplorer2.0
Corpus linguistics has never been so easy...
Language: C# 22 5 73
mshakirDr/MFTE
MFTE (Multi Feature Tagger of English) Python is a Python version of Le Foll's MFTE, which was written in Perl. It is extended to include semantic tags from Biber (2006) and Biber et al. (1999), along with other specific tags.
Language: HTML 21 1 32
KMCS-NII/AASC
AASC: ACL Anthology Sentence Corpus
Language: Perl 20 9 02
PyThaiNLP/thai-law
Thai Law Dataset (Act of Parliament)
Language: Jupyter Notebook 20 8 15
timarkh/tsakorpus
Yet another search platform for linguistic corpora.
Language: Python 20 3 1715
undertheseanlp/corpus.viwiki
Vietnamese Wikipedia Corpus
Language: Python 19 2 08
dterg/biomedical_corpora
Table compiling the list of biomedical corpora available for named entity recognition (some of which are also suitable for association detection). The first version was published as part of the paper: Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152. If you would like to add other (or your own) corpora, please submit a pull request and I'll happily approve it.
18 6 04
EdwardSeley/lyrics-corpora
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts
Language: Python 18 4 11
cisnlp/GlotCC
GlotCC Dataset and Pipeline -- NeurIPS 2024
Language: Jupyter Notebook 17 9 00
