corpus-tools
There are 97 repositories under the corpus-tools topic.
adbar/trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
BLKSerene/Wordless
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
flairNLP/fundus
A very simple news crawler with a funny name
bitextor/bitextor
Bitextor generates translation memories from multilingual websites
grammarly/ua-gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
ynop/audiomate
Python library for handling audio datasets.
Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
NathanDuran/Switchboard-Corpus
Utilities for Processing the Switchboard Dialogue Act Corpus
czcorpus/kontext
An advanced, extensible web front-end for the Manatee-open corpus search engine
koskenni/beta
An open source reimplementation of Benny Brodda's BETA in Python
lennes/spect
SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/
nickduran/align-linguistic-alignment
Python library for extracting quantitative, reproducible metrics of multi-level alignment between speakers in naturalistic language corpora.
johentsch/ms3
A parser for annotated MuseScore 3 files.
LanguageMachines/PICCL
A set of workflows for corpus building through OCR, post-correction and normalisation
silenterus/deepspeech-cleaner
Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework
M4t1ss/parallel-corpora-tools
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
uma-pi1/OPIEC
Reading the data from OPIEC - an Open Information Extraction corpus
johnwdubois/rezonator
Rezonator: Dynamics of human engagement
NathanDuran/MRDA-Corpus
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
praaline/Praaline
Praaline is an open-source system to manage, annotate, visualise and analyse spoken language corpora
mshakirDr/MFTE
MFTE (Multi-Feature Tagger of English) Python is a Python version of Le Foll's MFTE, which was originally written in Perl. It extends the original with semantic tags from Biber (2006) and Biber et al. (1999), along with other specific tags.
carlfm01/librivox-tools
Collector and speech cutter for librivox audiobooks
timarkh/tsakorpus
Yet another search platform for linguistic corpora.
jaytimm/corpuslingr
A library of functions enabling complex corpus search in context (KWIC), search aggregation, bag-of-words building & keyphrase extraction.
liao961120/concordancer
Searching in-memory corpus with Corpus Query Language (CQL)
EdwardSeley/lyrics-corpora
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts
infraling/atomic
Software for multi-level annotation of linguistic corpora
wiragotama/TIARA-annotationTool
An Interactive Tool for Annotating Discourse Structure and Text Improvement
jonathandunn/corpus_similarity
Measure the similarity of text corpora for 74 languages
jonathandunn/common_crawl_corpus
Scripts for building a geo-located web corpus using Common Crawl data
Linguista/CQPweb-Instabox
Script that sets up and configures an entire CQPweb server installation
mikahama/python-korp
Library for Python to use Korp API
elenlefoll/MultiFeatureTaggerEnglish
A Multi-Feature Tagger of English originally designed for multi-feature/multi-dimensional analysis (MDA) (Biber 1988; 1995) of situational variation in standard written and spoken English
NathanDuran/Maptask-Corpus
Utilities for Processing the HCRC Map Task Corpus