text-processing
There are 1964 repositories under the text-processing topic.
learnbyexample/Command-line-text-processing
⚡ From finding text to search and replace, from sorting to beautifying text and more 🎨
pymupdf/PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
google/diff-match-patch
Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.
chmln/sd
Intuitive find & replace CLI (sed alternative)
fastnlp/fastNLP
fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
pyparsing/pyparsing
Python library for creating PEG parsers
kk7nc/Text_Classification
Text Classification Algorithms: A Survey
roshan-research/hazm
Persian NLP Toolkit
pemistahl/lingua-go
The most accurate natural language detection library for Go, suitable for short text and mixed-language text
helix-editor/nucleo
A fast and convenient fuzzy matcher library for Rust
birchb1024/frangipanni
Program to convert lines of text into a tree structure.
BurntSushi/aho-corasick
A fast implementation of Aho-Corasick in Rust.
PyThaiNLP/pythainlp
Thai natural language processing in Python
sstadick/hck
A sharp cut(1) clone.
ChenghaoMou/text-dedup
All-in-one text de-duplication
derek73/python-nameparser
A simple Python module for parsing human names into their individual components
cbaziotis/ekphrasis
Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).
abadojack/whatlanggo
Natural language detection library for Go
wenet-e2e/WeTextProcessing
Text Normalization & Inverse Text Normalization
open-korean-text/open-korean-text
Open Korean Text Processor - An Open-source Korean Text Processor
lukaszliniewicz/Pandrator
Turn PDFs and EPUBs into audiobooks, subtitles or videos into dubbed videos (including translation), and more. For free. Pandrator uses local models, notably XTTS, including voice-cloning (instant, RVC-enhanced, XTTS fine-tuning) and LLM processing. It aspires to be a user-friendly app with a GUI, an installer and all-in-one packages.
Puchaczov/Musoq
SQL Syntax without any database
proycon/pynlpl
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
linuxscout/pyarabic
pyarabic
haven-jeon/PyKoSpacing
Automatic Korean word spacing with Python
andrewbihl/bsed
Simple SQL-like syntax on top of Perl text processing.
airbnb/artificial-adversary
🗣️ Tool to generate adversarial text examples and test machine learning models against them
BurntSushi/regex-automata
A low level regular expression library that uses deterministic finite automata.
ikegami-yukino/jaconv
Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku, and Zenkaku
gagolews/stringi
Fast and portable character string processing in R (with the Unicode ICU)
textpipe/textpipe
Textpipe: clean and extract metadata from text
RandyPen/TextCluster
Preprocessing module for short-text clustering
himkt/konoha
🌿 An easy-to-use Japanese text processing tool that makes it possible to switch tokenizers with small code changes.
open-i18n/rust-unic
UNIC: Unicode and Internationalization Crates for Rust
Goldziher/html-to-markdown
HTML to markdown converter
daac-tools/daachorse
🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.