tokenization
There are 813 repositories under the tokenization topic.
explosion/spaCy
💫 Industrial-strength Natural Language Processing (NLP) in Python
lunasec-io/lunasec
LunaSec - Dependency Security Scanner that automatically notifies you about vulnerabilities like Log4Shell or node-ipc in your Pull Requests and Builds. Protect yourself in 30 seconds with the LunaTrace GitHub App: https://github.com/marketplace/lunatrace-by-lunasec/
securitybunker/databunker
Secure SDK/vault for personal records/PII built to comply with GDPR
RavenProject/Ravencoin
Ravencoin Core integration/staging tree
VKCOM/YouTokenToMe
Unsupervised text tokenizer focused on computational efficiency
explosion/spacy-streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
AmoDinho/datacamp-python-data-science-track
All the slides, accompanying code, and exercises are stored in this repo. 🎈
nlp-uoregon/trankit
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
cbaziotis/ekphrasis
Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, 330 million English tweets).
adobe/NLP-Cube
Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
alasdairforsythe/tokenmonster
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
yooper/php-text-analysis
PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language
macmade/ClangKit
ClangKit provides an Objective-C frontend to LibClang. Source tokenization, diagnostics and fix-its are actually implemented.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
WorksApplications/sudachi.rs
Sudachi in Rust 🦀 and the new generation of SudachiPy
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
CodeChain-io/codechain
CodeChain's official implementation in Rust.
natasha/razdel
Rule-based token and sentence segmentation for the Russian language
SmartTokenLabs/TokenScript
TokenScript schema, specs and paper
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
AgentOps-AI/tokencost
Easy token price estimates for LLMs
janlukasschroeder/nlp-cheat-sheet-python
NLP Cheat Sheet, Python, spaCy, LexNLP, NLTK, tokenization, stemming, sentence detection, named entity recognition
milaan9/Python_Natural_Language_Processing
This repository consists of a complete guide on natural language processing (NLP) in Python where we'll learn various techniques for implementing NLP including parsing & text processing and understand how to use NLP for text feature engineering.
zjukg/MyGO
[Paper][Preprint 2024] MyGO: Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion
rth/vtext
Simple NLP in Rust with Python bindings
THUDM/icetk
A unified tokenization tool for Images, Chinese and English.
adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
GlitchedPolygons/l8w8jwt
Minimal, OpenSSL-less and super lightweight JWT library written in C.
gautierdag/bpeasy
Fast bare-bones BPE for modern tokenizer training
lucidrains/charformer-pytorch
Implementation of the GBST block from the Charformer paper, in PyTorch
aymara/lima
The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.
mit-ccc/TweebankNLP
An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweebank-NER dataset
JuliaText/WordTokenizers.jl
High performance tokenizers for natural language processing and other related tasks
bminixhofer/zett
Code for Zero-Shot Tokenizer Transfer
dluc/openai-tools
A collection of tools for working with OpenAI
taishan1994/sentencepiece_chinese_bpe
Train a Chinese vocabulary with BPE in sentencepiece and use it in transformers.