tokenizer

There are 1518 repositories under the tokenizer topic.

theseer/tokenizer
A small library for converting tokenized PHP source code into XML (and potentially other formats)
Language: PHP 5.2k 6 823
Chevrotain/chevrotain
Parser Building Toolkit for JavaScript
Language: TypeScript 2.7k 28 816217
dqbd/tiktokenizer
Online playground for OpenAI tokenizers
Language: TypeScript 1.4k 11 22158
roshan-research/hazm
Persian NLP Toolkit
Language: Python 1.3k 23 245202
natasha/natasha
Solves basic Russian NLP tasks, API for lower level Natasha projects
Language: Python 1.3k 56 94110
lovit/soynlp
A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.
Language: Python 981 40 119183
ikawaha/kagome
Self-contained Japanese Morphological Analyzer written in pure Go
Language: Go 910 22 3756
no-context/moo
Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
Language: JavaScript 871 10 10072
wangfenjin/simple
A SQLite3 FTS5 full-text search extension (tokenizer) that supports Chinese and Pinyin
Language: C++ 747 5 116105
BLKSerene/Wordless
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Language: Python 742 27 2995
risesoft-y9/Data-Labeling
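Data Labeling is a tool built specifically for processing and annotating text data. With a streamlined annotation workflow and dynamic algorithmic feedback, it lets users label keywords quickly and uses algorithms to keep reducing the cost and time of manual annotation. The process starts with manual annotation to build a baseline, automatic annotation then feeds back into manual annotation, and manual annotation finally corrects any errors, which substantially improves annotation accuracy and efficiency. Data Labeling relies on an open-source digital foundation platform for personnel and position management.
Language: Java 690 67 0102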
mathewsanders/Mustard
🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.
Language: Swift 687 12 118
cbaziotis/ekphrasis
Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).
Language: Python 672 16 2895
niieani/gpt-tokenizer
The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT models (gpt-5, gpt-o*, gpt-4o, etc.). Port of OpenAI's tiktoken with additional features.
Language: TypeScript 669 3 4949
open-korean-text/open-korean-text
Open Korean Text Processor - An Open-source Korean Text Processor
Language: Scala 649 50 4697
jflex-de/jflex
The fast scanner generator for Java™ with full Unicode support
Language: Java 624 21 335120
smoothnlp/SmoothNLP
An NLP toolset with a focus on explainable inference
Language: Java 621 20 34112
therealoliver/Deepdive-llama3-from-scratch
Implement Llama 3 inference step by step: grasp the core concepts, master the derivations, and write the code.
Language: Jupyter Notebook 609 4 050
alasdairforsythe/tokenmonster
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript
Language: Go 604 11 2919
glayzzle/php-parser
🌿 NodeJS PHP Parser - extract AST or tokens
Language: JavaScript 554 18 29372
lindera/lindera
A multilingual morphological analysis library.
Language: Rust 539 6 10649
lydell/js-tokens
Tiny JavaScript tokenizer.
Language: JavaScript 530 6 1335
lionsoul2014/friso
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Fully modular and easy to embed in other programs such as MySQL, PostgreSQL, and PHP.
Language: C 504 30 2094
hplt-project/sacremoses
Python port of Moses tokenizer, truecaser and normalizer
Language: Python 495 11 8260
leodevbro/vscode-blockman
VSCode extension to highlight nested code blocks
Language: TypeScript 490 7 14920
polm/fugashi
A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
Language: C++ 485 6 8539
CogComp/cogcomp-nlp
CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
Language: Java 479 59 385144
NLPOptimize/flash-tokenizer
Efficient and optimized tokenizer engine for LLM inference serving
Language: C++ 468 7
neurosnap/sentences
A multilingual command line sentence tokenizer in Golang
Language: Go 458 11 1740
FoundationVision/UniTok
[NeurIPS 2025 Spotlight] A Unified Tokenizer for Visual Generation and Understanding
Language: Python 441 8 2410
timtadh/lexmachine
Lex machinery for Go.
Language: Go 410 9 2328
taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
Language: Python 408 10 3023
ku-nlp/jumanpp
Juman++ (a Morphological Analyzer Toolkit)
Language: C++ 399 31 11146
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
Language: Rust 384 7 2221
belladoreai/llama-tokenizer-js
JS tokenizer for LLaMA 1 and 2
Language: JavaScript 360 4 1124
zurawiki/tiktoken-rs
Ready-made tokenizer library for working with GPT and tiktoken
Language: Rust 346 6 2565
