tokenizer
There are 1086 repositories under the tokenizer topic.
theseer/tokenizer
A small library for converting tokenized PHP source code into XML (and potentially other formats)
Chevrotain/chevrotain
Parser Building Toolkit for JavaScript
natasha/natasha
Solves basic Russian NLP tasks, API for lower level Natasha projects
roshan-research/hazm
Persian NLP Toolkit
lovit/soynlp
A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.
no-context/moo
Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
ikawaha/kagome
Self-contained Japanese Morphological Analyzer written in pure Go
mathewsanders/Mustard
🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.
BLKSerene/Wordless
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
cbaziotis/ekphrasis
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, 330 million English tweets).
smoothnlp/SmoothNLP
An NLP Toolset With A Focus on Explainable Inference
dqbd/tiktokenizer
Online playground for OpenAI tokenizers
open-korean-text/open-korean-text
Open Korean Text Processor - An Open-source Korean Text Processor
jflex-de/jflex
The fast scanner generator for Java™ with full Unicode support
glayzzle/php-parser
🌿 NodeJS PHP Parser - extract AST or tokens
alasdairforsythe/tokenmonster
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript
wangfenjin/simple
A SQLite3 fts5 full-text search extension and tokenizer that supports Chinese and Pinyin
lydell/js-tokens
Tiny JavaScript tokenizer.
hplt-project/sacremoses
Python port of Moses tokenizer, truecaser and normalizer
lionsoul2014/friso
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and implemented in ANSI C. Fully modular and easy to embed in other programs such as MySQL, PostgreSQL, and PHP.
CogComp/cogcomp-nlp
CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
neurosnap/sentences
A multilingual command-line sentence tokenizer in Golang
timtadh/lexmachine
Lex machinery for Go.
niieani/gpt-tokenizer
JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4. Port of OpenAI's tiktoken with additional features.
taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
polm/fugashi
A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
ku-nlp/jumanpp
Juman++ (a Morphological Analyzer Toolkit)
lindera-morphology/lindera
A multilingual morphological analysis library.
leodevbro/vscode-blockman
VSCode extension to highlight nested code blocks
belladoreai/llama-tokenizer-js
JS tokenizer for LLaMA 1 and 2
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
bitextor/bitextor
Bitextor generates translation memories from multilingual websites
artitw/text2text
Text2Text: Crosslingual NLP/G toolkit
guillaume-be/rust-tokenizers
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer