tokenizer

There are 1,313 repositories under the tokenizer topic.

  • theseer/tokenizer

    A small library for converting tokenized PHP source code into XML (and potentially other formats)

    Language: PHP · Stars: 5.2k
  • Chevrotain/chevrotain

    Parser Building Toolkit for JavaScript

    Language: TypeScript · Stars: 2.7k
  • tiktokenizer

    dqbd/tiktokenizer

    Online playground for OpenAI tokenizers

    Language: TypeScript · Stars: 1.3k
  • roshan-research/hazm

    Persian NLP Toolkit

    Language: Python · Stars: 1.3k
  • natasha

    natasha/natasha

    Solves basic Russian NLP tasks, API for lower level Natasha projects

    Language: Python · Stars: 1.3k
  • lovit/soynlp

    A Python library for Korean natural language processing. It provides word extraction, tokenization, part-of-speech tagging, and preprocessing.

    Language: Python · Stars: 978
  • ikawaha/kagome

    Self-contained Japanese Morphological Analyzer written in pure Go

    Language: Go · Stars: 897
  • no-context/moo

    Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

    Language: JavaScript · Stars: 864
  • BLKSerene/Wordless

    An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation

    Language: Python · Stars: 734
  • wangfenjin/simple

    A SQLite3 FTS5 full-text search extension and tokenizer that supports Chinese and Pinyin

    Language: C++ · Stars: 731
  • mathewsanders/Mustard

    🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.

    Language: Swift · Stars: 687
  • cbaziotis/ekphrasis

    Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets from Twitter).

    Language: Python · Stars: 671
  • risesoft-y9/Data-Labeling

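    Data Labeling is a tool dedicated to processing and annotating text data. Through a simplified, fast annotation workflow and dynamic algorithmic feedback, it lets users label keywords quickly while the algorithm steadily reduces the cost and time of manual annotation. Manual annotation first builds a foundation, automatic annotation then feeds back into manual annotation, and manual annotation finally corrects deviations, which greatly improves annotation accuracy and efficiency. Data Labeling depends on an open-source digital foundation platform for managing personnel and positions.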

    Language: Java · Stars: 667
  • open-korean-text/open-korean-text

    Open Korean Text Processor - An Open-source Korean Text Processor

    Language: Scala · Stars: 644
  • SmoothNLP

    smoothnlp/SmoothNLP

    An NLP toolset focused on explainable NLP techniques and explainable inference

    Language: Java · Stars: 623
  • niieani/gpt-tokenizer

    The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT models (o1, o3, o4, gpt-4o, gpt-4, etc.). Port of OpenAI's tiktoken with additional features.

    Language: TypeScript · Stars: 618
  • jflex-de/jflex

    The fast scanner generator for Java™ with full Unicode support

    Language: Java · Stars: 616
  • therealoliver/Deepdive-llama3-from-scratch

    Implement Llama 3 inference step by step: grasp the core concepts, work through the derivations, and write the code.

    Language: Jupyter Notebook · Stars: 609
  • alasdairforsythe/tokenmonster

    Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript

    Language: Go · Stars: 600
  • glayzzle/php-parser

    🌿 NodeJS PHP Parser - extract AST or tokens

    Language: JavaScript · Stars: 551
  • lindera/lindera

    A multilingual morphological analysis library.

    Language: Rust · Stars: 532
  • lydell/js-tokens

    Tiny JavaScript tokenizer.

    Language: JavaScript · Stars: 524
  • lionsoul2014/friso

    High-performance Chinese tokenizer based on the MMSEG algorithm, written in ANSI C, with support for both GBK and UTF-8 charsets. It is fully modular and can be easily embedded in other programs such as MySQL, PostgreSQL, and PHP.

    Language: C · Stars: 504
  • hplt-project/sacremoses

    Python port of Moses tokenizer, truecaser and normalizer

    Language: Python · Stars: 495
  • vscode-blockman

    leodevbro/vscode-blockman

    VSCode extension to highlight nested code blocks

    Language: TypeScript · Stars: 487
  • CogComp/cogcomp-nlp

    CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.

    Language: Java · Stars: 478
  • fugashi

    polm/fugashi

    A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

    Language: C++ · Stars: 475
  • neurosnap/sentences

    A multilingual command line sentence tokenizer in Golang

    Language: Go · Stars: 456
  • timtadh/lexmachine

    Lex machinery for Go.

    Language: Go · Stars: 412
  • nagisa

    taishi-i/nagisa

    A Japanese tokenizer based on recurrent neural networks

    Language: Python · Stars: 404
  • ku-nlp/jumanpp

    Juman++ (a Morphological Analyzer Toolkit)

    Language: C++ · Stars: 397
  • daac-tools/vibrato

    🎤 vibrato: Viterbi-based accelerated tokenizer

    Language: Rust · Stars: 374
  • belladoreai/llama-tokenizer-js

    JS tokenizer for LLaMA 1 and 2

    Language: JavaScript · Stars: 351
  • zurawiki/tiktoken-rs

    Ready-made tokenizer library for working with GPT and tiktoken

    Language: Rust · Stars: 337
  • guillaume-be/rust-tokenizers

    Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models

    Language: Rust · Stars: 326
  • OpenNMT/Tokenizer

    Fast and customizable text tokenization library with BPE and SentencePiece support

    Language: C++ · Stars: 316