tokenization

There are 1291 repositories under the tokenization topic.

sentencepiece_chinese_bpe
Train a Chinese vocabulary with BPE in sentencepiece and use it in transformers.
Language: Python | 119
charformer-pytorch
Implementation of the GBST block from the Charformer paper, in PyTorch
Language: Python | 118
lima
The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.
Language: C++ | 114
tkseem
Arabic Tokenization Library. It provides many tokenization algorithms.
Language: Jupyter Notebook | 107
TweebankNLP
[LREC 2022] An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweebank-NER dataset
Language: Python | 105
openai-tools
A collection of tools for working with OpenAI
Language: C# | 100
python-fpe
FPE - Format Preserving Encryption with FF3 in Python
Language: Python | 100
WordTokenizers.jl
High performance tokenizers for natural language processing and other related tasks
Language: Julia | 99
dlp-dataflow-deidentification
Multi-cloud data tokenization solution using Dataflow and Cloud DLP
Language: Java | 95
attacut
A Fast and Accurate Neural Thai Word Segmenter
Language: Python | 90
wisesight-sentiment
Thai social media text sentiment dataset
Language: Jupyter Notebook | 87
nlpcloud-python
NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and more...
Language: Python | 85
klmbr
klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs
Language: TeX | 80
Coursera-DeepLearning.AI-Natural-Language-Processing-Specialization
This repository contains solutions to the assignments of the Natural Language Processing Specialization from DeepLearning.AI on Coursera, taught by Younes Bensouda Mourri, Łukasz Kaiser, and Eddy Shyu
Language: Jupyter Notebook | 80
wongnai-corpus
Collection of Wongnai's datasets
77
Real-World-Assets-RWA
This repository comprises the theoretical and technical aspects of tokenisation of real world assets.
Language: Solidity | 76
Vaaku2Vec
Language Modeling and Text Classification in Malayalam Language using ULMFiT
Language: Jupyter Notebook | 73
SeTok
Code for the ICLR 2025 paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM
Language: Python | 72
uax29
A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split words, sentences and graphemes.
Language: Go | 68
MBTI-Personality-Classifier
A model that uses your social media posts to predict your MBTI personality type.
Language: Jupyter Notebook | 67
h-net-dynamic-chunking
Implementation of the dynamic chunking mechanism in H-net by Hwang et al. of Carnegie Mellon
Language: Python | 64
ling
Natural Language Processing Toolkit in Golang
Language: Go | 64
CMTAT
Reference Solidity implementation of the CMTAT security token framework developed by CMTA to tokenize financial instruments.
Language: JavaScript | 63
vaulty
Tokenize, encrypt/decrypt, mask your data on the fly with Vaulty proxy
Language: Go | 62
wink-tokenizer
Multilingual tokenizer that automatically tags each token with its type
Language: JavaScript | 62
spacy-server
🦜 Containerized HTTP API for industrial-strength NLP via spaCy and sense2vec
Language: Python | 60
code_tokenize
Fast tokenization and structural analysis of any programming language
Language: Python | 59
bert_tokenization_for_java
A Java version of the Chinese tokenization described in BERT.
Language: Java | 59
contracts
On-chain RWA Tokenization Framework
Language: Solidity | 56
unscanny
Painless string scanning.
Language: Rust | 56
cookbook
The Unicode Cookbook for Linguists
Language: TeX | 56
FastBertTokenizer
Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.
Language: C# | 50
Natural-Language-Processing-Fundamentals
Use Python and NLTK to build out your own text classifiers and solve common NLP problems
Language: Jupyter Notebook | 50
cashtokens
A proposal to enable two new primitives on Bitcoin Cash: fungible tokens and non-fungible tokens.
48
nlpcloud-js
NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and much more...
Language: JavaScript | 48
xontrib-output-search
Get identifiers, paths, URLs and words from the previous command output and use them for the next command in xonsh shell.
Language: Python | 47