tokenization
There are 1291 repositories under the tokenization topic.
sentencepiece_chinese_bpe
Train a Chinese vocabulary with BPE in sentencepiece and use it in transformers.
charformer-pytorch
Implementation of the GBST block from the Charformer paper, in PyTorch
lima
The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.
tkseem
Arabic Tokenization Library. It provides many tokenization algorithms.
TweebankNLP
[LREC 2022] An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweebank-NER dataset
openai-tools
A collection of tools for working with OpenAI
python-fpe
FPE - Format Preserving Encryption with FF3 in Python
WordTokenizers.jl
High performance tokenizers for natural language processing and other related tasks
dlp-dataflow-deidentification
Multi-cloud data tokenization solution using Dataflow and Cloud DLP
attacut
A Fast and Accurate Neural Thai Word Segmenter
wisesight-sentiment
Thai social media text sentiment dataset
nlpcloud-python
NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and more...
klmbr
klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs
Coursera-DeepLearning.AI-Natural-Language-Processing-Specialization
This repository contains solutions to the assignments of the Natural Language Processing Specialization from DeepLearning.AI on Coursera, taught by Younes Bensouda Mourri, Łukasz Kaiser, and Eddy Shyu
wongnai-corpus
Collection of Wongnai's datasets
Real-World-Assets-RWA
This repository comprises the theoretical and technical aspects of tokenisation of real world assets.
Vaaku2Vec
Language Modeling and Text Classification in Malayalam Language using ULMFiT
SeTok
Code for the ICLR 2025 paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM
uax29
A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split words, sentences and graphemes.
MBTI-Personality-Classifier
A model that uses your social media posts to predict your MBTI personality type.
h-net-dynamic-chunking
Implementation of the dynamic chunking mechanism in H-net by Hwang et al. of Carnegie Mellon
ling
Natural Language Processing Toolkit in Golang
CMTAT
Reference Solidity implementation of the CMTAT security token framework developed by CMTA to tokenize financial instruments.
vaulty
Tokenize, encrypt/decrypt, mask your data on the fly with Vaulty proxy
wink-tokenizer
Multilingual tokenizer that automatically tags each token with its type
spacy-server
🦜 Containerized HTTP API for industrial-strength NLP via spaCy and sense2vec
code_tokenize
Fast tokenization and structural analysis of any programming language
bert_tokenization_for_java
This is a Java version of the Chinese tokenization described in BERT.
contracts
On-chain RWA Tokenization Framework
unscanny
Painless string scanning.
cookbook
The Unicode Cookbook for Linguists
FastBertTokenizer
Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.
Natural-Language-Processing-Fundamentals
Use Python and NLTK to build out your own text classifiers and solve common NLP problems
cashtokens
A proposal to enable two new primitives on Bitcoin Cash: fungible tokens and non-fungible tokens.
nlpcloud-js
NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and much more...
xontrib-output-search
Get identifiers, paths, URLs and words from the previous command output and use them for the next command in xonsh shell.