sentencepiece
There are 38 repositories under the sentencepiece topic.
niedev/RTranslator
Open source real-time translation app for Android that runs locally
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
himkt/konoha
🌿 An easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code.
taishan1994/sentencepiece_chinese_bpe
Train a Chinese vocabulary with BPE in sentencepiece and use it in transformers.
dhpollack/huggingface_libtorch
Minimal example of using a traced huggingface transformers model with libtorch
nguyenvulebinh/vietnamese-roberta
A Robustly Optimized BERT Pretraining Approach for Vietnamese
bnosac/sentencepiece
R package for Byte Pair Encoding / Unigram modelling based on Sentencepiece
eliben/go-sentencepiece
Go implementation of the SentencePiece tokenizer
Andras7/gpt2-pytorch
Extremely simple and understandable GPT2 implementation with minor tweaks
danieldk/sentencepiece
Rust binding for the sentencepiece library
stephantul/piecelearn
Learning BPE embeddings by first learning a segmentation model and then training word2vec
sctg-development/sentencepiece-js
sentencepiece port to webassembly with browser compatibility
to-aoki/my-pytorch-bert
BERT implementation in PyTorch
Masao-Taketani/japanese_text_classification
An investigation of various DNN text classifiers, including MLP, CNN, RNN, and BERT approaches.
Systemcluster/kitoken
Fast and versatile tokenizer for language models with BPE, Unigram and WordPiece tokenization. Compatible with SentencePiece, Tokenizers, Tiktoken and more.
wang1ang/SentencePieceWrapper
sentencepiece C# wrapper
leliuga/datrin
dataset, train, inference
smafjal/bengali_tokenizer
Bengali language Tokenizer (SentencePiece)
kgarg8/NMT-RNN
NMT with RNN Models: (1) in Vanilla style, (2) with Sentencepiece, (3) using Pre-trained models from FairSeq
twinnydotdev/toxe
SentencePiece tokenizer for cross-encoders
arusl/anlp_nlp2021_d3-1
This repository contains code for the experiments in "An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification", presented at https://www.anlp.jp/nlp2021/. Authors: Andre Rusli and Makoto Shishido (Tokyo Denki University).
Doarakko/vector-text-similarity-search
Search for similar documents using Elasticsearch and BERT.
Sid911/sentencepiece_dart
Sentencepiece Dart is a wrapper for Google's Sentencepiece C++ library modified
burcgokden/Sentencepiece-Tokenizer-Wrapper-for-PLDR-LLM
A framework for building Sentencepiece tokenizer from a dataset
kmaurinjones/WikiGameBot
Automated WikiGame-playing 'bot'. Achieved via SentenceTransformer Word Embeddings.
lingvanex-mt/models
This repository contains pre-trained translation models, including Kurdish, Samoan, Xhosa, Lao, Corsican, Cebuano, Galician, Yiddish, Swahili, and Yoruba.
ReshiAdavan/Thoth
An Industry Standard Tokenizer, purposed for large-scale language models like OpenAI's GPT Series.
FloweryK/Sentencepiece-Pretrained-Models
pretrained models and a training code for sentencepiece
jayden5744/NMT_Korean_To_English
A natural language processing model study for translating Korean into English.
rafael-vasconcellos/sugoi-v4-space
A huggingface space for Sugoi V4
Sid911/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
Systemcluster/sentencepiece-model
SentencePiece model parser generated from the SentencePiece protobuf definition.
ZJaume/escape-unk
Escape unknown symbols in SentencePiece vocabularies