sentencepiece

There are 50 repositories under the sentencepiece topic.

niedev/RTranslator
Open source real-time translation app for Android that runs locally
Language: C++ 9.2k 67 119821
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Language: C++ 316 20 8175
himkt/konoha
🌿 An easy-to-use Japanese text processing tool that makes it possible to switch tokenizers with small code changes.
Language: Python 254 6 4027
taishan1994/sentencepiece_chinese_bpe
Trains a Chinese vocabulary with BPE in sentencepiece and uses it in transformers.
Language: Python 119 1 418
lingvanex-mt/models
Free and open source pre-trained translation models, including Kurdish, Samoan, Xhosa, Lao, Corsican, Cebuano, Galician, Russian, Belarusian and Yoruba.
87 1 00
dhpollack/huggingface_libtorch
Minimal example of using a traced huggingface transformers model with libtorch
Language: C++ 36 1 113
eliben/go-sentencepiece
Go implementation of the SentencePiece tokenizer
Language: Go 34 1 62
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.
Language: Rust 33 1 00
nguyenvulebinh/vietnamese-roberta
A Robustly Optimized BERT Pretraining Approach for Vietnamese
Language: Python 32 3 16
bnosac/sentencepiece
R package for Byte Pair Encoding / Unigram modelling based on Sentencepiece
Language: C++ 26 5 76
danieldk/sentencepiece
Rust binding for the sentencepiece library
Language: Rust 22 2 86
Andras7/gpt2-pytorch
Extremely simple and understandable GPT2 implementation with minor tweaks
Language: Python 21 2 13
stephantul/piecelearn
Learning BPE embeddings by first learning a segmentation model and then training word2vec
Language: Python 19 0 31
jkrukowski/swift-sentencepiece
Use SentencePiece in Swift for tokenization and detokenization.
Language: Swift 15 1 02
sctg-development/sentencepiece-js
sentencepiece port to webassembly with browser compatibility
Language: TypeScript 13 1 01
to-aoki/my-pytorch-bert
BERT implementation of PyTorch
Language: Python 11 1 14
Masao-Taketani/japanese_text_classification
Investigates various DNN text classifiers, including MLP, CNN, RNN, and BERT approaches.
Language: Jupyter Notebook 9 0 03
NishantkSingh0/Generative-Language-Model
A decoder-only model trained on the large BookCorpus dataset. First time!
Language: Jupyter Notebook 7 1 02
wang1ang/SentencePieceWrapper
sentencepiece C# wrapper
Language: C++ 6 2 21
leliuga/datrin
dataset, train, inference
Language: Python 4 1 00
smafjal/bengali_tokenizer
Bengali language Tokenizer (SentencePiece)
Language: Python 4 1 01
kgarg8/NMT-RNN
NMT with RNN Models: (1) in Vanilla style, (2) with Sentencepiece, (3) using Pre-trained models from FairSeq
Language: Python 2 1 00
twinnydotdev/toxe
SentencePiece tokenizer for cross-encoders
Language: JavaScript 2 1 01
arusl/anlp_nlp2021_d3-1
This repository contains code related to the experiments in "An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification", presented at https://www.anlp.jp/nlp2021/. Authors: Andre Rusli and Makoto Shishido (Tokyo Denki University).
Language: Jupyter Notebook 1 0 01
Daniel-Heo/NemoTokenizer
Fast wordpiece, sentencepiece tokenizer by Trie, OpenMP, SIMD, MemoryPool
Language: C++ 1
Doarakko/vector-text-similarity-search
Search for similar documents using Elasticsearch and BERT.
Language: Jupyter Notebook 1 2 21
evan176/sentencepiecego
Language: Go 1 1 22
paul-souvik3/ner-multilingual-app
Multilingual NER app using XLM-RoBERTa and Gradio.
Language: Python 1
sftblw/spm_jamo_tsv
Language: JavaScript 1 0 0
Sid911/sentencepiece_dart
Sentencepiece Dart is a wrapper for Google's Sentencepiece C++ library modified
Language: C++ 1 1 02
Abhigyan126/SentencePiece-Tokenisation
A Python and Rust implementation of SentencePiece (a language-independent subword tokeniser and de-tokeniser developed by Google)
Language: Rust
burcgokden/SentencePiece-Tokenizer-Wrapper-for-PLDR-LLM-KVG-cache
SentencePiece Tokenizer Wrapper implementation for PLDR-LLM with KV cache and G-cache
Language: Python 1 0
lashebir/de-en-translator
German to English translator using a Seq2Seq transformer
Language: Jupyter Notebook
mahdertesf/SentencePiece-and-Byte-Pair-Encoding-BPE-Implementation
This repository provides a hands-on exploration of SentencePiece tokenization and Byte-Pair Encoding (BPE). The code demonstrates data preprocessing steps such as NFKC normalization and lossless tokenization, followed by a practical implementation of the BPE algorithm from scratch.
Language: Jupyter Notebook
mddanish00/temp-sentencepiece-build
Temporary repo for building a 3.13 wheel for sentencepiece until a new version comes out.
NikitaGoldashevsky/NMT-Transformer
A TensorFlow-based Transformer model for English-to-Russian neural machine translation. Features subword tokenization with SentencePiece, hyperparameter optimization via genetic algorithm, and a Flask web interface for real-time translations.
Language: Python
