sentencepiece
There are 50 repositories under the sentencepiece topic.
niedev/RTranslator
Open source real-time translation app for Android that runs locally
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
himkt/konoha
🌿 An easy-to-use Japanese text processing tool that makes it possible to switch tokenizers with small code changes.
taishan1994/sentencepiece_chinese_bpe
Train a Chinese vocabulary with BPE in sentencepiece and use it in transformers.
lingvanex-mt/models
Free and open source pre-trained translation models, including Kurdish, Samoan, Xhosa, Lao, Corsican, Cebuano, Galician, Russian, Belarusian and Yoruba.
dhpollack/huggingface_libtorch
Minimal example of using a traced huggingface transformers model with libtorch
eliben/go-sentencepiece
Go implementation of the SentencePiece tokenizer
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.
nguyenvulebinh/vietnamese-roberta
A Robustly Optimized BERT Pretraining Approach for Vietnamese
bnosac/sentencepiece
R package for Byte Pair Encoding / Unigram modelling based on Sentencepiece
danieldk/sentencepiece
Rust binding for the sentencepiece library
Andras7/gpt2-pytorch
Extremely simple and understandable GPT2 implementation with minor tweaks
stephantul/piecelearn
Learning BPE embeddings by first learning a segmentation model and then training word2vec
jkrukowski/swift-sentencepiece
Use SentencePiece in Swift for tokenization and detokenization.
sctg-development/sentencepiece-js
sentencepiece port to webassembly with browser compatibility
to-aoki/my-pytorch-bert
PyTorch implementation of BERT
Masao-Taketani/japanese_text_classification
Investigates various DNN text classifiers, including MLP, CNN, RNN, and BERT approaches.
NishantkSingh0/Generative-Language-Model
A decoder-only model trained on the large BookCorpus dataset. First time!
wang1ang/SentencePieceWrapper
sentencepiece C# wrapper
leliuga/datrin
dataset, train, inference
smafjal/bengali_tokenizer
Bengali language Tokenizer (SentencePiece)
kgarg8/NMT-RNN
NMT with RNN Models: (1) in Vanilla style, (2) with Sentencepiece, (3) using Pre-trained models from FairSeq
twinnydotdev/toxe
SentencePiece tokenizer for cross-encoders
arusl/anlp_nlp2021_d3-1
This repository contains code related to the experiments in "An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification" presented at https://www.anlp.jp/nlp2021/. Authors: Andre Rusli and Makoto Shishido (Tokyo Denki University).
Daniel-Heo/NemoTokenizer
Fast WordPiece and SentencePiece tokenizer built with a trie, OpenMP, SIMD, and a memory pool
Doarakko/vector-text-similarity-search
Search for similar documents using Elasticsearch and BERT.
paul-souvik3/ner-multilingual-app
Multilingual NER app using XLM-RoBERTa and Gradio.
Sid911/sentencepiece_dart
Sentencepiece Dart is a wrapper for Google's Sentencepiece C++ library modified
Abhigyan126/SentencePiece-Tokenisation
A Python and Rust implementation of SentencePiece (a language-independent subword tokeniser and de-tokeniser developed by Google)
burcgokden/SentencePiece-Tokenizer-Wrapper-for-PLDR-LLM-KVG-cache
SentencePiece Tokenizer Wrapper implementation for PLDR-LLM with KV cache and G-cache
lashebir/de-en-translator
German to English translator using a Seq2Seq transformer
mahdertesf/SentencePiece-and-Byte-Pair-Encoding-BPE-Implementation
This repository provides a hands-on exploration of SentencePiece tokenization and Byte-Pair Encoding (BPE). The code demonstrates data preprocessing steps like NFKC normalization and lossless tokenization, followed by a practical implementation of the BPE algorithm from scratch.
mddanish00/temp-sentencepiece-build
Temporary repo for building a 3.13 wheel for sentencepiece until a new version comes out.
NikitaGoldashevsky/NMT-Transformer
A TensorFlow-based Transformer model for English-to-Russian neural machine translation. Features subword tokenization with SentencePiece, hyperparameter optimization via genetic algorithm, and a Flask web interface for real-time translations.