google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
C++ · Apache-2.0
Issues
logprobs in the vocabulary file do not match the values computed from the tokenized training document
#1050 opened by pnugues - 0
Crashes on out of range inputs depending on other inputs
#1051 opened by colehaus - 0
With the unigram algorithm, a constant piece at the end of each sentence does not become a token
#1047 opened by jogardi - 0
Builds for Android devices
#1045 opened by RaoufiTech - 0
AttributeError: type object 'SentencePieceTrainer' has no attribute 'train'. Did you mean: 'Train'?
#1046 opened by bop578530 - 1
Decode tokens one by one
#1044 opened by nigelzzz - 2
Decoding one token at a time does not show spaces
#1043 opened by nigelzzz - 2
Why is the Hugging Face encoding greater by 1 than the Google SentencePiece encoding when using the XLM-RoBERTa SentencePiece tokenizer?
#1042 opened by RaoufiTech - 11
Runtime error on iOS
#1010 opened by l3utterfly - 2
pip subprocess to install build dependencies did not run successfully (exit code: 1)
#989 opened by Anubiiss - 1
trainer_interface.cc: Integer value -1 is outside the valid range of values [0, 255] for the enumeration type 'ScriptType'
#1028 opened by kcoul - 0
No typings in Python package
#1030 opened by marcospgp - 0
Zero Width Joiner issue for Sinhala Language
#1031 opened by Nadil-K - 0
When I set SPM_PROTOBUF_PROVIDER to "package" in CMakeLists.txt, the compilation fails.
#1029 opened by hhxdestiny - 1
Install command-line tools without sudo
#1025 opened by zjesko - 1
Error
#1026 opened by silentghost1412 - 3
How to deal with IDs
#1023 opened by 980202006 - 0
Wrong calculation of max_score in unigram_model.cc
#1024 opened by fairydreaming - 3
Resume/restart tokenizer training
#1018 opened by ganeshkrishnan1 - 1
How long does it take to train 31.2GB text data?
#1021 opened by Mintchocolater - 3
Tokenization for phonetic languages
#1009 opened by divyeshrajpura4114 - 1
I want to obtain a model file using my vocab!
#1017 opened by scj0709 - 1
Build sentencepiece with mingw
#1006 opened by Kreijstal - 2
Tokenize at the word level without spacers nor joiners
#1001 opened by HURIMOZ - 4
Is GGUF supported?
#997 opened by micheledellaguardia - 2
Windows pip Dependency Installation Error
#990 opened by Nick- - 0
Support for Windows Python 3.12.2
#994 opened by Nick- - 1
Error when running this command: pip install 'transformers[tf-cpu]' on macOS
#993 opened by ambadumbuya - 1
Any API for setting user-defined symbols?
#991 opened by zhangyuhanjc - 1
Inconsistent results between Python and C++
#992 opened by Lewis-Lu - 3
Only Pretokenization
#988 opened by SeverinoDaDalt - 0
High-frequency token segmented into a letter sequence when the input is a TSV file
#967 opened by TingxunShi - 4
Allow whitespace-only pieces
#984 opened by bauwenst - 2
Sequence of byte '<0x09>' as token
#982 opened by SeverinoDaDalt - 1
TSV for NFC normalization
#983 opened by JaumePrats - 2
Many tests fail
#977 opened by yurivict - 1
Error while installing the library "sentence-transformers", which has a dependency on "sentencepiece"
#968 opened by AnkitBaliyan1 - 2
google.protobuf packages not found
#973 opened by CharlinChen - 1
RuntimeError
#965 opened by fkurushin