google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
C++ · Apache-2.0
Issues
logprobs in the vocabulary file do not match the values computed from the tokenized training document
#1050 opened by pnugues - 0
Crashes on out of range inputs depending on other inputs
#1051 opened by colehaus - 0
With the unigram algorithm, a constant piece at the end of each sentence does not become a token
#1047 opened by jogardi - 0
Builds for Android devices
#1045 opened by RaoufiTech - 0
AttributeError: type object 'SentencePieceTrainer' has no attribute 'train'. Did you mean: 'Train'?
#1046 opened by bop578530 - 1
Decode tokens one by one
#1044 opened by nigelzzz - 2
Decoding one token at a time does not show spaces
#1043 opened by nigelzzz - 2
Why is the Hugging Face encoding greater by 1 than the Google SentencePiece encoding when using the XLM-RoBERTa SentencePiece tokenizer?
#1042 opened by RaoufiTech - 11
Runtime error on iOS
#1010 opened by l3utterfly - 2
pip subprocess to install build dependencies did not run successfully (exit code: 1)
#989 opened by Anubiiss - 1
trainer_interface.cc: Integer value -1 is outside the valid range of values [0, 255] for the enumeration type 'ScriptType'
#1028 opened by kcoul - 0
No typings in Python package
#1030 opened by marcospgp - 0
Zero Width Joiner issue for Sinhala Language
#1031 opened by Nadil-K - 0
When I set SPM_PROTOBUF_PROVIDER to "package" in CMakeLists.txt, the compilation fails.
#1029 opened by hhxdestiny - 1
Install command-line tools without sudo
#1025 opened by zjesko - 1
Error
#1026 opened by silentghost1412 - 3
How to deal with IDs
#1023 opened by 980202006 - 0
Wrong calculation of max_score in unigram_model.cc
#1024 opened by fairydreaming - 3
Resume/restart tokenizer training
#1018 opened by ganeshkrishnan1 - 1
How long does it take to train 31.2GB text data?
#1021 opened by Mintchocolater - 3
Tokenization for phonetic languages
#1009 opened by divyeshrajpura4114 - 1
I want to obtain a model file using my vocab!
#1017 opened by scj0709 - 1
Build sentencepiece with mingw
#1006 opened by Kreijstal - 2
Tokenize at the word level without spacers nor joiners
#1001 opened by HURIMOZ - 4
Is GGUF supported?
#997 opened by micheledellaguardia - 2
Windows pip Dependency Installation Error
#990 opened by Nick- - 0
Support for Windows Python 3.12.2
#994 opened by Nick- - 1
Error when running this command: pip install 'transformers[tf-cpu]' on macOS
#993 opened by ambadumbuya - 1
Any API for setting user-defined symbols?
#991 opened by zhangyuhanjc - 1
Inconsistent results between Python and C++
#992 opened by Lewis-Lu - 3
Only Pretokenization
#988 opened by SeverinoDaDalt - 0
High-frequency token segmented into a letter sequence when the input is a TSV file
#967 opened by TingxunShi - 4
Allow whitespace-only pieces
#984 opened by bauwenst - 2
Sequence of byte '<0x09>' as token
#982 opened by SeverinoDaDalt - 1
TSV for NFC normalization
#983 opened by JaumePrats - 2
Many tests fail
#977 opened by yurivict - 1
Error while installing the library "sentence-transformers", which has a dependency on "sentencepiece"
#968 opened by AnkitBaliyan1 - 2
google.protobuf packages not found
#973 opened by CharlinChen - 1
RuntimeError
#965 opened by fkurushin