This module is not ready for CJK characters
Closed this issue · 3 comments
mashihua commented
xnohat commented
Above you use the GPT-3 Encoder, and below you use the cl100k_base encoder for GPT-3.5 and GPT-4.
They are two different token encoders and produce two different sets of tokens as output.
foloinfo commented
I checked the output for the same string with p50k_base,
and it seems to give the same result as the OpenAI Tokenizer.
I also tested with a longer string (800 characters) and the number of tokens was the same.
I think it's working fine with CJK.
niieani commented
As folks in the replies explained, you selected the incorrect encoder. The tokenizer works correctly.