niieani/gpt-tokenizer

This module is not ready for CJK characters

Closed this issue · 3 comments

We found that this module is not ready for CJK characters when we type ここに内容を入力すると、消費されるメダルの数が計算されます。 (Japanese: "When you enter content here, the number of medals consumed will be calculated.")

OpenAI shows:

[Screenshot 2023-06-08 15:11:21: token count from the OpenAI Tokenizer]

This module shows:

[Screenshot 2023-06-08 15:12:04: token count from gpt-tokenizer]

The token count is different from OpenAI's.

xnohat commented

Above, you used the GPT-3 encoder; below, you used the cl100k_base encoder (for GPT-3.5 and GPT-4).
They are two different token encoders, so they produce two different token sets as output.
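For illustration, a minimal sketch of that difference, assuming the encoding-specific entry points shown in the project's README (the exact counts depend on the string):

```ts
// Encode the same CJK string with both encoders; the token sequences
// (and therefore the counts) differ because the vocabularies differ.
import { encode as encodeP50k } from 'gpt-tokenizer/encoding/p50k_base'
import { encode as encodeCl100k } from 'gpt-tokenizer/encoding/cl100k_base'

const text = 'ここに内容を入力すると、消費されるメダルの数が計算されます。'

console.log('p50k_base (GPT-3):', encodeP50k(text).length, 'tokens')
console.log('cl100k_base (GPT-3.5/4):', encodeCl100k(text).length, 'tokens')
```

Comparing either count against a tokenizer page that uses the other encoding will always look like a mismatch, even though both encoders are working correctly.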

I checked the output for the same string with p50k_base, and it gives the same result as the OpenAI Tokenizer.
I also tested a longer string (800 characters) and the token count was the same.
I think it's working fine with CJK.
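A sketch of that verification, again assuming the p50k_base entry point from the README; the decode round-trip is an extra sanity check that CJK text survives tokenization intact:

```ts
import { encode, decode } from 'gpt-tokenizer/encoding/p50k_base'

const text = 'ここに内容を入力すると、消費されるメダルの数が計算されます。'
const tokens = encode(text)

console.log(tokens.length)           // should match the OpenAI Tokenizer page
console.log(decode(tokens) === text) // true: tokens decode back to the input
```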

As folks in the replies explained, you have selected the incorrect encoder. The tokenizer works correctly.