niieani/gpt-tokenizer

This module is not ready for CJK characters

Closed this issue · 3 comments

We found that this module is not ready for CJK characters when we type ここに内容を入力すると、消費されるメダルの数が計算されます。 (Japanese: "When you enter content here, the number of medals consumed will be calculated.")

OpenAI shows:

[Screenshot 2023-06-08 15:11:21: token count from the OpenAI Tokenizer]

This module shows:

[Screenshot 2023-06-08 15:12:04: token count from gpt-tokenizer]

The token count is different from OpenAI's.

xnohat commented

Above, you used the GPT-3 encoder; below, you used the cl100k_base encoder (for GPT-3.5 and GPT-4).
They are two different token encoders, so they produce two different token sets as output.
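For illustration, a minimal sketch of that difference, assuming the encoding-specific entry points shown in the project's README (the exact counts depend on the string):

```ts
// Encode the same CJK string with both encoders; the token sequences
// (and therefore the counts) differ because the vocabularies differ.
import { encode as encodeP50k } from 'gpt-tokenizer/encoding/p50k_base'
import { encode as encodeCl100k } from 'gpt-tokenizer/encoding/cl100k_base'

const text = 'ここに内容を入力すると、消費されるメダルの数が計算されます。'

console.log('p50k_base (GPT-3):', encodeP50k(text).length, 'tokens')
console.log('cl100k_base (GPT-3.5/4):', encodeCl100k(text).length, 'tokens')
```

Comparing either count against a tokenizer page that uses the other encoding will always look like a mismatch, even though both encoders are working correctly.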

I checked the output for the same string with p50k_base, and it gives the same result as the OpenAI Tokenizer.
I also tested a longer string (800 characters) and the token count was the same.
I think it's working fine with CJK.
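A sketch of that verification, again assuming the p50k_base entry point from the README; the decode round-trip is an extra sanity check that CJK text survives tokenization intact:

```ts
import { encode, decode } from 'gpt-tokenizer/encoding/p50k_base'

const text = 'ここに内容を入力すると、消費されるメダルの数が計算されます。'
const tokens = encode(text)

console.log(tokens.length)           // should match the OpenAI Tokenizer page
console.log(decode(tokens) === text) // true: tokens decode back to the input
```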

As folks in the replies explained, you have selected the incorrect encoder. The tokenizer works correctly.