dqbd/tiktokenizer

How can I get the correct output at the edge?

gablabelle opened this issue · 1 comment

Using issues for a question... Sorry about that.

When I use the following code in an edge runtime, the output doesn't match the output from the online tiktokenizer, and the token array has a different length:

      import model from "tiktoken/encoders/cl100k_base";
      import { init, Tiktoken } from "tiktoken/lite/init";
      // @ts-expect-error
      import wasm from "tiktoken/lite/tiktoken_bg.wasm?module";

      export const runtime = "edge";
      // ...

      await init((imports) => WebAssembly.instantiate(wasm, imports));
      const inputText = getChatGPTEncoding(messages, "gpt-3.5-turbo");
      const encoding = new Tiktoken(
        model.bpe_ranks,
        model.special_tokens,
        model.pat_str
      );
      const tokens = encoding.encode(inputText);
      encoding.free();
      return new Response(`${tokens}`);

For the following input (the value stored in the inputText variable):

<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?<|im_end|>\n<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n

I get the following tokens for gpt-3.5-turbo at https://tiktokenizer.vercel.app/ :

[100264, 882, 1734, 15339, 100265, 1734, 100264, 78191, 1734, 9906, 0, 2650, 649, 358, 7945, 499, 3432, 30, 100265, 1734, 100264, 882, 1734, 13347, 100265, 1734, 100264, 78191, 1734]

But when running the code I get the following tokens:

[27,91,318,5011,91,29,882,198,15339,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198,9906,0,2650,649,358,7945,499,3432,76514,91,318,6345,91,397,27,91,318,5011,91,29,882,198,13347,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198]

After reading issues #12 and dqbd/tiktoken#65, here is what worked: registering the chat special tokens explicitly and passing "all" as the second argument to encode:

      const encoding = new Tiktoken(
        model.bpe_ranks,
        {
          ...model.special_tokens,
          "<|im_start|>": 100264,
          "<|im_end|>": 100265,
          "<|im_sep|>": 100266,
        },
        model.pat_str
      );
      const tokens = encoding.encode(inputText, "all");
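
To illustrate why registering the markers changes the result, here is a minimal toy sketch. It is not tiktoken's implementation; `splitOnSpecialTokens` and the `Piece` type are hypothetical names made up for this example. It only models the idea that a marker found in the special-token map becomes a single ID, while an unregistered marker stays ordinary text that would then be split into several regular tokens. It does not model the "all" argument passed to encode above.

```typescript
// Toy model only: shows how a special-token map affects splitting.
// This is NOT how tiktoken is implemented internally.
type Piece =
  | { kind: "special"; id: number }
  | { kind: "text"; text: string };

function splitOnSpecialTokens(
  text: string,
  specialTokens: Record<string, number>
): Piece[] {
  const names = Object.keys(specialTokens);
  const pieces: Piece[] = [];
  let rest = text;

  while (rest.length > 0) {
    // Find the earliest registered marker in the remaining text.
    let bestIndex = -1;
    let bestName = "";
    for (const name of names) {
      const idx = rest.indexOf(name);
      if (idx !== -1 && (bestIndex === -1 || idx < bestIndex)) {
        bestIndex = idx;
        bestName = name;
      }
    }

    // No registered marker left: everything else is ordinary text.
    if (bestIndex === -1) {
      pieces.push({ kind: "text", text: rest });
      break;
    }

    // Ordinary text before the marker, then the marker as one ID.
    if (bestIndex > 0) {
      pieces.push({ kind: "text", text: rest.slice(0, bestIndex) });
    }
    pieces.push({ kind: "special", id: specialTokens[bestName] });
    rest = rest.slice(bestIndex + bestName.length);
  }

  return pieces;
}

// Usage: the same input with and without the chat markers registered.
const sample = "<|im_start|>user\nHi<|im_end|>\n";
const unregistered = splitOnSpecialTokens(sample, {});
const registered = splitOnSpecialTokens(sample, {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
});
console.log(unregistered.length, registered.length);
```

With an empty map the whole input stays one text piece, which matches the symptom in the issue where `<|im_start|>` came back as several ordinary tokens instead of 100264.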