How can I get the correct output at the edge?
gablabelle opened this issue · 1 comment
gablabelle commented
Using issues for a question... Sorry about that.
When using the following code, the output doesn't match the output from the online tiktokenizer, and it has a different length:
import model from "tiktoken/encoders/cl100k_base";
import { init, Tiktoken } from "tiktoken/lite/init";
// @ts-expect-error
import wasm from "tiktoken/lite/tiktoken_bg.wasm?module";
export const runtime = "edge";
// ...
await init((imports) => WebAssembly.instantiate(wasm, imports));
const inputText = getChatGPTEncoding(messages, "gpt-3.5-turbo");
const encoding = new Tiktoken(
  model.bpe_ranks,
  model.special_tokens,
  model.pat_str
);
const tokens = encoding.encode(inputText);
encoding.free();
return new Response(`${tokens}`);
For the following input (the value saved into the inputText variable):
<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?<|im_end|>\n<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n
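The getChatGPTEncoding helper isn't shown in the snippet; here is a hypothetical reconstruction that produces the string above (the Message type and the function body are assumptions, not the project's actual code):

```typescript
type Message = { role: "user" | "assistant" | "system"; content: string };

// Hypothetical sketch of getChatGPTEncoding: wraps each message in the
// <|im_start|>/<|im_end|> chat markers and appends the final assistant prompt.
function getChatGPTEncoding(messages: Message[], _model: string): string {
  const body = messages
    .map((m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>`)
    .join("\n");
  return `${body}\n<|im_start|>assistant\n`;
}
```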
I get the following tokens for gpt-3.5-turbo at https://tiktokenizer.vercel.app/:
[100264, 882, 1734, 15339, 100265, 1734, 100264, 78191, 1734, 9906, 0, 2650, 649, 358, 7945, 499, 3432, 30, 100265, 1734, 100264, 882, 1734, 13347, 100265, 1734, 100264, 78191, 1734]
But when running the code I get the following tokens:
[27,91,318,5011,91,29,882,198,15339,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198,9906,0,2650,649,358,7945,499,3432,76514,91,318,6345,91,397,27,91,318,5011,91,29,882,198,13347,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198]
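The second, longer list looks like the chat markers being tokenized as ordinary text: the pair 27, 91 recurs exactly where <| appears in the input, so <|im_start|> is apparently split into many ordinary tokens instead of mapping to the single special ID 100264. A minimal self-contained sketch of the difference (a toy tokenizer, not the real BPE; per-character IDs are stand-ins, and only the special IDs 100264/100265 match cl100k_base):

```typescript
// Assumed special-token IDs, matching OpenAI's chat markers in cl100k_base.
const SPECIAL: Record<string, number> = {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
};

// Treat every character as an ordinary token (toy stand-in IDs): the marker
// string dissolves into one token per character.
function encodeOrdinary(text: string): number[] {
  return Array.from(text).map((ch) => ch.codePointAt(0)!);
}

// Recognize allowed special tokens first, then fall back to ordinary
// per-character encoding: each marker becomes a single ID.
function encodeWithSpecials(text: string): number[] {
  const out: number[] = [];
  let i = 0;
  while (i < text.length) {
    const hit = Object.keys(SPECIAL).find((s) => text.startsWith(s, i));
    if (hit) {
      out.push(SPECIAL[hit]);
      i += hit.length;
    } else {
      out.push(text.codePointAt(i)!);
      i += 1;
    }
  }
  return out;
}
```

With the markers recognized, encodeWithSpecials("<|im_start|>") is the single ID [100264], while encodeOrdinary("<|im_start|>") is twelve separate tokens, which mirrors the length mismatch seen above.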
gablabelle commented
After reading a few related issues (#12 and dqbd/tiktoken#65), here is what worked:
const encoding = new Tiktoken(
  model.bpe_ranks,
  {
    ...model.special_tokens,
    "<|im_start|>": 100264,
    "<|im_end|>": 100265,
    "<|im_sep|>": 100266,
  },
  model.pat_str
);
// "all" permits every registered special token to be encoded as a single token
const tokens = encoding.encode(inputText, "all");