A small test suite comparing the results and performance of popular Node.js BPE tokenizers for use with LLMs like GPT-3.
Check out OpenAI's tiktoken (Rust / Python) library for a reference implementation, and OpenAI's Tokenizer Playground to experiment with different inputs.
This repo only tests tokenizers aimed at text, not code-specific tokenizers like the ones used by Codex.
| Package / encoder | Average Time (ms) | Variance (ms) |
| --- | --- | --- |
| gpt3-tokenizer | 56132 | 334621 |
| gpt-3-encoder | 31148 | 333120 |
| @dqbd/tiktoken gpt2 | 9267 | 1490 |
| @dqbd/tiktoken text-davinci-003 | 9078 | 733 |

(lower times are better)
@dqbd/tiktoken, a WASM port of the official Rust tiktoken, is ~3-6x faster than the pure-JS variants, with significantly lower variance and memory overhead. 🔥
To reproduce:

```bash
npx tsx src/bench.ts
# or
pnpm build
node build/bench.mjs
```
This maps over an array of test fixtures in different languages and prints the number of tokens each tokenizer generates for each one. The counts occasionally differ for text-davinci-003 because it uses the p50k_base encoding, while the other three use GPT-2-style (r50k_base) encodings.
```
0) 5 chars "hello" → {
  'gpt3-tokenizer': 1,
  'gpt-3-encoder': 1,
  '@dqbd/tiktoken gpt2': 1,
  '@dqbd/tiktoken text-davinci-003': 1
}
1) 17 chars "hello 👋 world 🌍" → {
  'gpt3-tokenizer': 7,
  'gpt-3-encoder': 7,
  '@dqbd/tiktoken gpt2': 7,
  '@dqbd/tiktoken text-davinci-003': 7
}
2) 445 chars "Lorem ipsum dolor si..." → {
  'gpt3-tokenizer': 153,
  'gpt-3-encoder': 153,
  '@dqbd/tiktoken gpt2': 153,
  '@dqbd/tiktoken text-davinci-003': 153
}
3) 2636 chars "Lorem ipsum dolor si..." → {
  'gpt3-tokenizer': 939,
  'gpt-3-encoder': 939,
  '@dqbd/tiktoken gpt2': 939,
  '@dqbd/tiktoken text-davinci-003': 922
}
4) 246 chars "也称乱数假文或者哑元文本， 是印刷及排版..." → {
  'gpt3-tokenizer': 402,
  'gpt-3-encoder': 402,
  '@dqbd/tiktoken gpt2': 402,
  '@dqbd/tiktoken text-davinci-003': 402
}
5) 359 chars (Japanese text) → {
  'gpt3-tokenizer': 621,
  'gpt-3-encoder': 621,
  '@dqbd/tiktoken gpt2': 621,
  '@dqbd/tiktoken text-davinci-003': 621
}
6) 2799 chars "это текст-"рыба", та..." → {
  'gpt3-tokenizer': 2813,
  'gpt-3-encoder': 2813,
  '@dqbd/tiktoken gpt2': 2813,
  '@dqbd/tiktoken text-davinci-003': 2811
}
7) 658 chars "If the dull substanc..." → {
  'gpt3-tokenizer': 175,
  'gpt-3-encoder': 175,
  '@dqbd/tiktoken gpt2': 175,
  '@dqbd/tiktoken text-davinci-003': 170
}
8) 3189 chars "Enter [two Players a..." → {
  'gpt3-tokenizer': 876,
  'gpt-3-encoder': 876,
  '@dqbd/tiktoken gpt2': 876,
  '@dqbd/tiktoken text-davinci-003': 872
}
9) 17170 chars "ANTONY. [To CAESAR] ..." → {
  'gpt3-tokenizer': 5801,
  'gpt-3-encoder': 5801,
  '@dqbd/tiktoken gpt2': 5801,
  '@dqbd/tiktoken text-davinci-003': 5306
}
```
To reproduce:

```bash
npx tsx src/index.ts
# or
pnpm build
node build/index.mjs
```
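The per-fixture comparison above boils down to mapping every fixture through every encoder and collecting the token counts. A sketch of that shape, where the `countTokens` helper and the toy encoder map are hypothetical stand-ins for the repo's actual code:

```javascript
// Sketch of the token-count comparison: for each fixture, run every encoder
// and collect { encoderName: tokenCount }. The encoders here are toy
// stand-ins; the real suite plugs in gpt3-tokenizer, gpt-3-encoder, and
// @dqbd/tiktoken.
function countTokens(encoders, fixtures) {
  return fixtures.map((text) => {
    const counts = {};
    for (const [name, encode] of Object.entries(encoders)) {
      counts[name] = encode(text).length;
    }
    // Truncated preview plus char count, mirroring the output format above.
    return { preview: text.slice(0, 20), chars: text.length, counts };
  });
}

// Toy stand-ins (NOT real BPE encoders).
const encoders = {
  whitespace: (t) => t.split(/\s+/),
  chars: (t) => [...t],
};

console.log(countTokens(encoders, ["hello", "hello world"]));
```

Keeping the encoders behind a common `(text) => tokens` interface is what makes it easy to drop a new tokenizer into the comparison.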
MIT © Travis Fischer
If you found this project interesting, please consider sponsoring me or following me on Twitter.