niieani/gpt-tokenizer

Tokens are slightly different from OpenAI Tokenizer

Closed · 1 comment

Using the sentence "Welcome to gpt-tokenizer. Replace this with your text to see how tokenization works.", I get 20 tokens from the OpenAI Tokenizer but 19 from https://gpt-tokenizer.dev/.

[Screenshots: gpt-tokenizer.dev (left), OpenAI Tokenizer (right)]

Another example:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Facilisis gravida neque convallis a cras semper auctor neque vitae. Nunc mattis enim ut tellus elementum sagittis vitae et leo. Tellus rutrum tellus pellentesque eu tincidunt tortor aliquam nulla facilisi. Volutpat lacus laoreet non curabitur gravida arcu ac. Diam phasellus vestibulum lorem sed risus ultricies tristique nulla aliquet.

Using gpt-tokenizer.dev: 119 tokens
Using OpenAI Tokenizer: 158 tokens
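
A side-by-side count can make a mismatch like this easier to reproduce. The sketch below is a minimal, self-contained illustration in TypeScript, not the library's own code. The `Encoder` type, `compareTokenCounts`, and the two stand-in encoders are hypothetical names introduced here so the snippet runs without dependencies. One possible cause, not confirmed in this issue, is that the two tools use different encodings, in which case different counts for the same text would be expected.

```typescript
// An encoder maps text to a list of token ids; only the list length matters here.
type Encoder = (text: string) => number[];

interface TokenCount {
  name: string;
  tokens: number;
}

// Run every encoder on the same text and collect the resulting token counts.
function compareTokenCounts(
  text: string,
  encoders: Record<string, Encoder>,
): TokenCount[] {
  return Object.entries(encoders).map(([name, encode]) => ({
    name,
    tokens: encode(text).length,
  }));
}

// Stand-in encoders so the sketch runs on its own.
// They are NOT real BPE tokenizers and will not match either tool's counts.
const byWhitespace: Encoder = (text) =>
  text
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((_, index) => index);

const byCharacter: Encoder = (text) =>
  text.split("").map((ch) => ch.charCodeAt(0));

const sample = "Welcome to gpt-tokenizer.";
for (const { name, tokens } of compareTokenCounts(sample, { byWhitespace, byCharacter })) {
  console.log(`${name}: ${tokens} tokens`);
}
```

In a real check, the stand-ins would be replaced by the encoders being compared, for example the `encode` export of gpt-tokenizer and whichever encoding the other tool uses.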