niieani/gpt-tokenizer

comparison to other tokenizers

Closed this issue · 2 comments

This library looks great.

I tried to add it to https://github.com/transitive-bullshit/compare-tokenizers, but kept running into various ESM import issues.

I'd love to compare it to the other Node.js tokenizers on a consistent test set for both accuracy and speed.

Also, the one thing this library is currently missing (as far as I could tell; I wasn't able to get it working in my test bed) is a dynamic function that returns the tokenizer for a given model name. I know the examples show how to do this statically with imports, but in many applications the model needs to be selectable at runtime.
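A minimal sketch of what such a lookup might look like, under stated assumptions: the function name `getEncoderForModel`, the mapping tables, and the toy encoders below are all hypothetical illustrations, not gpt-tokenizer's actual API.

```typescript
// Hypothetical sketch of a runtime model-name -> tokenizer lookup.
// The encoders here are toy stand-ins; in the real library they would
// be the per-encoding encode functions it already exports statically.
type Encoder = (text: string) => number[];

// Toy stand-in encoders (illustrative only).
const cl100kBaseEncode: Encoder = (text) =>
  Array.from(text, (c) => c.charCodeAt(0));
const p50kBaseEncode: Encoder = (text) =>
  Array.from(text, (c) => c.charCodeAt(0) + 1);

// Map model names to encodings, then encodings to encoders.
const modelToEncoding: Record<string, string> = {
  'gpt-4': 'cl100k_base',
  'gpt-3.5-turbo': 'cl100k_base',
  'text-davinci-003': 'p50k_base',
};

const encodingToEncoder: Record<string, Encoder> = {
  cl100k_base: cl100kBaseEncode,
  p50k_base: p50kBaseEncode,
};

// Resolve an encoder at runtime from a model name.
function getEncoderForModel(model: string): Encoder {
  const encoding = modelToEncoding[model];
  const encoder = encoding ? encodingToEncoder[encoding] : undefined;
  if (!encoder) {
    throw new Error(`Unknown model: ${model}`);
  }
  return encoder;
}
```

This shape keeps the static per-encoding exports as the source of truth while letting callers pick a model from configuration at runtime.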

Thanks!

Thanks @transitive-bullshit!
I saw the issue with default imports and fixed it; the latest version should work.

Submitted a PR to your comparison repo: transitive-bullshit/compare-tokenizers#3.
I see there's some room for improvement in my package's performance.
I believe the extra safety features of gpt-tokenizer are what's slowing it down currently.
I'll try to close the gap by making the safety checks (allowedSpecialTokens) optional.
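A minimal sketch of what opting out of the safety check might look like: the option name `checkSpecialTokens` and the encoder body are hypothetical, not gpt-tokenizer's actual API. The idea is that scanning the input for disallowed special tokens costs time, so skipping the scan when the caller explicitly opts out avoids that overhead.

```typescript
// Hypothetical sketch of an opt-out for special-token safety checks.
interface EncodeOptions {
  checkSpecialTokens?: boolean; // default true: safe but slower
}

const SPECIAL_TOKENS = ['<|endoftext|>'];

function encode(text: string, opts: EncodeOptions = {}): number[] {
  const check = opts.checkSpecialTokens ?? true;
  if (check) {
    // The safety scan: reject input containing raw special tokens.
    for (const tok of SPECIAL_TOKENS) {
      if (text.includes(tok)) {
        throw new Error(`Disallowed special token in input: ${tok}`);
      }
    }
  }
  // Toy byte-level encoding standing in for the real BPE step.
  return Array.from(text, (c) => c.charCodeAt(0));
}
```

Keeping the check on by default preserves safe behavior for existing callers, while hot paths that trust their input can turn it off.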

🎉 This issue has been resolved in version 2.1.1 🎉

The release is available on:

Your semantic-release bot 📦🚀