niieani/gpt-tokenizer

comparison to other tokenizers

Closed this issue · 2 comments

This library looks great.

I tried to add it to https://github.com/transitive-bullshit/compare-tokenizers, but kept running into various ESM import issues.

I'd love to compare it to the other Node.js tokenizers on a consistent test set for both accuracy and speed.

Also, the one thing this library is currently missing (as far as I could tell; I wasn't able to get it working in my test bed) is a dynamic function that returns the tokenizer for a given model name. I know the examples show how to do this statically with imports, but in many applications the model needs to be selectable at runtime.
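A minimal sketch of what such a lookup might look like, under stated assumptions: the function name `getEncoderForModel`, the mapping tables, and the toy encoders below are all hypothetical illustrations, not gpt-tokenizer's actual API.

```typescript
// Hypothetical sketch of a runtime model-name -> tokenizer lookup.
// The encoders here are toy stand-ins; in the real library they would
// be the per-encoding encode functions it already exports statically.
type Encoder = (text: string) => number[];

// Toy stand-in encoders (illustrative only).
const cl100kBaseEncode: Encoder = (text) =>
  Array.from(text, (c) => c.charCodeAt(0));
const p50kBaseEncode: Encoder = (text) =>
  Array.from(text, (c) => c.charCodeAt(0) + 1);

// Map model names to encodings, then encodings to encoders.
const modelToEncoding: Record<string, string> = {
  'gpt-4': 'cl100k_base',
  'gpt-3.5-turbo': 'cl100k_base',
  'text-davinci-003': 'p50k_base',
};

const encodingToEncoder: Record<string, Encoder> = {
  cl100k_base: cl100kBaseEncode,
  p50k_base: p50kBaseEncode,
};

// Resolve an encoder at runtime from a model name.
function getEncoderForModel(model: string): Encoder {
  const encoding = modelToEncoding[model];
  const encoder = encoding ? encodingToEncoder[encoding] : undefined;
  if (!encoder) {
    throw new Error(`Unknown model: ${model}`);
  }
  return encoder;
}
```

This shape keeps the static per-encoding exports as the source of truth while letting callers pick a model from configuration at runtime.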

Thanks!

Thanks @transitive-bullshit!
I saw the issue with default imports and fixed it; the latest version should work.

Submitted a PR to your comparison repo: transitive-bullshit/compare-tokenizers#3.
I see there's some room for improvement in my package's performance.
I believe the extra safety features of gpt-tokenizer are what's slowing it down currently.
I'll try to close the gap by making the safety checks (allowedSpecialTokens) optional.
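A minimal sketch of what opting out of the safety check might look like: the option name `checkSpecialTokens` and the encoder body are hypothetical, not gpt-tokenizer's actual API. The idea is that scanning the input for disallowed special tokens costs time, so skipping the scan when the caller explicitly opts out avoids that overhead.

```typescript
// Hypothetical sketch of an opt-out for special-token safety checks.
interface EncodeOptions {
  checkSpecialTokens?: boolean; // default true: safe but slower
}

const SPECIAL_TOKENS = ['<|endoftext|>'];

function encode(text: string, opts: EncodeOptions = {}): number[] {
  const check = opts.checkSpecialTokens ?? true;
  if (check) {
    // The safety scan: reject input containing raw special tokens.
    for (const tok of SPECIAL_TOKENS) {
      if (text.includes(tok)) {
        throw new Error(`Disallowed special token in input: ${tok}`);
      }
    }
  }
  // Toy byte-level encoding standing in for the real BPE step.
  return Array.from(text, (c) => c.charCodeAt(0));
}
```

Keeping the check on by default preserves safe behavior for existing callers, while hot paths that trust their input can turn it off.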

🎉 This issue has been resolved in version 2.1.1 🎉

The release is available on:

Your semantic-release bot 📦🚀