FireCoderAI/firecoder

[Feature] Use local tokenizer


FireCoder currently uses a tokenizer to determine the maximum prompt length for autocomplete. To do this, it sends the text to the llama.cpp tokenizer endpoint. However, this round trip takes time and is not available at all when the user works with a cloud backend. It is important to provide as much context as possible, but the current method has several issues.
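For reference, the remote flow looks roughly like the sketch below. The endpoint shape (`POST /tokenize` with a `content` field, returning `tokens`) follows llama.cpp's server API, but the URL, port, and function name are illustrative assumptions, and a runtime with a global `fetch` (Node 18+ / the VS Code extension host) is assumed:

```typescript
// Illustrative sketch of the current remote approach. The endpoint shape
// matches llama.cpp's server API; the URL and helper name are assumptions.
async function countTokensRemote(text: string): Promise<number> {
  const response = await fetch("http://localhost:8080/tokenize", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ content: text }),
  });
  const { tokens } = (await response.json()) as { tokens: number[] };
  return tokens.length; // one network round trip per token count
}
```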

  1. To use the llama.cpp tokenizer, the user must download the server and a model, which is inconvenient for users who want to work with the cloud.
  2. Preparing a prompt can take more than 2 seconds, which is too slow for autocomplete.
  3. FireCoder needs a complex algorithm to select the maximum suitable prompt length while making as few requests to llama.cpp as possible (sketched after this list).
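The kind of loop the third point forces is sketched below. This is a hypothetical illustration, not FireCoder's actual implementation; it reuses the assumed `countTokensRemote` helper from above, and every iteration costs a full round trip to the llama.cpp server:

```typescript
// Hypothetical illustration of prompt trimming against a remote tokenizer.
// Each check requires another request, which is why the selection
// algorithm tries to minimize the number of calls.
async function fitPromptRemote(text: string, maxTokens: number): Promise<string> {
  let candidate = text;
  while ((await countTokensRemote(candidate)) > maxTokens) {
    // Drop roughly 10% of the prompt from the front and re-check.
    candidate = candidate.slice(Math.floor(candidate.length * 0.1));
  }
  return candidate;
}
```

With a local tokenizer, the same trimming could run entirely in-process, with no per-iteration network cost.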

The solution is to use a local tokenizer that can be called directly from the extension. There are two possible options:

  1. Use the `tokenizers` library, but its Node.js bindings work poorly, so further investigation is needed.
  2. Use `transformers.js`, which should work well but still needs to be tested (see the sketch after this list).
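If `transformers.js` pans out, counting tokens locally could be as simple as the following sketch. The package and the `AutoTokenizer.from_pretrained` / `encode` API are real `transformers.js`; the model id is a placeholder assumption, since the real setup would load the tokenizer matching the completion model FireCoder uses:

```typescript
import { AutoTokenizer } from "@xenova/transformers";

// Count tokens without any server round trip. The model id below is a
// placeholder; substitute the tokenizer of the actual completion model.
async function countTokensLocal(text: string): Promise<number> {
  const tokenizer = await AutoTokenizer.from_pretrained("Xenova/gpt2");
  // encode() returns the token ids for the input string.
  return tokenizer.encode(text).length;
}
```

Note that `from_pretrained` downloads and caches the tokenizer files on first use, so after a one-time fetch this works offline and makes the cloud workflow possible.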