PygmalionAI/aphrodite-engine

[Feature]: An alternative to `max_tokens` which defaults to `minimum(max_tokens, remaining_tokens)`

Opened this issue · 0 comments

🚀 The feature, motivation and pitch

Here, remaining_tokens would be --max-model-len minus the number of prompt tokens. So if you specify 1024, you'll get up to 1024 tokens, unless, e.g., the prompt has 3500 tokens and --max-model-len is 4000, in which case you'll get up to 500 tokens.
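
To make the proposed default concrete, here's a minimal sketch of the clamping logic (the function and parameter names are illustrative, not aphrodite-engine internals):

```typescript
// Proposed default, sketched: clamp the requested budget to whatever the
// context window can still hold after the prompt.
function effectiveMaxTokens(
  requestedMaxTokens: number, // e.g. 1024 from the request
  maxModelLen: number,        // the server's --max-model-len, e.g. 4000
  promptTokens: number        // tokens consumed by the prompt, e.g. 3500
): number {
  const remainingTokens = maxModelLen - promptTokens;
  return Math.min(requestedMaxTokens, remainingTokens);
}

// effectiveMaxTokens(1024, 4000, 3500) === 500
// effectiveMaxTokens(1024, 4000, 1000) === 1024
```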

This default is more helpful in some situations: it can reduce the number of places in a system where the tokenizer needs to be loaded/known and where token-limit calculations have to be done.

In my case, I'm working on an in-browser eval framework where it's known up front that all prompts plus outputs will total under 4096 tokens, and it would be handy not to have to load the tokenizer in the browser (with Transformers.js). Currently I have to, because if I specify max_tokens: 1024, I'll get an error for prompts that have, e.g., 3500 tokens.

Alternatives

Currently my workaround is to load the tokenizer on the client with Transformers.js, count the prompt tokens, and set max_tokens accordingly.
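
For reference, the workaround looks roughly like this (a sketch, assuming Transformers.js's AutoTokenizer and aphrodite's OpenAI-compatible /v1/completions endpoint; the model id and server URL are placeholders):

```typescript
import { AutoTokenizer } from "@huggingface/transformers";

// Has to be kept in sync with the server's --max-model-len by hand.
const MAX_MODEL_LEN = 4096;
const REQUESTED_MAX_TOKENS = 1024;

async function complete(prompt: string) {
  // Load the tokenizer in the browser just to count prompt tokens.
  const tokenizer = await AutoTokenizer.from_pretrained("placeholder/model-id");
  const promptTokens = tokenizer.encode(prompt).length;

  // Clamp client-side so prompt + max_tokens never exceeds the context window.
  const maxTokens = Math.min(REQUESTED_MAX_TOKENS, MAX_MODEL_LEN - promptTokens);

  const response = await fetch("http://localhost:2242/v1/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "placeholder/model-id",
      prompt,
      max_tokens: maxTokens,
    }),
  });
  return response.json();
}
```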

Additional context

TGI made this change based on user feedback.

To be clear, a breaking change is not needed here. There could be a new boolean param that changes the behavior of max_tokens, or a new param that is mutually exclusive with max_tokens (an error is thrown if both are specified) and has the behavior described above.
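
For illustration only, the two non-breaking options could look something like this (the parameter names below are made up, not proposed names):

```typescript
// Option A: a boolean flag that switches max_tokens to the clamped behavior.
const optionA = {
  prompt: "...",
  max_tokens: 1024,
  clamp_max_tokens: true, // hypothetical name
};

// Option B: a separate parameter, mutually exclusive with max_tokens
// (the server errors out if both are supplied).
const optionB = {
  prompt: "...",
  max_tokens_or_remaining: 1024, // hypothetical name
};
```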