abetlen/llama-cpp-python

Include usage key in create_completion when streaming

zhudotexe opened this issue · 0 comments

Is your feature request related to a problem? Please describe.
Since create_completion may yield text chunks composed of multiple tokens per yield (e.g. when a multi-byte Unicode character spans several tokens), counting yields may not give the number of tokens the model actually generated. To get accurate usage statistics for a streamed completion, one currently has to run the final text through the tokenizer again, even though create_completion already tracks how many tokens it generated.

Describe the solution you'd like
When stream=True is passed to create_completion, the final chunk yielded should include the usage statistics under the 'usage' key.
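
For concreteness, a minimal sketch of how a caller might consume this, assuming the final chunk simply gains a 'usage' key alongside the existing OpenAI-style fields (the model path, prompt, and token counts are illustrative):

```python
from llama_cpp import Llama

llm = Llama(model_path="models/llama-7b.gguf")  # illustrative path

usage = None
for chunk in llm.create_completion("Q: Name the planets. A:", stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
    # Under this proposal, only the final chunk would carry a "usage" key:
    usage = chunk.get("usage", usage)

# e.g. {"prompt_tokens": 10, "completion_tokens": 32, "total_tokens": 42}
print(usage)
```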

Describe alternatives you've considered

  • Saving the full generated text and running it through the tokenizer again (wasteful, since the library has already counted the tokens; see the sketch after this list)
  • Counting the number of yields and hoping no multi-byte characters cause a chunk to contain more than one token (hacky and fragile)
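
A sketch of the first workaround, assuming llm is an existing Llama instance and prompt is defined; the re-tokenization step duplicates work the library already did during generation:

```python
text_parts = []
for chunk in llm.create_completion(prompt, stream=True):
    text_parts.append(chunk["choices"][0]["text"])

# Re-tokenize the full completion just to count tokens (the wasteful step):
completion_text = "".join(text_parts)
completion_tokens = len(llm.tokenize(completion_text.encode("utf-8"), add_bos=False))
```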

Additional context
The OpenAI API recently added similar support to its streaming API via the stream_options key: https://platform.openai.com/docs/api-reference/chat/create#chat-create-stream_options
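
For comparison, this is how the OpenAI Python client exposes it: when include_usage is set, one extra final chunk carries a usage object and an empty choices list (the model name below is illustrative):

```python
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.usage is not None:
        print(chunk.usage)  # prompt_tokens, completion_tokens, total_tokens
```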