julep-ai/julep

Multimodal inputs: count image tokens

Closed this issue · 0 comments

Description:

This update enables a more accurate estimation of token usage for multimodal chatml messages, improving the system's ability to manage and optimize resource usage.

To support multimodal models, the content field of chatml messages can now be a list of parts, e.g. [{type: "image_url", image_url: {...}}, {type: "text", text: "..."}] and so on. Each image part also has a detail setting, which defaults to "auto" but can be set to "low" or "high", like this: [..., {type: "image_url", image_url: {url: "...", detail: "low"}}]

  • "detail": "low" is simple: a flat 85 tokens per image.
  • "detail": "high" requires the calculation described below.
  • "detail": "auto" means the model decides; for estimation purposes we have to assume "high".
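As a sketch of the parsing side, the snippet below walks a chatml content list and resolves each image part's effective detail setting, treating "auto" as "high" per the rule above. The helper name is hypothetical, not from the julep codebase:

```python
# Sketch: extract the effective detail setting for each image part of a
# chatml message whose content may be a plain string or a list of typed
# parts. `iter_image_details` is a hypothetical helper name.

def iter_image_details(content):
    """Yield the effective detail setting for each image part."""
    if not isinstance(content, list):
        return  # plain string content has no image parts
    for part in content:
        if isinstance(part, dict) and part.get("type") == "image_url":
            detail = part.get("image_url", {}).get("detail", "auto")
            # "auto" lets the model decide, so assume the costly path
            yield "high" if detail == "auto" else detail

content = [
    {"type": "text", "text": "What is in this picture?"},
    {"type": "image_url",
     "image_url": {"url": "https://example.com/a.png", "detail": "low"}},
    {"type": "image_url",
     "image_url": {"url": "https://example.com/b.png"}},  # no detail -> auto
]
print(list(iter_image_details(content)))  # → ['low', 'high']
```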

Relevant code locations:

Documentation:

According to the pricing page, every image is resized (if too large) to fit within a 1024x1024 square, and is first globally described by 85 base tokens.

Tiles
To be fully recognized, an image is covered by 512x512 tiles.
Each tile provides 170 tokens. So, by default, the formula is the following:
total tokens = 85 + 170 * n, where n = the number of tiles needed to cover your image.
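The formula above can be sketched as follows. The resize step assumes the image is scaled down proportionally to fit the 1024x1024 square, as the pricing documentation quoted here describes; the function name is illustrative:

```python
import math

BASE_TOKENS = 85   # per-image base cost from the pricing page
TILE_TOKENS = 170  # cost per 512x512 tile

def high_detail_tokens(width: int, height: int) -> int:
    """Token cost for a "high" detail image: 85 + 170 * n tiles."""
    # Resize (if too large) to fit within a 1024x1024 square,
    # preserving the aspect ratio.
    scale = min(1.0, 1024 / max(width, height))
    width, height = int(width * scale), int(height * scale)
    # n = number of 512x512 tiles needed to cover the resized image.
    n = math.ceil(width / 512) * math.ceil(height / 512)
    return BASE_TOKENS + TILE_TOKENS * n

print(high_detail_tokens(512, 512))    # 1 tile  → 85 + 170 = 255
print(high_detail_tokens(1024, 1024))  # 4 tiles → 85 + 680 = 765
```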

Final expected outcome:

  • The protocol in agents-api/agents_api/common/protocol/entries.py now includes functionality for calculating tokens for image parts from input chatml messages.
  • The Entry model's token_count property calculation has been updated to account for image parts alongside text content.
  • For image parts, a token count is calculated per image to approximate the complexity and information content images contribute to the chatml messages.
  • The image token count is added to the total tokens of that message.
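The expected outcome above could be folded into the message-level calculation roughly as sketched below. This is illustrative, not the actual agents-api implementation: the worst-case high-detail constant assumes a full 1024x1024 image (85 + 170 * 4), since image dimensions are not known without fetching the image, and the text tokenizer is passed in as a callable:

```python
# Sketch: approximate the token count of a chatml message whose content
# may mix text and image parts. Names are hypothetical; the real logic
# would live in the Entry model's token_count property in
# agents-api/agents_api/common/protocol/entries.py.

LOW_DETAIL_TOKENS = 85
HIGH_DETAIL_TOKENS = 765  # worst case: full 1024x1024 image, 85 + 170 * 4

def estimate_message_tokens(content, count_text_tokens) -> int:
    """Approximate token usage for a message's content field."""
    if isinstance(content, str):
        return count_text_tokens(content)
    total = 0
    for part in content:
        if part.get("type") == "text":
            total += count_text_tokens(part["text"])
        elif part.get("type") == "image_url":
            detail = part.get("image_url", {}).get("detail", "auto")
            # "auto" is assumed to be "high"; without the image's
            # dimensions we budget for the worst-case tile count.
            total += LOW_DETAIL_TOKENS if detail == "low" else HIGH_DETAIL_TOKENS
    return total
```

A naive char-based tokenizer (e.g. `lambda s: len(s) // 4`) could stand in for a real one during testing; in production this would be the same tokenizer already used for text-only entries.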