Add token cost tracking
krschacht opened this issue · 8 comments
I think a very first PR could consist of: internally tracking how much cost ($) every message and conversation has incurred, so that a user can keep a close eye on their total spend this month.
High level:
- Add database columns to store cost information
- Wire into the chat logic to keep the cost information updated at the appropriate points in message creation/updating
- Add simple display to the front-end that is something as simple as a single query like
user.messages.created_after(Date.current.beginning_of_month).sum(:estimated_cost)
Off the top of my head, here is how I think an implementation could go:
- Add columns to the messages table such as token_count and price
- Open backend/open_ai.rb and find the point where we're actually calling the API (client.chat) and add this new flag include_usage: true (explained here and documented here with example code here)
- I think the key thing to validate is: does this final chunk that includes usage definitely report on both output and input? Hopefully so. Meaning, we submit a request to OpenAI with a bunch of tokens (input) and then it replies with a bunch of tokens (output). The way I'd figure this out is to simply put a breakpoint where the chunks come in. That is the stream_handler method in this same file.
- We can double-check the token counting ourselves by putting more breakpoints right where we call client.chat and counting the number of tokens we are submitting. Then, after the message finishes streaming, it gets saved to the database, so we can just count the number of tokens in Message.last.content_text
- Once we confirm that this last chunk contains our token count, the content_chunks get passed all the way up to the worker right here so we can set message.token_count
- Anthropic, similarly, includes token counts in their streaming chunks. On this page if you search for the string "usage" it shows that their first streamed chunk shows the input tokens and their last streamed chunk shows the output tokens: https://docs.anthropic.com/en/api/messages-streaming The Anthropic backend is backend/anthropic.rb
- We can now use SQL to sum the total tokens used during a period, but we want price. I think we do a migration to add a price_per_token column to the language_models table. We can populate the value for all of our language models from these references: openai and anthropic
- Back in get_next_ai_message_job where we are setting token_count we should also do the math and set message.price
- Then I think we can display it somewhere on the person/edit page which is views/settings/people/_form.html.erb. I think the query will just be Message.created_after(Date.current.beginning_of_month).sum(:price)
- This is a python token cost library that may provide some useful reference
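To make the chunk-handling steps above concrete, here is a minimal sketch of pulling usage out of streamed chunks. It assumes the chunks have already been parsed into plain Hashes with string keys. UsageTracker and its method names are hypothetical and not part of the app or of any gem. In the OpenAI API the flag is nested as stream_options: { include_usage: true }, and the extra final chunk has an empty "choices" array plus a populated "usage" hash. Anthropic reports input tokens on "message_start" and a cumulative output count on "message_delta".

```ruby
# Hypothetical helper: accumulates token usage from already-parsed stream chunks.
class UsageTracker
  attr_reader :input_tokens, :output_tokens

  def initialize
    @input_tokens = 0
    @output_tokens = 0
  end

  # OpenAI: only the final chunk carries a non-nil "usage" hash.
  def record_openai_chunk(chunk)
    usage = chunk["usage"]
    return unless usage

    @input_tokens = usage["prompt_tokens"].to_i
    @output_tokens = usage["completion_tokens"].to_i
  end

  # Anthropic: input tokens arrive first, output tokens arrive (cumulatively) later.
  def record_anthropic_event(event)
    case event["type"]
    when "message_start"
      @input_tokens = event.dig("message", "usage", "input_tokens").to_i
    when "message_delta"
      @output_tokens = event.dig("usage", "output_tokens").to_i
    end
  end
end

# Sample chunks shaped like the providers' documented payloads (values made up).
openai = UsageTracker.new
openai.record_openai_chunk({ "choices" => [{ "delta" => { "content" => "Hi" } }], "usage" => nil })
openai.record_openai_chunk({ "choices" => [], "usage" => { "prompt_tokens" => 9, "completion_tokens" => 12 } })

anthropic = UsageTracker.new
anthropic.record_anthropic_event({ "type" => "message_start", "message" => { "usage" => { "input_tokens" => 25, "output_tokens" => 1 } } })
anthropic.record_anthropic_event({ "type" => "message_delta", "usage" => { "output_tokens" => 15 } })
```

Whether the real stream_handler sees these exact shapes is still the thing to verify with a breakpoint, as described above.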
I doubt this price we are tracking will be perfect so we'll display it as an estimated price to the user. It looks like we may need to do some additional calculations for function calling. This should probably be a subsequent PR, but some notes I've collected:
- Explicit reference in anthropic docs about extra tokens used with function calling
- Old discussion on openai about calculating function costs
- I can't find any actual mention of this in the openai docs, so maybe their streamed token count already includes it? But why wouldn't anthropic streamed token counts include it? Future investigation...
An estimated_input_token_count on the message seems useful, but if I understand correctly, we'll need to also add up all of the token counts of prior messages, assuming all of them are sent.
For example, suppose we send 3 messages, each with 100 tokens, and we get 3 replies, each with 100 tokens, and the system message is 200 tokens. Our first message will be 200+(100+100)=400 tokens, our second message will be 200+(100+100)+(100+100)=600 tokens, and our third message will be 200+(100+100)*3=800 tokens. Given the OpenAI pricing for GPT-4o of $5 for 1M input tokens and $15 for 1M output tokens, and with our 500 input tokens and 300 output tokens, we'll get some cost. Is this your understanding?
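As a rough check of that arithmetic, here is a small sketch that splits each request into input and output tokens, assuming the full history plus the system message is resent every time. The helper name conversation_usage is made up for illustration, and the prices are the GPT-4o figures quoted above. Under those assumptions the three requests carry 300, 500 and 700 input tokens (1,500 in total) and 100 output tokens each.

```ruby
require "bigdecimal"

# Prices quoted in the comment above, in dollars per 1M tokens.
INPUT_PRICE_PER_1M = BigDecimal("5")
OUTPUT_PRICE_PER_1M = BigDecimal("15")

# Hypothetical helper: per-request token usage when the whole history is resent.
def conversation_usage(system_tokens:, user_tokens:, reply_tokens:)
  history = 0
  user_tokens.zip(reply_tokens).map do |user, reply|
    input = system_tokens + history + user
    history += user + reply
    { input: input, output: reply }
  end
end

usage = conversation_usage(
  system_tokens: 200,
  user_tokens: [100, 100, 100],
  reply_tokens: [100, 100, 100]
)

total_input = usage.sum { |request| request[:input] }
total_output = usage.sum { |request| request[:output] }

# Dollars for the whole conversation.
total_cost = (total_input * INPUT_PRICE_PER_1M + total_output * OUTPUT_PRICE_PER_1M) / 1_000_000
```

Input plus output per request comes to 400, 600 and 800 tokens, matching the totals above, and the whole conversation costs about $0.012 at these prices.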
As well, when we have very long conversations, there will be a moment when some of the preceding messages may be dropped or summarized to reduce the token usage, or at least get it within the context limit. In other words, the number of tokens in the latest message may not help us fully work out the cost of the request. It's almost like we need to keep track of each individual API request and the number of (estimated) input and output tokens.
If the API can give us the actual number of input tokens, even better, which seems possible with OpenAI if we pass the include_usage streaming option. I don't see any mention of include_usage inside the OpenAI ruby gem git repo (search query), but it might be something that can be surfaced there.
@matthewbennink Yes, that's why I was thinking we update estimated_price twice. When we are generating a new message, we pass a newly created message into get_next_message_job (it's persisted to the db already) which, in turn, passes it into ai_backend.
I think the moment that ai_backend is sending its request to the API, which includes all of the previous messages in the conversation, we can add up the tokens and save the preliminary estimated_cost on the blank message.
Then when the response comes back to get_next_message_job, we do a final message.save and we can do one more token cost estimate and add it onto the estimated_price we previously calculated.
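A minimal sketch of that two-step update, using a stand-in Struct instead of the real Message model. The method names and the per-token prices (derived from the GPT-4o figures mentioned earlier in the thread) are assumptions for illustration only.

```ruby
require "bigdecimal"

# Assumed prices in dollars per token ($5 and $15 per 1M tokens).
INPUT_PRICE_PER_TOKEN = BigDecimal("5") / 1_000_000
OUTPUT_PRICE_PER_TOKEN = BigDecimal("15") / 1_000_000

# Stand-in for the persisted message record.
Message = Struct.new(:estimated_cost)

# Step 1: when the request is sent, record the cost of everything submitted.
def record_input_estimate(message, input_tokens)
  message.estimated_cost = input_tokens * INPUT_PRICE_PER_TOKEN
end

# Step 2: when the response has finished, add the cost of the generated tokens.
def add_output_estimate(message, output_tokens)
  message.estimated_cost += output_tokens * OUTPUT_PRICE_PER_TOKEN
end

message = Message.new(BigDecimal("0"))
record_input_estimate(message, 300)
add_output_estimate(message, 100)
```

With 300 input tokens and 100 output tokens, both steps contribute $0.0015, for an estimate of $0.003 on the message.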
I didn't know about include_usage, that's cool! The OpenAI gem just passes the hash of params that we send straight on to OpenAI, so it should be supported.
Do you think it's acceptable to store the estimated price as a float inside the database? It'd be per message, and so they'd all be very small values that added up may include some amount of rounding error. It might just average out in the end and/or it might be fine as an estimate.
The alternative would be to store the token counts, perhaps store the input/prompt token count on the "user" messages and store the output/completion token count on the "assistant" messages. (I'm not sure if we'd need to represent "tool" messages differently. Are there other message roles I'm missing?) The monthly price estimate would then need to find all of those messages, sum the token counts by language model, and multiply each token count by its respective cost. That doesn't seem like it'd be particularly slow. E.g.,
# Each sum is token_count * (millionths of cents per 1M tokens), so divide by 10^12 to get cents.
month_start = Date.current.beginning_of_month
input_cost = Message.user.created_after(month_start).joins(assistant: :language_model).sum("messages.token_count * language_models.input_cost_per_1m_tokens_in_millionths_of_cents")
output_cost = Message.assistant.created_after(month_start).joins(assistant: :language_model).sum("messages.token_count * language_models.output_cost_per_1m_tokens_in_millionths_of_cents")
total_cost_in_cents = (input_cost + output_cost) / 1e12
I'm sure I've gotten some of that wrong, but maybe the idea is there. I've never had to represent small prices before, so struggling a bit there. I figure we want to find a way to represent, e.g. 1B tokens per 1¢ as a limit, and then you can use an integer to represent the cost of X cents per 1B tokens based on today's prices. So, $5 / 1M tokens might be represented as 500000, $.01 / 1M tokens as 1000, and $.00001 / 1M tokens as 1, which seems like a price point we'll never get to.
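The integer encoding in those examples works out to "cents per 1B tokens". A short sketch of that conversion follows, with hypothetical helper names. Rational is used so the per-message cost stays exact until it is displayed.

```ruby
require "bigdecimal"

# Hypothetical helper: encode a "$X per 1M tokens" price as integer cents per 1B tokens.
def cents_per_billion_tokens(dollars_per_million)
  (BigDecimal(dollars_per_million) * 100 * 1_000).to_i
end

# Hypothetical helper: exact cost in cents for a token count at an encoded rate.
def cost_in_cents(token_count, rate_in_cents_per_billion)
  Rational(token_count * rate_in_cents_per_billion, 1_000_000_000)
end

rates = ["5", "0.01", "0.00001"].map { |price| cents_per_billion_tokens(price) }
# rates is [500000, 1000, 1], matching the three examples above

message_cost = cost_in_cents(300, rates.first)
# 300 tokens at $5 / 1M tokens is 3/20 of a cent
```

The price string is passed to BigDecimal rather than a Float so that the conversion itself does not introduce rounding.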
I also think it'd be perfectly reasonable to store the costs as floats per 1M or 1B tokens and just go from there. So, $5000 / 1B, $10 / 1B, and $.01 / 1B in the examples above.
Given it's just an estimate, it's worth keeping things simple perhaps. But I wanted to lay out the distinction between storing very small prices per message like .00001 USD vs storing integer token values such as 300.
Once we have a data type, I'd be happy to open up a PR to keep things moving.
@matthewbennink hmm, my instinct is to just store the estimate. I think it should be fine to store it as a float. Is the concern you're raising that the estimate will somehow be worse if we store it as a float? I don't think I understand that. Or maybe what you're suggesting is that there is some rounding that will inevitably occur by storing small floats which wouldn't occur if we stored tokens? I guess the key question is: what's the accuracy of floats in a postgres table? I'm actually not sure of that. I can't think of a time I had to store tiny fractions of a float. That may be worth a little bit of investigating.
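One way to get a feel for the rounding question is this pure-Ruby sketch. Ruby's Float is an IEEE 754 double, the same representation as a Postgres double precision column (roughly 15 significant decimal digits), while BigDecimal behaves like an exact numeric/decimal column. The per-message cost used here is a made-up value.

```ruby
require "bigdecimal"

# Binary floats cannot represent most decimal fractions exactly.
exact_match = (0.1 + 0.2 == 0.3) # false

# Add up 10,000 hypothetical per-message costs of $0.0003 each.
# reduce is used on purpose: Array#sum applies compensated summation to Floats.
float_total = Array.new(10_000, 0.0003).reduce(0.0) { |total, cost| total + cost }
decimal_total = Array.new(10_000, BigDecimal("0.0003")).sum

float_error = (float_total - 3.0).abs
```

Any drift in the float total stays far below a cent at this scale, which supports treating a float as acceptable for an estimate. A decimal column remains the option if exact sums ever matter.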
I think that storing currency amounts rather than tokens will be a bit easier to deal with. It makes it so we can do a really nice query like Message.user.created_after(...).sum(:estimate). It's not like it's a whole lot more complicated to sum up the tokens, but I don't think we otherwise have any need for token counts beyond estimates so I think it's more straightforward to store estimates. Also, there may be multiple places we want to show estimates like maybe if your cost was really high for the month you might want to click to a detailed view and see cost per conversation. (This is super low priority) Or if it's a team account you may want to see cost by user. I think storing the column as the currency value makes it easier to do a whole range of different queries like this.
One small improvement: instead of storing a DOLLAR value store a CENTS value. So maybe the column is named: estimate_in_cents. By shaving off two decimal points we probably get a lot more accuracy and it's easier for us humans to read 0.03 cents than to read 0.0003 dollars.
And I lean towards each message having a single estimate — and that estimate is the cost for generating that whole message (both the input and output tokens required to generate that message). That could also facilitate a future auto-truncation of history when the per-message cost rises above some cutoff.
I don't think you need to think about tool messages any differently than text messages except in one respect:
- For the input tokens, there's a bunch of different params we pass to the assistant (system instructions, user history, tool data) so be sure to jam ALL of that into some token counter.
- For the output tokens it's not enough to do token_count(message.content_text). Instead you should probably add a helper method so you can simply do message.token_count, and this helper method looks at both content_text AND function_response (or whatever the column is called). A single message can include BOTH types of response in that same message. The API doesn't say that but I experience it in practice and the app handles that.
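A sketch of what such a helper could look like, using a plain Ruby class in place of the real model. The function_response name follows the guess above, and TokenEstimate is a deliberately crude placeholder (about four characters per token) standing in for a real tokenizer. Both are assumptions.

```ruby
# Placeholder token counter: a rough "4 characters per token" heuristic, not a real tokenizer.
module TokenEstimate
  def self.count(text)
    (text.length / 4.0).ceil
  end
end

# Stand-in for the app's Message model.
class Message
  attr_reader :content_text, :function_response

  def initialize(content_text: nil, function_response: nil)
    @content_text = content_text
    @function_response = function_response
  end

  # Counts tokens across BOTH kinds of output a single message may contain.
  def token_count
    [content_text, function_response].compact.sum { |part| TokenEstimate.count(part.to_s) }
  end
end

both = Message.new(content_text: "a" * 8, function_response: "b" * 4)
text_only = Message.new(content_text: "abcd")
```

Here both.token_count is 3 (2 for the text plus 1 for the function response) and text_only.token_count is 1. A message with neither field set counts as 0.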
I don't fully understand why we need the cost estimate as a database column. This is a fixed derived value from the token count and the current LLM price. If the LLM provider changes its pricing structure in the middle of a monthly period, one probably needs more than one simple magic number per LLM, but the token counts are the truth value from which all costs can be derived.
I understand that a per-LLM overall token count (for input/output tokens) could be used as an optimization means so that one doesn't need to calculate all tokens for a certain period on the fly. And this number could be calculated e.g. via a background job after each LLM roundtrip. The OpenAI cost overview and -detail page is also not exactly real-time, so a slight delay between each LLM round-trip and this calculated DB number should be acceptable.
What would also interest me as a regular user of HostedGPT is not only the effective cost, but also the token count itself. Having the possibility to make this switchable (by clicking on the numbers?) would be really nice. For non-English languages, the token count is often much higher.
@lumpidu Yes, good point on both. We don’t need cost on messages and we could cache things on the LLM.
I agree on seeing token count. It could just be a simple paren like “$14.32 (13,729 tokens)”
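For reference, that display string could be built with something like the following plain-Ruby sketch. The format_spend name and the choice of a cents input are assumptions. In a Rails view the built-in number_to_currency and number_with_delimiter helpers would likely be the more natural choice.

```ruby
# Hypothetical formatter: cost in cents plus a token count, as "$14.32 (13,729 tokens)".
def format_spend(cents, tokens)
  dollars = format("$%.2f", cents / 100.0)
  # Insert thousands separators by grouping digits from the right.
  grouped = tokens.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
  "#{dollars} (#{grouped} tokens)"
end

label = format_spend(1432, 13_729)
# label is "$14.32 (13,729 tokens)"
```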
Hi @lumpidu, I wanted to check in on this task and see if you had made any progress on it? And if not, let me know if you're still up for it.
@krschacht, probably later this week I will dive into it