Real-Time Capabilities:
Track spending by Product (cost chargeback) for each and every request
Rate limit by Product based on spending limits (429 rate-limiting response when the spending limit has been reached)
Additional Capabilities (any service):
Rate limiting based on Budget Alerts (by Product)
Logging via Event Hubs to a Data Lake
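The per-Product spend tracking and 429 cutoff described above can be sketched roughly as follows. This is a minimal illustration, not the actual APIM policy: the product names, limits, and function names are all assumptions.

```python
# Running spend and configured budget per APIM Product (illustrative values).
spend_by_product = {}
spending_limits = {"product-a": 100.0}

def record_request_cost(product: str, cost: float) -> None:
    """Accumulate the cost of one request against its Product (cost chargeback)."""
    spend_by_product[product] = spend_by_product.get(product, 0.0) + cost

def check_rate_limit(product: str) -> int:
    """Return HTTP 429 once the Product's spending limit is reached, else 200."""
    limit = spending_limits.get(product)
    if limit is not None and spend_by_product.get(product, 0.0) >= limit:
        return 429
    return 200
```

For example, two requests costing 60.0 each against a 100.0 limit would leave the first at 200 and trip the second into 429.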
Streaming responses do not include token information, so it must be calculated.
Prompt tokens are calculated using an additional Python Function API wrapper that uses tiktoken:
- Create
- Update
- Budget Alert Endpoint
- GetAll
- GetById
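The prompt-token calculation the Function wrapper performs with tiktoken could look like the sketch below. The function name and the fallback estimate are assumptions; the real wrapper's interface is not shown in these notes.

```python
def count_prompt_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count prompt tokens with the tiktoken BPE tokenizer, as the wrapper does."""
    try:
        import tiktoken  # BPE tokenizer library used by the Python Function wrapper
        enc = tiktoken.get_encoding(encoding_name)
        return len(enc.encode(text))
    except Exception:
        # Rough fallback (~4 characters per token) when tiktoken is unavailable.
        return max(1, len(text) // 4)
```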
Repo:
azure-rest-api-specs/specification/cognitiveservices/data-plane/AzureOpenAI/
Latency:
Cost and usage data is typically available within 8-24 hours, and budgets are evaluated against these costs every 24 hours.
Documentation:
https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/tutorial-acm-create-budgets
Cost API:
Attempted this, but it proved to be overly complicated. Cost and usage data is typically available only within 8-24 hours, and a polling mechanism would have to be created to call the Cost API for each monitored resource.
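The polling mechanism this approach would require might be sketched as below. `fetch_cost` is a hypothetical stand-in for a Cost Management API call; the resource IDs and 24-hour interval are assumptions matching the latency noted above.

```python
import time

def poll_costs(resource_ids, fetch_cost, interval_hours=24, iterations=1):
    """Poll each monitored resource's cost on a fixed schedule.

    fetch_cost(resource_id) is a placeholder for one Cost API call per resource.
    """
    latest = {}
    for i in range(iterations):
        for rid in resource_ids:
            latest[rid] = fetch_cost(rid)
        if i < iterations - 1:
            time.sleep(interval_hours * 3600)  # wait for the next cost refresh
    return latest
```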
Streaming Responses:
When "stream": true is added to the JSON payload, no token information is provided by the OpenAI service.
Prompt tokens are calculated using a Python Function (PyTokenizer) that wraps the BPE tokenizer library tiktoken.
Completion tokens are calculated by counting the SSE responses and subtracting 2.
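The completion-token count above (SSE events minus 2, presumably discounting the opening role delta and the terminating [DONE] event) can be sketched as follows; the SSE line format shown in the test is an assumption for illustration.

```python
def count_completion_tokens(sse_lines) -> int:
    """Approximate completion tokens: one per SSE data event, minus 2."""
    data_events = [ln for ln in sse_lines if ln.startswith("data:")]
    return max(0, len(data_events) - 2)
```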
Granularity of Cost Tracking:
The solution uses APIM Product subscription keys, but cost tracking can also be keyed on individual IDs, header values, etc.