Real-Time Capabilities:
Track spending by Product (cost chargeback) for each and every request
Rate limit by Product based on spending limits (429 rate-limiting response when the spending limit has been reached)
Additional Capabilities (any service):
Rate limiting based on Budget Alerts (by Product)
Logging via Event Hubs to a Data Lake
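The per-Product spend tracking and 429 cutoff described above can be sketched roughly as follows. This is a minimal illustration, not the actual APIM policy: the product names, limits, and function names are all assumptions.

```python
# Running spend and configured budget per APIM Product (illustrative values).
spend_by_product = {}
spending_limits = {"product-a": 100.0}

def record_request_cost(product: str, cost: float) -> None:
    """Accumulate the cost of one request against its Product (cost chargeback)."""
    spend_by_product[product] = spend_by_product.get(product, 0.0) + cost

def check_rate_limit(product: str) -> int:
    """Return HTTP 429 once the Product's spending limit is reached, else 200."""
    limit = spending_limits.get(product)
    if limit is not None and spend_by_product.get(product, 0.0) >= limit:
        return 429
    return 200
```

For example, two requests costing 60.0 each against a 100.0 limit would leave the first at 200 and trip the second into 429.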
Streaming responses do not include token information, so it must be calculated.
Prompt tokens are calculated using an additional Python Function API wrapper that uses tiktoken:
- Create
- Update
- Budget Alert Endpoint
- GetAll
- GetById
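The prompt-token calculation the Function wrapper performs with tiktoken could look like the sketch below. The function name and the fallback estimate are assumptions; the real wrapper's interface is not shown in these notes.

```python
def count_prompt_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count prompt tokens with the tiktoken BPE tokenizer, as the wrapper does."""
    try:
        import tiktoken  # BPE tokenizer library used by the Python Function wrapper
        enc = tiktoken.get_encoding(encoding_name)
        return len(enc.encode(text))
    except Exception:
        # Rough fallback (~4 characters per token) when tiktoken is unavailable.
        return max(1, len(text) // 4)
```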
Repo:
azure-rest-api-specs/specification/cognitiveservices/data-plane/AzureOpenAI/
Latency:
Cost and usage data is typically available within 8-24 hours, and budgets are evaluated against these costs every 24 hours.
Documentation:
https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/tutorial-acm-create-budgets
Cost API:
Attempted this, but it proved to be overly complicated. Cost and usage data is typically available only within 8-24 hours, and a polling mechanism would have to be created to call the Cost API for each monitored resource.
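The polling mechanism this approach would require might be sketched as below. `fetch_cost` is a hypothetical stand-in for a Cost Management API call; the resource IDs and 24-hour interval are assumptions matching the latency noted above.

```python
import time

def poll_costs(resource_ids, fetch_cost, interval_hours=24, iterations=1):
    """Poll each monitored resource's cost on a fixed schedule.

    fetch_cost(resource_id) is a placeholder for one Cost API call per resource.
    """
    latest = {}
    for i in range(iterations):
        for rid in resource_ids:
            latest[rid] = fetch_cost(rid)
        if i < iterations - 1:
            time.sleep(interval_hours * 3600)  # wait for the next cost refresh
    return latest
```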
Streaming Responses:
When "stream": true is added to the JSON payload, no token information is provided by the OpenAI service.
Prompt tokens are calculated using a Python Function (PyTokenizer) that wraps the BPE tokenizer library tiktoken.
Completion tokens are calculated by counting the SSE responses and subtracting 2.
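The completion-token count above (SSE events minus 2, presumably discounting the opening role delta and the terminating [DONE] event) can be sketched as follows; the SSE line format shown in the test is an assumption for illustration.

```python
def count_completion_tokens(sse_lines) -> int:
    """Approximate completion tokens: one per SSE data event, minus 2."""
    data_events = [ln for ln in sse_lines if ln.startswith("data:")]
    return max(0, len(data_events) - 2)
```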
Granularity of Cost Tracking:
The solution uses APIM Product subscription keys, but cost tracking can also be keyed on individual IDs, header values, etc.