simonkurtz-MSFT/python-openai-loadbalancer

Account for `retry-after-ms` header

Opened this issue · 0 comments

Presently, the module only accounts for the retry-after header containing a value in seconds. Azure OpenAI Provisioned Throughput (PTU) deployments also pass back the retry-after-ms header containing a value in milliseconds. This value may be preferential to seconds.

From the PTU documentation:

*A 429 response indicates that the allocated PTUs are fully consumed at the time of the call. The response includes the retry-after-ms and retry-after headers that tell you the time to wait before the next call will be accepted. *

As such, there is only a longer delay when not using retry-after-ms when there is simply no PTU left to consume (overall and/or tokens-per-minute?). This may be an acceptable situation that may not surface too often.


One approach may be to convert everything internal to the module to milliseconds. These resources provide the starting point for the enhancement:

cc @kristapratico