Use LiteLLM to 20x your throughput - load balance between Azure, OpenAI (litellm router docs)
import os

from litellm import Router

model_list = [{  # list of model deployments
    "model_name": "gpt-3.5-turbo",  # model alias
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2",  # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "gpt-3.5-turbo",
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
# router.completion is synchronous; inside async code use `await router.acompletion(...)`
response = router.completion(model="gpt-3.5-turbo",
                             messages=[{"role": "user", "content": "Hey, how's it going?"}])
print(response)
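To make the alias idea concrete, here is a minimal sketch of how several deployments sharing one model_name could be rotated through. It is not LiteLLM's actual routing logic: `build_rotations`, `pick_deployment`, and the round-robin policy are assumptions chosen purely for illustration, and the example list carries no API keys.

```python
import itertools
from collections import defaultdict


def build_rotations(model_list):
    """Group deployments by alias and return one endless round-robin iterator per alias."""
    groups = defaultdict(list)
    for deployment in model_list:
        groups[deployment["model_name"]].append(deployment["litellm_params"]["model"])
    return {alias: itertools.cycle(models) for alias, models in groups.items()}


# Hypothetical deployments mirroring the shape of the model_list above.
example_model_list = [
    {"model_name": "gpt-3.5-turbo", "litellm_params": {"model": "azure/chatgpt-v-2"}},
    {"model_name": "gpt-3.5-turbo", "litellm_params": {"model": "azure/chatgpt-functioncalling"}},
    {"model_name": "gpt-3.5-turbo", "litellm_params": {"model": "gpt-3.5-turbo"}},
]

rotations = build_rotations(example_model_list)


def pick_deployment(alias):
    """Return the next deployment name registered under the given alias."""
    return next(rotations[alias])


picked = [pick_deployment("gpt-3.5-turbo") for _ in range(4)]
print(picked)
```

Callers only ever name the alias; which concrete deployment serves a given request is decided by the rotation.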
⚡️ Get 0 dropped requests for your LLM app in production ⚡️
When a request to your LLM app fails, reliableGPT handles it by:
- Retrying with an alternate model - GPT-4, GPT-3.5, GPT-3.5 16k, text-davinci-003
- Retrying with a larger context window model for Context Window Errors
- Sending a Cached Response (using semantic similarity)
- Retrying with a fallback API key for Invalid API Key errors
- Join us on Discord or Email us at ishaan@berri.ai & krrish@berri.ai
- Talk to Founders: Learn more / get help onboarding: Meeting Scheduling Link
pip install reliableGPT
Integrating with OpenAI, Azure OpenAI, Langchain, LlamaIndex
import openai
from reliablegpt import reliableGPT

openai.ChatCompletion.create = reliableGPT(openai.ChatCompletion.create, user_email='ishaan@berri.ai')
If you run into errors, try pinning this version:
pip install reliableGPT==0.2.976
- Specify a fallback strategy for handling failed requests: For instance, you can define fallback_strategy=['gpt-3.5-turbo', 'gpt-4', 'gpt-3.5-turbo-16k', 'text-davinci-003'], and if you hit an error, reliableGPT will retry with the specified models in the given order until it receives a valid response. This is optional; reliableGPT also has a default strategy it uses.
- Specify backup tokens: Using your OpenAI keys across multiple servers, and just got one rotated? You can pass backup keys using add_keys(). We will store these and go through them in case any keys get rotated by OpenAI. For security, we use special tokens, and you can delete all your keys as well (using delete_keys()).
- Context Window Errors: For context window errors, it automatically retries your request with models that have larger context windows.
- Caching: If model fallback and retries fail, reliableGPT also provides caching (hosted, not in-memory). You can turn this on with caching=True. This also works for request timeout / task queue depth issues. This is optional; scroll down to learn more.
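The ordered-fallback behaviour described in the first bullet can be sketched in a few lines. This is an illustration of the idea only, not reliableGPT's internals: `call_with_fallback` and `fake_call` are hypothetical names, and the fake call stands in for a real API request.

```python
def call_with_fallback(call_model, fallback_strategy):
    """Try each model in order; return (model, response) for the first success."""
    errors = []
    for model in fallback_strategy:
        try:
            return model, call_model(model)
        except Exception as exc:  # a real wrapper would catch narrower error types
            errors.append((model, exc))
    raise RuntimeError(f"all fallbacks failed: {errors}")


# Hypothetical stand-in for an API call: only gpt-3.5-turbo-16k "works" here.
def fake_call(model):
    if model != "gpt-3.5-turbo-16k":
        raise TimeoutError(f"{model} timed out")
    return f"response from {model}"


used_model, result = call_with_fallback(
    fake_call, ["gpt-3.5-turbo", "gpt-4", "gpt-3.5-turbo-16k", "text-davinci-003"]
)
print(used_model, result)
```

Because the strategy is just a list, repeating a model name in it means that model is attempted more than once before moving on.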
Here's everything you can pass to reliableGPT
Parameter | Type | Required/Optional | Description |
---|---|---|---|
openai.ChatCompletion.create | OpenAI method | Required | This is a method from OpenAI, used for calling the OpenAI chat endpoints |
user_email | string/list | Required | Used to alert you on spikes in errors. You can set user_email to one email (user_email = "ishaan@berri.ai") or to several (user_email = ["ishaan@berri.ai", "krrish@berri.ai"]) if you want alerts sent to multiple emails |
fallback_strategy | list | Optional | You can define a custom fallback strategy of OpenAI models you want to try. If you want to try one model several times, just repeat it, e.g. ['gpt-4', 'gpt-4', 'gpt-3.5-turbo'] will try gpt-4 twice before trying gpt-3.5-turbo |
model_limits_dir | dict | Optional | Note: Required if using queue_requests = True. For models you want to handle rate limits for, set model_limits_dir = {"gpt-3.5-turbo": {"max_token_capacity": 1000000, "max_request_capacity": 10000}}. You can find your account rate limits here: https://platform.openai.com/account/rate-limits |
user_token | string | Optional | Pass your user token if you want us to handle OpenAI Invalid Key Errors - we'll rotate through your stored keys (more on this below) till we get one that works |
azure_fallback_strategy | List[string] | Optional | Pass your backup Azure deployment/engine IDs. In case your requests start failing, we'll switch to one of these (if you also pass in a backup OpenAI key, we'll try the Azure endpoints before the raw OpenAI ones) |
backup_openai_key | string | Optional | Pass your OpenAI API key if you're using Azure and want to switch to OpenAI in case your requests start failing |
caching | bool | Optional | Cache your OpenAI responses. Used as a backup in case model fallback fails or the queue is overloaded (if your servers are being overwhelmed with requests, it'll alert you and return cached responses, so that customer requests don't get dropped) |
max_threads | int | Optional | Pass this in alongside caching=True for it to handle the overloaded queue scenario |
If you're seeing high traffic and want to make sure all your users get a response, wrap your query endpoint with reliableCache. It monitors for high thread utilization and responds with cached responses.
from reliablegpt import reliableCache
# max_threads: the maximum number of threads you've allocated for flask to run (by default this is 1).
# query_arg: the variable name you're using to pass the user query to your endpoint (Assuming this is in the params/args)
# customer_instance_arg: unique identifier for that customer's instance (we'll put all cached responses for that customer within this bucket)
# user_email: [REQUIRED] your user email - we will alert you when you're seeing high utilization
cache = reliableCache(max_threads=20, query_arg="query", customer_instance_arg="instance_id", user_email="krrish@berri.ai")
e.g. The number of threads for this flask app is 50
if __name__ == "__main__":
    from waitress import serve
    serve(app, host="0.0.0.0", port=4000, threads=50)
## Decorate your endpoint with cache.cache_wrapper, this monitors for ..
## .. high thread utilization and sends cached responses when that happens
@app.route("/test_func")
@cache.cache_wrapper
def test_fn():
    # your endpoint logic
    pass
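The idea behind a thread-aware cache wrapper can be sketched without Flask. The class below is a hypothetical illustration, not reliableCache itself: it counts in-flight calls and, once that count reaches max_threads, answers from a local dictionary instead of running the endpoint. The exact-match lookup is an assumption made to keep the sketch short.

```python
import functools
import threading


class ThreadAwareCache:
    """Serve cached responses when too many calls are already in flight."""

    def __init__(self, max_threads):
        self.max_threads = max_threads
        self.active = 0
        self.lock = threading.Lock()
        self.cache = {}

    def cache_wrapper(self, fn):
        @functools.wraps(fn)
        def wrapped(query):
            with self.lock:
                overloaded = self.active >= self.max_threads
                if not overloaded:
                    self.active += 1
            if overloaded:
                # Assumption: exact-match lookup stands in for a similarity search.
                return self.cache.get(query, "server busy, no cached answer")
            try:
                result = fn(query)
                with self.lock:
                    self.cache[query] = result
                return result
            finally:
                with self.lock:
                    self.active -= 1
        return wrapped


cache_sketch = ThreadAwareCache(max_threads=1)


@cache_sketch.cache_wrapper
def answer(query):
    return f"fresh answer to {query!r}"


first = answer("hi")        # computed and stored
cache_sketch.active = 1     # simulate a saturated worker pool
second = answer("hi")       # served from the cache
unseen = answer("new")      # nothing cached for this query
cache_sketch.active = 0
print(first, second, unseen)
```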
If you're using Azure OpenAI and facing issues like Read/Request Timeouts, Rate limits, etc., you can use reliableGPT 💪 to fall back to the raw OpenAI endpoints if your Azure OpenAI endpoint fails
import os

import openai
from reliablegpt import reliableGPT

# Set the backup OpenAI key. Note: this is stored locally.
openai.ChatCompletion.create = reliableGPT(
    openai.ChatCompletion.create,
    user_email="krrish@berri.ai",
    backup_openai_key=os.getenv("OPENAI_API_KEY"),
    fallback_strategy=["gpt-4", "gpt-4-32k"],
    verbose=True)

# bad key
openai.api_key = "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"

list_questions = ["Hey! how's it going?"]  # placeholder list; replace with your own questions
for question in list_questions:
    response = openai.ChatCompletion.create(model="gpt-4", engine="chatgpt-test", messages=[{"role": "user", "content": question}])
    print(response)
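As a rough sketch of the ordering described for azure_fallback_strategy and backup_openai_key (Azure deployments first, raw OpenAI models only when a backup key is supplied), the helper below builds the list of attempts. `build_attempt_order` is a hypothetical function, not part of reliableGPT, and the second engine id is made up.

```python
def build_attempt_order(azure_fallback_strategy, fallback_strategy, backup_openai_key=None):
    """List (provider, model_or_engine) pairs in the order they would be tried."""
    attempts = [("azure", engine) for engine in azure_fallback_strategy]
    if backup_openai_key:
        # Raw OpenAI models are only reachable when a backup key was supplied.
        attempts.extend(("openai", model) for model in fallback_strategy)
    return attempts


order = build_attempt_order(
    azure_fallback_strategy=["chatgpt-test", "chatgpt-backup"],  # second id is made up
    fallback_strategy=["gpt-4", "gpt-4-32k"],
    backup_openai_key="sk-placeholder",
)
azure_only = build_attempt_order(["chatgpt-test"], ["gpt-4"])
print(order)
print(azure_only)
```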
If all else fails, reliableGPT will respond with previously cached responses. We store these in a Supabase table and use cosine similarity for similarity-based retrieval. Why not an in-memory cache? Because we don't want the cache to be wiped out when we autoscale or push new updates to our server.
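For readers unfamiliar with similarity-based retrieval, here is a small self-contained sketch. It is not the hosted cache: the bag-of-words `embed` function, the 0.5 threshold, and the in-memory dictionary are all assumptions standing in for a real embedding model and a database table.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a learned embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lookup(cache, query, threshold=0.5):
    """Return the cached response for the most similar stored question, if similar enough."""
    query_vec = embed(query)
    best_question = max(cache, key=lambda q: cosine_similarity(query_vec, embed(q)), default=None)
    if best_question is None:
        return None
    if cosine_similarity(query_vec, embed(best_question)) < threshold:
        return None
    return cache[best_question]


cached = {
    "how do i reset my password": "Use the reset link on the login page.",
    "what are your opening hours": "We are open 9 to 5 on weekdays.",
}
hit = lookup(cached, "how can i reset my password")
miss = lookup(cached, "tell me a joke")
print(hit, miss)
```

A near-duplicate question retrieves the stored answer, while an unrelated one falls below the threshold and returns nothing.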
import openai
from reliablegpt import reliableGPT

# Turn on caching
openai.ChatCompletion.create = reliableGPT(
    openai.ChatCompletion.create,
    user_email="krrish@berri.ai",
    caching=True)
Tell reliableGPT the maximum number of threads you have handling requests.
e.g. The number of threads for this flask app is 50
if __name__ == "__main__":
    from waitress import serve
    serve(app, host="0.0.0.0", port=4000, threads=50)
Pass that number to reliableGPT as max_threads=50
# Turn on caching and set the thread limit
openai.ChatCompletion.create = reliableGPT(
    openai.ChatCompletion.create,
    user_email="krrish@berri.ai",
    caching=True,
    max_threads=50)
Check out ./reliablegpt/tests/test_Caching
We spin up a flask server, and then run a test script to run a set of questions against the flask server.
from reliablegpt import add_keys, delete_keys, reliableGPT
# Storing your keys
user_email = "krrish@berri.ai"  # Replace with your email
token = add_keys(user_email, ["openai_key_1", "openai_key_2", "openai_key_3"])
Pass in a list of your OpenAI keys. We will store these and go through them in case any keys get rotated by OpenAI. You will get a special token; give that to reliableGPT.
import openai
openai.api_key = "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" ## Invalid OpenAI key
print("Initializing reliableGPT 💪")
openai.ChatCompletion.create = reliableGPT(openai.ChatCompletion.create, user_email=user_email, user_token=token)
reliableGPT 💪 catches the Invalid API Key error thrown by OpenAI and rotates through the remaining keys to ensure you have zero downtime in production.
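The rotation step can be pictured with a short sketch. Everything here is hypothetical: `InvalidKeyError` stands in for the provider's invalid-key error, and `fake_request` replaces a real API call so the example runs offline.

```python
class InvalidKeyError(Exception):
    """Hypothetical stand-in for the provider's invalid-API-key error."""


def call_with_key_rotation(make_request, keys):
    """Try each stored key in turn; return (key, response) for the first one accepted."""
    for key in keys:
        try:
            return key, make_request(key)
        except InvalidKeyError:
            continue  # this key was rejected; move on to the next one
    raise InvalidKeyError("no valid key left")


VALID_KEYS = {"openai_key_3"}  # pretend only the last key is still valid


def fake_request(key):
    if key not in VALID_KEYS:
        raise InvalidKeyError(key)
    return "ok"


working_key, reply = call_with_key_rotation(
    fake_request, ["openai_key_1", "openai_key_2", "openai_key_3"]
)
print(working_key, reply)
```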
# Deleting your keys from reliableGPT 🫡
delete_keys(user_email = user_email, user_token=token)
You own your keys, and can delete them whenever you want.
Reach out to us on Discord or Email us at ishaan@berri.ai & krrish@berri.ai