[Bug]: Error calling completion on a deployed VertexAI Model Garden LLM endpoint
What happened?
We have a Llama 3.1 8B model deployed from the Vertex AI Model Garden and made available for inference through a model endpoint. It takes input in a specific format and generates output as shown below.
JSON request:
curl \
-X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" \
"https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/endpoints/${ENDPOINT_ID}:predict" \
-d '{ "instances": [{"prompt": "What is machine learning?", "max_tokens": 100}] }'
Response:
{
"predictions": [
"Prompt:\nWhat is machine learning?\nOutput:\n A broad introduction\nMachine learning is..."
],
"deployedModelId": "xxxx",
"model": "projects/xxxx/locations/us-central1/models/llama-3-1-8b-instruct-172858156xxxx",
"modelDisplayName": "llama-3-1-8b-instruct-172858156xxxx",
"modelVersionId": "1"
}
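For reference, the same raw :predict call can be made with the google-cloud-aiplatform Python SDK. A minimal sketch, with placeholder project and endpoint IDs as in the curl example:

# Minimal sketch: the same :predict call via the Vertex AI Python SDK.
# "PROJECT_ID" and "ENDPOINT_ID" are placeholders, as in the curl example above.
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-central1")
endpoint = aiplatform.Endpoint(
    "projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID"
)

response = endpoint.predict(
    instances=[{"prompt": "What is machine learning?", "max_tokens": 100}]
)
print(response.predictions[0])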
We are using LiteLLM v1.50.0-stable, and we tried to configure the deployed Llama 3.1 model on LiteLLM as below:
{
"model_name": "vertex_ai/meta/llama3-8b-instruct-deployed",
"litellm_params": {
"vertex_project": "xxxxxxxxxxxxx",
"vertex_location": "us-central1",
"model": "vertex_ai/320911490117586xxxx"
},
...
}
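For completeness, a minimal sketch of the equivalent call through the LiteLLM Python SDK, using the same placeholder project and endpoint IDs as the config above (this triggers the same failure described below):

import litellm

response = litellm.completion(
    model="vertex_ai/320911490117586xxxx",  # Model Garden endpoint ID, as configured above
    messages=[{"role": "user", "content": "What is machine learning?"}],
    vertex_project="xxxxxxxxxxxxx",          # placeholder, as above
    vertex_location="us-central1",
)
print(response.choices[0].message.content)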
We then make a completion call with a typical payload:
curl -X POST 'https://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-xxxxxxxx' \
-d '{ "model": "vertex_ai/meta/llama3-8b-instruct-deployed", "messages": [ { "role": "user", "content": "What is the weather like in Boston today?"} ] }'
We get an HTTP 500 error response from LiteLLM, as shown below:
Error occurred while generating model response. Please try again.
Error: 500 litellm.APIConnectionError: 500 Internal Server Error
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/google/api_core/grpc_helpers_async.py", line 85, in __await__
    response = yield from self._call.__await__()
  File "/usr/local/lib/python3.11/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
  status = StatusCode.INTERNAL
  details = "Internal Server Error"
  debug_error_string = "UNKNOWN:Error received from peer ipv4:142.250.191.234:443 {grpc_message:"Internal Server Error", grpc_status:13, created_time:"2024-10-28T22:42:25.45051005+00:00"}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/litellm/main.py", line 455, in acompletion
    response = await init_response
  File "/usr/local/lib/python3.11/site-packages/litellm/llms/vertex_ai_and_google_ai_studio/vertex_ai_non_gemini.py", line 1125, in async_streaming
    response_obj = await llm_model.predict(
  File "/usr/local/lib/python3.11/site-packages/google/cloud/aiplatform_v1/services/prediction_service/async_client.py", line 404, in predict
    response = await rpc(
  File "/usr/local/lib/python3.11/site-packages/google/api_core/grpc_helpers_async.py", line 88, in __await__
    raise exceptions.from_grpc_error(rpc_error) from rpc_error
google.api_core.exceptions.InternalServerError: 500 Internal Server Error
Received Model Group=vertex_ai/meta/llama3-8b-instruct-deployed
Available Model Group Fallbacks=None
While analyzing the Vertex AI model endpoint logs, we found the error trace below, ending in:
TypeError: SamplingParams.__init__() got an unexpected keyword argument 'max_retries'
ERROR 2024-10-28T22:49:37.335386991Z ERROR: Exception in ASGI application
ERROR 2024-10-28T22:49:37.335420608Z Traceback (most recent call last):
ERROR 2024-10-28T22:49:37.335427761Z File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
ERROR 2024-10-28T22:49:37.335432529Z result = await app( # type: ignore[func-returns-value]
ERROR 2024-10-28T22:49:37.335437297Z File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
ERROR 2024-10-28T22:49:37.335441827Z return await self.app(scope, receive, send)
ERROR 2024-10-28T22:49:37.335447311Z File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
ERROR 2024-10-28T22:49:37.335451364Z await super().__call__(scope, receive, send)
ERROR 2024-10-28T22:49:37.335455417Z File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
ERROR 2024-10-28T22:49:37.335459470Z await self.middleware_stack(scope, receive, send)
ERROR 2024-10-28T22:49:37.335464Z File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
ERROR 2024-10-28T22:49:37.335468053Z raise exc
ERROR 2024-10-28T22:49:37.335471868Z File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
ERROR 2024-10-28T22:49:37.335475683Z await self.app(scope, receive, _send)
ERROR 2024-10-28T22:49:37.335479497Z File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
ERROR 2024-10-28T22:49:37.335483789Z await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
ERROR 2024-10-28T22:49:37.335487604Z File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
ERROR 2024-10-28T22:49:37.335491418Z raise exc
ERROR 2024-10-28T22:49:37.335495471Z File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR 2024-10-28T22:49:37.335499286Z await app(scope, receive, sender)
ERROR 2024-10-28T22:49:37.335503339Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 754, in __call__
ERROR 2024-10-28T22:49:37.335507154Z await self.middleware_stack(scope, receive, send)
ERROR 2024-10-28T22:49:37.335510969Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 774, in app
ERROR 2024-10-28T22:49:37.335515022Z await route.handle(scope, receive, send)
ERROR 2024-10-28T22:49:37.335518836Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 295, in handle
ERROR 2024-10-28T22:49:37.335522651Z await self.app(scope, receive, send)
ERROR 2024-10-28T22:49:37.335526943Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
ERROR 2024-10-28T22:49:37.335531234Z await wrap_app_handling_exceptions(app, request)(scope, receive, send)
ERROR 2024-10-28T22:49:37.335549354Z File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
ERROR 2024-10-28T22:49:37.335553407Z raise exc
ERROR 2024-10-28T22:49:37.335557222Z File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR 2024-10-28T22:49:37.335561037Z await app(scope, receive, sender)
ERROR 2024-10-28T22:49:37.335565090Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
ERROR 2024-10-28T22:49:37.335569620Z response = await f(request)
ERROR 2024-10-28T22:49:37.335573673Z File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
ERROR 2024-10-28T22:49:37.335577726Z raw_response = await run_endpoint_function(
ERROR 2024-10-28T22:49:37.335581541Z File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
ERROR 2024-10-28T22:49:37.335585832Z return await dependant.call(**values)
ERROR 2024-10-28T22:49:37.335590124Z File "/workspace/vllm/vllm/entrypoints/api_server.py", line 176, in generate
ERROR 2024-10-28T22:49:37.335594177Z sampling_params = SamplingParams(**request_dict)
ERROR 2024-10-28T22:49:37.335597991Z TypeError: SamplingParams.__init__() got an unexpected keyword argument 'max_retries'
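The last frame explains the 500: vLLM's plain api_server builds SamplingParams directly from the remaining request body, so any key it does not recognize raises a TypeError. A minimal sketch of that failure mode, assuming vllm is installed:

from vllm import SamplingParams

# vllm.entrypoints.api_server pops "prompt" from the JSON body and passes
# everything else straight into SamplingParams (see the trace above).
# "max_retries" stands in for the extra key LiteLLM included in the payload;
# its value here is illustrative.
request_dict = {"prompt": "What is machine learning?", "max_tokens": 100, "max_retries": 0}
request_dict.pop("prompt")

try:
    SamplingParams(**request_dict)
except TypeError as e:
    print(e)  # SamplingParams.__init__() got an unexpected keyword argument 'max_retries'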
@suresiva we already support vertex ai llama on model garden. Please look at the relevant docs - https://docs.litellm.ai/docs/providers/vertex#llama-3-api
@krrishdholakia, there are 2 ways to deploy Llama 3.1 on Vertex AI:
- Fully managed API service
  - Works well through LiteLLM and is easy to configure.
  - The documentation you linked - https://docs.litellm.ai/docs/providers/vertex#llama-3-api - applies to this.
  - The model is added to LiteLLM by its model name (i.e. meta/llama3-405b-instruct-maas).
- Self-deployed LLM endpoint from Model Garden
  - This is where we hit the errors posted in this thread.
  - The documentation above does not cover this setup.
  - Instead we followed this documentation - https://docs.litellm.ai/docs/providers/vertex#model-garden
  - The model is added to LiteLLM by its endpoint_id (i.e. vertex_ai/<endpoint_id>).

We are currently facing the error posted in this thread while using the second option (self-deployed LLM endpoint in Model Garden); the two model strings are contrasted in the sketch below. Please let us know how we can resolve the errors.
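A minimal sketch contrasting the two model strings (project and IDs are placeholders):

import litellm

common = dict(
    messages=[{"role": "user", "content": "hi"}],
    vertex_project="my-project",   # placeholder
    vertex_location="us-central1",
)

# 1) Fully managed API service: addressed by model name -- works today.
litellm.completion(model="vertex_ai/meta/llama3-405b-instruct-maas", **common)

# 2) Self-deployed Model Garden endpoint: addressed by endpoint ID -- fails
#    with the 500 error described in this thread.
litellm.completion(model="vertex_ai/320911490117586xxxx", **common)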
if you self deploy is it the same api spec? @suresiva
if so, it seems like we just need to let you specify this distinction - hey, this model follows the vertex/meta spec
@krrishdholakia , the self-deployed Llama 3.1 model follows a different request/response spec.
Request:
{ "instances": [{"prompt": "What is machine learning?", "max_tokens": 100}] }
Response:
{
"predictions": [
"Prompt:\nWhat is machine learning?\nOutput:\n A broad introduction\nMachine learning is..."
],
"deployedModelId": "xxxx",
"model": "projects/xxxx/locations/us-central1/models/llama-3-1-8b-instruct-172858156xxxx",
"modelDisplayName": "llama-3-1-8b-instruct-172858156xxxx",
"modelVersionId": "1"
}
Behind the scenes, this self-deployed Llama 3.1 model is actually served through the vllm.entrypoints.api_server entrypoint, which does not follow the OpenAI spec.
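To bridge the two specs, an adapter would have to flatten OpenAI-style chat messages into the single prompt string this server expects. A hypothetical sketch (the helper name and the prompt format are illustrative, not LiteLLM code):

def chat_to_predict_payload(messages, max_tokens=100):
    # Flatten OpenAI-style chat messages into the single "prompt" string
    # expected by vllm.entrypoints.api_server. The "role: content" format is
    # illustrative; a real adapter would apply the model's chat template.
    prompt = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return {"instances": [{"prompt": prompt, "max_tokens": max_tokens}]}

payload = chat_to_predict_payload(
    [{"role": "user", "content": "What is the weather like in Boston today?"}]
)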

@krrishdholakia , the request composed by LiteLLM when calling a custom LLM endpoint on Vertex AI contains an unexpected parameter, `optional_params.max_retries`, which causes the Vertex AI prediction call to fail and results in the HTTP 500 error; it surfaces as the SamplingParams.__init__() TypeError in the Vertex AI model endpoint log posted above.
Removing `optional_params.max_retries` before calling the custom Vertex AI model fixed the issue. Please check PR #6692.
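The fix amounts to filtering out client-side parameters such as max_retries before the instance payload is built. A simplified sketch of the idea (names are illustrative, not the actual PR code):

# Illustrative sketch of the fix: strip parameters that only concern the
# client (retry behavior) before they reach the Vertex AI "instances" payload.
CLIENT_ONLY_PARAMS = {"max_retries"}

def build_instance(prompt: str, optional_params: dict) -> dict:
    sampling_params = {
        k: v for k, v in optional_params.items() if k not in CLIENT_ONLY_PARAMS
    }
    return {"prompt": prompt, **sampling_params}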
It looks like all Vertex AI Model Garden models have this generic endpoint attached. All we need to do is route those models to this endpoint:
v1beta1/projects/{PROJECT_ID}/locations/{self.endpoint.location}/endpoints/openapi/chat/completions
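A minimal sketch of calling that route directly; the host, the format of the model field, and the IDs are assumptions based on the path above and on the :predict call earlier in this thread:

import subprocess
import requests

# Assumption: same regional host as the :predict call earlier; IDs are placeholders.
token = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()
url = (
    "https://us-central1-aiplatform.googleapis.com/v1beta1/"
    "projects/PROJECT_ID/locations/us-central1/endpoints/openapi/chat/completions"
)
resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {token}"},
    json={
        # Assumption: the deployed endpoint is referenced via the model field.
        "model": "projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID",
        "messages": [{"role": "user", "content": "What is machine learning?"}],
    },
)
print(resp.json())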