Receiving Error: MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNAVAILABLE
MostafaOmar98 opened this issue · 8 comments
Hello, we have been seeing the following error:
<_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Socket closed", grpc_status:14, created_time:"2024-06-11T07:08:48.917638822+00:00"}"
>
Facts we know so far:
- It seems to be a transient error.
- It does not appear to be related to the Spanner server instance itself, but rather to the connection between our application and Spanner.
- It is not tied to one specific query or application; it happens across different queries in different services.
- The error can be masked with retries: https://cloud.google.com/spanner/docs/custom-timeout-and-retry. However, we see the rate of this error rise and fall in a way that seems arbitrary to us.
We contacted Google support, and they recommended raising the issue on the client library to get further insight. We acknowledge that we can mask this transient error by implementing a retry mechanism. However, we are very interested in understanding what causes it and which factors drive its rate up or down. A performance-critical service of ours is affected by this error, so we would like to keep the error rate minimal and stable before layering retries on top.
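For reference, the kind of retry we have in mind is a plain exponential-backoff wrapper around the query call. This is a minimal stdlib-only sketch; the `is_transient` predicate and the wrapped callable are placeholders, not part of our actual code:

```python
import random
import time

def with_retries(fn, *, attempts=4, base_delay=0.1, is_transient=lambda exc: True):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise immediately for non-transient errors or once attempts run out.
            if not is_transient(exc) or attempt == attempts - 1:
                raise
            # Full-jitter backoff: random sleep up to base_delay * 2 ** attempt.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

In production the predicate would presumably key on `google.api_core.exceptions.ServiceUnavailable`, which is what the client raises for gRPC status 14.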
Environment details
- OS type and version: Debian 12.5
- Python version: 3.10.14
- pip version: 24.0
- google-cloud-spanner version: 3.46.0
Steps to reproduce
- Run a query enough times for this transient error to occur
Code example
# init code
from google.cloud.spanner import Client
from google.cloud.spanner_v1.pool import PingingPool

client = Client("project name")
instance = client.instance("instance name")
pool = PingingPool(
    size=20,
    default_timeout=10,
    ping_interval=300,
)
self.db = instance.database(db, pool=pool)
SpannerDB.background_pool_pinging(pool)

# query execution code
query = "SELECT <> FROM <table>"
with self.db.snapshot() as snapshot:
    res = snapshot.execute_sql(query)

# background pinging pool code
def background_pool_pinging(pool):
    import threading
    import time

    def target():
        while True:
            pool.ping()
            time.sleep(10)

    background = threading.Thread(target=target, name='spanner-ping-pool')
    background.daemon = True
    background.start()
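As a side note on the pinging thread above: an `Event`-based loop makes the interval interruptible and the thread stoppable, which helps in tests. A stdlib-only sketch, where `pool` stands in for the real `PingingPool`:

```python
import threading

def start_pool_pinging(pool, interval=10.0):
    """Ping the pool on a fixed interval until the returned event is set.

    Returning the stop event lets callers (and tests) shut the loop down cleanly.
    """
    stop_event = threading.Event()

    def target():
        # Event.wait() doubles as an interruptible sleep between pings;
        # it returns True (ending the loop) as soon as stop_event is set.
        while not stop_event.wait(interval):
            pool.ping()

    thread = threading.Thread(target=target, name='spanner-ping-pool', daemon=True)
    thread.start()
    return stop_event
```

Unlike the original loop, this version waits before the first ping, and `time.sleep(10)` is replaced by a wait that a shutdown can interrupt immediately.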
Stack trace
(internal function names/files censored)
_MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Socket closed", grpc_status:14, created_time:"2024-06-11T07:34:07.272354902+00:00"}"
>
File "/opt/venv/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 170, in error_remapped_callable
return _StreamingResponseIterator(
File "/opt/venv/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 92, in __init__
self._stored_first_result = next(self._wrapped)
File "grpc/_channel.py", line 541, in __next__
return self._next()
File "grpc/_channel.py", line 967, in _next
raise self
ServiceUnavailable: Socket closed
File "starlette/applications.py", line 124, in __call__
await self.middleware_stack(scope, receive, send)
File "starlette/middleware/errors.py", line 184, in __call__
raise exc
File "starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "starlette/middleware/base.py", line 72, in __call__
response = await self.dispatch_func(request, call_next)
File "starlette/middleware/base.py", line 46, in call_next
raise app_exc
File "starlette/middleware/base.py", line 36, in coro
await self.app(scope, request.receive, send_stream.send)
File "/opt/venv/lib/python3.10/site-packages/opentelemetry/instrumentation/asgi/__init__.py", line 581, in __call__
await self.app(scope, otel_receive, otel_send)
File "starlette/middleware/base.py", line 72, in __call__
response = await self.dispatch_func(request, call_next)
File "********", line 149, in dispatch
response = await call_next(request)
File "starlette/middleware/base.py", line 46, in call_next
raise app_exc
File "starlette/middleware/base.py", line 36, in coro
await self.app(scope, request.receive, send_stream.send)
File "starlette/middleware/exceptions.py", line 75, in __call__
raise exc
File "starlette/middleware/exceptions.py", line 64, in __call__
await self.app(scope, receive, sender)
File "fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "starlette/routing.py", line 680, in __call__
await route.handle(scope, receive, send)
File "starlette/routing.py", line 275, in handle
await self.app(scope, receive, send)
File "starlette/routing.py", line 65, in app
response = await func(request)
File "********", line 35, in custom_route_handler
response = await original_route_handler(request)
File "fastapi/routing.py", line 231, in app
raw_response = await run_endpoint_function(
File "fastapi/routing.py", line 162, in run_endpoint_function
return await run_in_threadpool(dependant.call, **values)
File "starlette/concurrency.py", line 41, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "********", line 70, in ********
return ********(
File "********", line 125, in ********
******** = ********(
File "********", line 519, in get_rocket_warehouse_legs
spanner_ctx.spanner_conn.execute_query(
File "********", line 122, in execute_query
return self.execute_sql(query, max_staleness_seconds, **new_kwargs)
File "********", line 118, in execute_sql
return SpannerProxy(res)
File "********", line 21, in __new__
first = next(it)
File "/opt/venv/lib/python3.10/site-packages/google/cloud/spanner_v1/streamed.py", line 145, in __iter__
self._consume_next()
File "/opt/venv/lib/python3.10/site-packages/google/cloud/spanner_v1/streamed.py", line 117, in _consume_next
response = next(self._response_iterator)
File "/opt/venv/lib/python3.10/site-packages/google/cloud/spanner_v1/snapshot.py", line 88, in _restart_on_unavailable
iterator = method(request=request)
File "/opt/venv/lib/python3.10/site-packages/google/cloud/spanner_v1/services/spanner/client.py", line 1444, in execute_streaming_sql
response = rpc(
File "/opt/venv/lib/python3.10/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
return wrapped_func(*args, **kwargs)
File "/opt/venv/lib/python3.10/site-packages/google/api_core/timeout.py", line 120, in func_with_timeout
return func(*args, **kwargs)
File "/opt/venv/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 174, in error_remapped_callable
raise exceptions.from_grpc_error(exc) from exc
Hi @MostafaOmar98, thanks for reporting this issue! Does this error only occur over the gRPC transport? If not, can you share what error you get when using the REST transport? You can set the transport in the following way:
client = Client(..., transport="rest")
Hey @ohmayr, thanks for your reply. I don't think the transport is publicly configurable; am I misunderstanding something?
I don't see transport as a constructor field on the Client class, and there is a comment that explicitly says the Cloud Spanner API requires the gRPC transport.
I can see that it is configurable on the internal SpannerClient class, but that one is instantiated by the Database class, and it is not configurable there either.
@MostafaOmar98
Can you please refer to the internal bug and share the information that has been requested there?