Requests after rpc_client reconnect fail with gate_closed_exception
nacl opened this issue · 4 comments
Describe the bug
After a connection times out and rpc_client::reconnect()
is called (as suggested by the logged exception), all future requests immediately fail, with the future containing a seastar::gate_closed_exception
.
This seems to be caused by rpc_client
's gate being closed when rpc_client::stop()
is called. Unfortunately, as of writing, seastar::gate
s cannot be reopened once closed, so I fixed this locally by wrapping the rpc_client
's dispatch_gate_
in a std::unique_ptr
, and resetting it whenever connect()
is invoked. There may be a better way to solve this problem.
To Reproduce
Steps to reproduce the behavior:
- Start an RPC server.
- Create an RPC client with options setting a connection timeout to 1s.
- Run the RPC client, and have the local shard sleep for 2s. After 1s, SMF will log something like this:
[shard 0] smf - rpc_client.cc:242] Handling exception: seastar::timed_out_error (timedout)
This line number may not match what's in the below-mentioned git hash.
If you were to make another request at this point, the error message suggests that I should reconnect()
.
- Invoke the RPC client's
reconnect()
method, and in its continuation, issue an RPC to the server. The RPC will fail. If the exception is logged, it will be aseastar::gate_closed_exception
.
Expected behavior
The RPC in step 4 should reach the server and (server willing) succeed..
Environment Info
OS: Ubuntu 16.04
Git hash: c7e8c58
Seastar git hash: f0c0855
Compiler: gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~16.04~ppa1)
Excellent report!!!
Will put a fix.
@nacl - not sure how you didn't run into the separate bugs when this fixed was put in place.
See #328 for what I think is a proper fix. Basically this would trigger a case when the server didn't properly gate the cleanup future. So there was a race to call the ~rpc_server_connection().
/cc @noahdesu
@nacl this is the output too
➜ debug git:(bugfix/client_reconnect) ./bin/smf_rpc_reconnect_with_timeout_integration_test -c 1 -m 2G
WARNING: debug mode. Not for benchmarking or production
WARN 2019-03-05 16:01:19,579 seastar - Seastar compiled with default allocator, heap profiler not supported
WARN 2019-03-05 16:01:19,597 [shard 0] seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
INFO 2019-03-05 16:01:19,603 [shard 0] smf - rpc_server.cc:73] Starging server:rpc_server{args.ip=127.0.0.1, args.flags=1, args.rpc_port=29973, args.http_port=54447, rpc_routes=rpc_handle_router{rpc_service{name=SmfStorage,handles=rpc_service_method_handle{name=Get}}}, limits=rpc_connection_limits{max_mem:2.000 GB, max_body_parsing_duration: 60000ms, res_avail:2.000 GB (2147483648)}, incoming_filters=0, outgoing_filters=0}
INFO 2019-03-05 16:01:19,603 [shard 0] smf - rpc_server.cc:94] Starting rpc server
INFO 2019-03-05 16:01:19,604 [shard 0] smf - main.cc:65] Iteration: 0
INFO 2019-03-05 16:01:19,624 [shard 0] smf - rpc_client.cc:246] Handling exception: seastar::timed_out_error (timedout)
INFO 2019-03-05 16:01:19,624 [shard 0] smf - D rpc_client.cc:181] Disabling connection to: 127.0.0.1:29973
INFO 2019-03-05 16:01:19,624 [shard 0] smf - rpc_client.cc:194] Setting exceptional state for session: 1
INFO 2019-03-05 16:01:19,625 [shard 0] smf - main.cc:65] Iteration: 1
INFO 2019-03-05 16:01:19,627 [shard 0] smf - main.cc:77] Success. Reconnected and method returned. Server Iteration: 1
INFO 2019-03-05 16:01:19,627 [shard 0] smf - main.cc:65] Iteration: 2
INFO 2019-03-05 16:01:19,628 [shard 0] smf - main.cc:77] Success. Reconnected and method returned. Server Iteration: 1
INFO 2019-03-05 16:01:19,628 [shard 0] smf - D rpc_client.cc:181] Disabling connection to: 127.0.0.1:29973
INFO 2019-03-05 16:01:19,629 [shard 0] smf - rpc_server.cc:135] Stopped seastar::accept() calls
INFO 2019-03-05 16:01:19,629 [shard 0] smf - rpc_server.cc:123] Shutting down server with expected exit codes
INFO 2019-03-05 16:01:19,707 [shard 0] smf - D rpc_server.cc:261] Cannot send respond. client connection '1' is invalid. Skipping reply from server
ERROR 2019-03-05 16:01:19,708 [shard 0] smf - rpc_server.cc:285] There was an error with the connection: Error parsing connection header
INFO 2019-03-05 16:01:19,708 [shard 0] smf - rpc_server.cc:288] Closing connection for client: 127.0.0.1:51248
Meant to reply to this earlier:
@nacl - not sure how you didn't run into the separate bugs when this fixed was put in place.
See #328 for what I think is a proper fix. Basically this would trigger a case when the server didn't properly gate the cleanup future. So there was a race to call the ~rpc_server_connection().
Yeah, I unfortunately didn't really have time to dig too deeply into this.
Thanks for investigating and fixing this promptly. The change has been working well so far.