Client Session/Tcptransport clones of session_state become out of sync if the server restarts
Opened this issue · 1 comments
When a client is connected to a peer, if the peer disconnects, the reconnect logic calls 'reconnect_and_activate', which calls 'reset()', which creates a new session_state, but the session.transport is not updated with the new session state, leading to inconsistencies.
I have a patch for this here. By adding a Tcptransport::reset() that updates the session_state (and the dependent message_queue and connection_state.
The bug can be demonstrated by commenting out session.rs:223. Running simple-client against simple-server, with line 223 commented, when restarting the server, the client will panic:
Created a subscription with id = 1
Data change from server:
Item "ns=2;s=v4", Value = Double(-0.893559520249049)
Item "ns=2;s=v1", Value = Int32(28)
Item "ns=2;s=v2", Value = Boolean(true)
Item "ns=2;s=v3", Value = String(UAString { value: Some("Hello World times 1") })
Data change from server:
Item "ns=2;s=v2", Value = Boolean(false)
Item "ns=2;s=v4", Value = Double(0.21200710992205463)
Item "ns=2;s=v3", Value = String(UAString { value: Some("Hello World times 2") })
Item "ns=2;s=v1", Value = Int32(35)
Data change from server:
Item "ns=2;s=v2", Value = Boolean(true)
Item "ns=2;s=v1", Value = Int32(42)
Item "ns=2;s=v4", Value = Double(0.9950138797069145)
Item "ns=2;s=v3", Value = String(UAString { value: Some("Hello World times 3") })
thread 'main' panicked at 'assertion failed: self.session_state.read().id() == self.transport.session_id()', lib/src/client/session/session.rs:225:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
While this patch seems to work, it exposes a large issue. As the reconnect logic progresses, it eventually deadlocks on session.rs:1442 : SessionService::create_session::spawn_session_activity_task, which tries to access the runtime, but cannot because the TcpTransport::connect has spawned a thread that locks the runtime during the life of the connection (spawn_looping_tasks). For that reason I currently have the transport reset disabled so that the client can be restarted externally.
If there are suggestions on how to work around this I'm happy to test.
possibly related to #189