Failover does not properly work on read operations.

Question

dpasek-senacor opened this issue 3 years ago · 0 comments

This is a regression caused by the implementation of the reactive stream support for the stream read operations. (Shame on me. :-) )

Situation

EventStoreDBClient performs a read stream operation and the underlying GrpcClient receives:

a StatusRuntimeException with the code UNAVAILABLE or ABORTED
a NotLeaderException

Expected behavior:

The existing gRPC connection (aka the ManagedChannel) is dropped and a new connection to a node is created. In the case of a NotLeaderException, the new connection should be to the node specified in the exception.
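
The expected behavior can be sketched as a small decision function. This is only an illustration, not the client's actual code: `Code`, `StatusFailure`, `NotLeaderFailure`, `Decision` and `decide` are hypothetical stand-ins for the gRPC status codes, the StatusRuntimeException, the NotLeaderException and the reconnect logic inside the GrpcClient.

```java
import java.util.Optional;

public class ReconnectDecision {

    /** Hypothetical stand-in for the gRPC status codes mentioned in this issue. */
    enum Code { UNAVAILABLE, ABORTED, NOT_FOUND }

    /** Hypothetical stand-in for a gRPC status failure carrying a code. */
    static final class StatusFailure extends RuntimeException {
        final Code code;

        StatusFailure(Code code) {
            super(code.name());
            this.code = code;
        }
    }

    /** Hypothetical stand-in for the client's NotLeaderException. */
    static final class NotLeaderFailure extends RuntimeException {
        final String leaderEndpoint;

        NotLeaderFailure(String leaderEndpoint) {
            super("not leader");
            this.leaderEndpoint = leaderEndpoint;
        }
    }

    /** What the client should do after a failed operation. */
    static final class Decision {
        final boolean reconnect;
        final Optional<String> target; // empty: connect to any node

        Decision(boolean reconnect, Optional<String> target) {
            this.reconnect = reconnect;
            this.target = target;
        }
    }

    static Decision decide(Throwable error) {
        if (error instanceof NotLeaderFailure) {
            // Reconnect to the node named in the exception.
            String leader = ((NotLeaderFailure) error).leaderEndpoint;
            return new Decision(true, Optional.of(leader));
        }
        if (error instanceof StatusFailure) {
            Code code = ((StatusFailure) error).code;
            if (code == Code.UNAVAILABLE || code == Code.ABORTED) {
                // Drop the channel and connect to some node.
                return new Decision(true, Optional.empty());
            }
        }
        // Any other error does not trigger a reconnect in this sketch.
        return new Decision(false, Optional.empty());
    }

    public static void main(String[] args) {
        Decision d = decide(new NotLeaderFailure("node-2:2113"));
        System.out.println(d.reconnect + " " + d.target.orElse("<any node>"));
    }
}
```

None of this can run for read operations unless the error actually reaches the component that makes this decision, which is the problem described in this issue.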

Current behavior:

The exceptions are not received by the GrpcClient and therefore the reconnect is not triggered properly.

Impact

After a leader change in a cluster, the client will try to reconnect to the old node and will not switch to the new leader, causing operations to fail.

Root cause

Inside the implementation of AbstractRead, the error signals of the gRPC connection are only forwarded to the reactive subscriber but not to the GrpcClient, since the result of AbstractRead is always a successfully completed CompletableFuture containing the subscription.
It is necessary both to forward the error signals to the reactive subscriber and to provide a CompletableFuture through which the GrpcClient receives the error signals.
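
A minimal sketch of that idea, not the actual fix: the observer below passes every error to the subscriber and also completes a separate CompletableFuture exceptionally, so a client holding that future can react. `ReadObserver`, `completion()` and `simulateFailedRead` are hypothetical names, and the JDK's `java.util.concurrent.Flow` interfaces are used in place of the Reactive Streams types to keep the example dependency-free.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Flow;
import java.util.concurrent.atomic.AtomicReference;

public class ReadErrorForwarding {

    /** Hypothetical stand-in for the gRPC response observer of a read operation. */
    static final class ReadObserver<T> {
        private final Flow.Subscriber<? super T> subscriber;
        // Completed exceptionally on a stream error so the client can react to it.
        private final CompletableFuture<Void> completion = new CompletableFuture<>();

        ReadObserver(Flow.Subscriber<? super T> subscriber) {
            this.subscriber = subscriber;
        }

        CompletableFuture<Void> completion() {
            return completion;
        }

        void onNext(T item) {
            subscriber.onNext(item);
        }

        void onError(Throwable error) {
            subscriber.onError(error);               // signal for the reactive subscriber
            completion.completeExceptionally(error); // signal for the client
        }

        void onCompleted() {
            subscriber.onComplete();
            completion.complete(null);
        }
    }

    /** Simulates a failing read and returns the error observed through the future. */
    static Throwable simulateFailedRead(Throwable failure,
                                        AtomicReference<Throwable> seenBySubscriber) {
        ReadObserver<String> observer = new ReadObserver<>(new Flow.Subscriber<String>() {
            @Override public void onSubscribe(Flow.Subscription subscription) { }
            @Override public void onNext(String item) { }
            @Override public void onError(Throwable error) { seenBySubscriber.set(error); }
            @Override public void onComplete() { }
        });

        // The client side registers on the future instead of relying on the subscriber.
        AtomicReference<Throwable> seenByClient = new AtomicReference<>();
        observer.completion().whenComplete((ignored, error) -> seenByClient.set(error));

        observer.onError(failure);
        return seenByClient.get();
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> seenBySubscriber = new AtomicReference<>();
        Throwable seenByClient =
                simulateFailedRead(new RuntimeException("UNAVAILABLE"), seenBySubscriber);
        System.out.println("subscriber saw: " + seenBySubscriber.get().getMessage());
        System.out.println("client saw: " + seenByClient.getMessage());
    }
}
```

Keeping the future separate from the subscription leaves the reactive contract towards the subscriber unchanged while giving the client its own completion signal.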