EventStore/EventStoreDB-Client-Dotnet-Legacy

Unrecoverable NotAuthenticatedException during cluster upgrade

megakid opened this issue · 1 comment

Describe the bug
We cannot reproduce this reliably, but when upgrading our 3-node UAT clusters from V5 to V21 we noticed that some of our services, which we expected to reconnect automatically (as they do after a master failover), started spamming their logs heavily and consuming a lot of CPU.

It seems the client-side EventStoreConnection gets into a state where the connection is marked as not authenticated, although the credentials did not change during the cluster rollout. From this state the connection object is unrecoverable and needs recreating; we did this by restarting the service (everything works after a restart).

We have noticed this behaviour in more than one service and across a couple of our clusters. An educated guess is that roughly 10% of the ES clients connected to the clusters we upgraded suffered this issue, with the other 90% reconnecting perfectly and continuing to subscribe/read/append to streams.

To Reproduce
Steps to reproduce the behavior:

  1. Service running with persistent subscriptions
  2. Upgrade the 3-node cluster to V21 (as per the v5 -> v21 upgrade notice) by shutting down all nodes, then rolling out the v21 nodes + config (keep the credentials the same)
  3. See that most of the time the clients re-establish the connection, while in a minority of cases they get into a client-side auth state that prevents recovery.

Expected behavior
Clients to reconnect without auth issues

Actual behavior
As above.

Config/Logs/Screenshots
Stack traces are from a few common operations:

EventStore.ClientAPI.Exceptions.NotAuthenticatedException: Not Authenticated
   at async Task<WriteResult> EventStore.ClientAPI.Internal.EventStoreNodeConnection.AppendToStreamAsync(string stream, long expectedVersion, IEnumerable<EventData> events, UserCredentials userCredentials)
EventStore.ClientAPI.Exceptions.NotAuthenticatedException: Not Authenticated
   at async Task<EventStorePersistentSubscriptionBase> EventStore.ClientAPI.EventStorePersistentSubscriptionBase.Start()

EventStore details

  • EventStore server version:
    21.10
  • Operating system:
    Windows
  • EventStore client version (if applicable):
    21.2.0

We think this is likely because we haven't set the RetryAuthenticationOnTimeout flag. However, I do think that if DefaultUserCredentials are set, the connection state should not be allowed to proceed to ConnectingPhase.Identification unless ConnectingPhase.Authentication completes successfully.
Without that check, transient errors (e.g. a timeout) that aren't surfaced to user code, except via the AuthenticationFailed event, are silently ignored and cause unexpected, unrecoverable behaviour for the lifetime of the EventStore client object. The addition of RetryAuthenticationOnTimeout seems to mitigate one failure mode but, if I understand the current code correctly, the client still continues to connect when the server responds with NotAuthenticated.
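To make the suggestion concrete, here is a minimal sketch of the gating rule described above. It is a standalone model, not the client's actual code: `AuthOutcome`, `NextStep`, `AuthenticationGate`, and the `attempt`/`maxAttempts` parameters are hypothetical names invented for illustration. Failing the connection on NotAuthenticated is also an assumption about what "do not proceed" should mean in practice.

```csharp
using System;

// Hypothetical model of the proposed rule; none of these types exist in EventStore.ClientAPI.
public enum AuthOutcome { Succeeded, TimedOut, NotAuthenticated }

public enum NextStep { ProceedToIdentification, RetryAuthentication, FailConnection }

public static class AuthenticationGate
{
    // Decides what the connection should do once an authentication attempt has finished.
    // attempt is the 1-based number of the attempt that just finished.
    public static NextStep Decide(bool hasDefaultCredentials, AuthOutcome outcome, int attempt, int maxAttempts)
    {
        // Without default credentials there is nothing to authenticate, so identification may start.
        if (!hasDefaultCredentials)
            return NextStep.ProceedToIdentification;

        return outcome switch
        {
            // Only a successful authentication unlocks the identification phase.
            AuthOutcome.Succeeded => NextStep.ProceedToIdentification,

            // A timeout is treated as transient: retry while attempts remain, then give up.
            AuthOutcome.TimedOut => attempt < maxAttempts
                ? NextStep.RetryAuthentication
                : NextStep.FailConnection,

            // An explicit rejection is surfaced as a failure instead of being ignored (assumption).
            AuthOutcome.NotAuthenticated => NextStep.FailConnection,

            _ => throw new ArgumentOutOfRangeException(nameof(outcome)),
        };
    }
}
```

Under this model, `AuthenticationGate.Decide(true, AuthOutcome.NotAuthenticated, 1, 3)` yields `NextStep.FailConnection`. The connection therefore never reaches identification in an unauthenticated state, and the failure becomes visible to user code instead of persisting for the lifetime of the connection object.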