madelson/DistributedLock

Losing connection to SQL Server database does not always cause the lock to be lost

OskarKlintrot opened this issue · 11 comments

If I run my code (simplified below) as two different instances locally (dotnet run or VS for the first and dotnet run --no-build for the second) and restart my db a couple of times, the odds are quite high that eventually an instance will just keep going in the while-loop without holding the lock. It can take as few as 3-4 restarts. My understanding was that accessing handle.HandleLostToken.IsCancellationRequested would verify that the lock still exists? Or does that not work together with CancellationTokenSource.CreateLinkedTokenSource? I just tried without CancellationTokenSource.CreateLinkedTokenSource and got the same result. Not sure what is going on...

using Medallion.Threading;

namespace Sample;

public class TestBackgroundService : BackgroundService
{
    private readonly IServiceScopeFactory _serviceScopeFactory;
    private readonly ILogger<TestBackgroundService> _logger;

    public TestBackgroundService(IServiceScopeFactory serviceScopeFactory, ILogger<TestBackgroundService> logger)
    {
        _serviceScopeFactory = serviceScopeFactory;
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                await using var scope = _serviceScopeFactory.CreateAsyncScope();

                var distributedLockProvider = scope.ServiceProvider.GetRequiredService<IDistributedLockProvider>();

                _logger.LogInformation("Will try to acquire lock.");

                await using var handle = await distributedLockProvider.AcquireLockAsync(
                    "a1416b2940b34bbb9189caaa13f11b1a",
                    cancellationToken: stoppingToken
                );

                _logger.LogInformation("Acquired lock.");

                handle.HandleLostToken.Register(
                    () => _logger.LogError("Lost lock for job {Job Name}.", nameof(TestBackgroundService))
                );

                var stoppingCts = CancellationTokenSource.CreateLinkedTokenSource(
                    stoppingToken,
                    handle.HandleLostToken
                );

                if (stoppingCts.Token.IsCancellationRequested)
                {
                    return;
                }

                while (!stoppingCts.IsCancellationRequested) // This evaluates to true sometimes even if the database has been restarted
                {
                    _logger.LogInformation("Doing stuff.");

                    try
                    {
                        await Task.Delay(TimeSpan.FromSeconds(30), stoppingCts.Token);
                    }
                    catch (TaskCanceledException)
                    { }
                }
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Something went wrong.");
            }
        }
    }
}

Update: I can replicate it with just one instance (even if the log gets a bit trickier to follow):

using Medallion.Threading;

namespace Samples;

public sealed class Job1 : JobBase
{
    public Job1(IServiceScopeFactory serviceScopeFactory, ILogger<Job1> logger) : base(serviceScopeFactory, logger) { }
}

public sealed class Job2 : JobBase
{
    public Job2(IServiceScopeFactory serviceScopeFactory, ILogger<Job2> logger) : base(serviceScopeFactory, logger) { }
}

public abstract class JobBase : BackgroundService
{
    private readonly IServiceScopeFactory _serviceScopeFactory;
    private readonly ILogger _logger;

    public JobBase(IServiceScopeFactory serviceScopeFactory, ILogger logger)
    {
        _serviceScopeFactory = serviceScopeFactory;
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                await using var scope = _serviceScopeFactory.CreateAsyncScope();

                var distributedLockProvider = scope.ServiceProvider.GetRequiredService<IDistributedLockProvider>();

                _logger.LogInformation("Will try to acquire lock.");

                await using var handle = await distributedLockProvider.AcquireLockAsync(
                    "a1416b2940b34bbb9189caaa13f11b1a",
                    cancellationToken: stoppingToken
                );

                _logger.LogInformation(
                    "Acquired {CancelableDescription} lock.",
                    handle.HandleLostToken.CanBeCanceled ? "cancelable" : "uncancelable"
                );

                await using var _ = handle.HandleLostToken.Register(() => _logger.LogError("Lost lock."));

                using var stoppingCts = CancellationTokenSource.CreateLinkedTokenSource(
                    stoppingToken,
                    handle.HandleLostToken
                );

                if (stoppingCts.Token.IsCancellationRequested)
                {
                    return;
                }

                while (!stoppingCts.IsCancellationRequested) // This evaluates to true sometimes even if the database has been restarted
                {
                    _logger.LogInformation("Doing stuff.");

                    try
                    {
                        await Task.Delay(TimeSpan.FromSeconds(30), stoppingCts.Token);
                    }
                    catch (TaskCanceledException) { }

                    if (!stoppingToken.IsCancellationRequested)
                    {
                        _logger.LogInformation("Cancellation is not requested.");
                    }
                }
            }
            catch (Exception exception)
            {
                _logger.LogError("Exception {Exception} thrown.", exception.GetType());
            }
        }
    }
}

@OskarKlintrot thanks for your interest in the library. A few things to look into:

  1. The HandleLostToken is supposed to become canceled when it detects loss of connection with the database. This isn't always detected instantaneously. When you say that sometimes you aren't seeing the cancellation signal, how long are you waiting to see if the signal fires?
  2. Just to validate, can you confirm that if you log handle.HandleLostToken.CanBeCanceled in the problem scenario it gives true?
  3. Can you post the code for how you are constructing/registering your IDistributedLockProvider?
  4. Just a note for the production code (may not be relevant): you should have usings for the call to handle.HandleLostToken.Register and the var stoppingCts since these are both disposable.
  1. In the while-loop I have a Task.Delay that waits 30 sec, and since I pass the cancellation token to it, I would assume that at some point during those 30 sec it would cancel, but it just loops back to the beginning of the loop. It seems to take 3-5 restarts before this happens.

  2. Yep, it logs as true.

  3. The provider is registered like this:

.AddSingleton<IDistributedLockProvider>(provider =>
    new SqlDistributedSynchronizationProvider(provider.GetRequiredService<IConfiguration>().GetConnectionString(connectionStringName))
)

  4. Oh, good catch!

I'll update my original post with your feedback.

Here's the log after 5-ish restarts:

02:17:52 info: Microsoft.Hosting.Lifetime[0] Application started. Press Ctrl+C to shut down.
02:17:52 info: Microsoft.Hosting.Lifetime[0] Hosting environment: Development
02:17:52 info: Microsoft.Hosting.Lifetime[0] Content root path: --------------------
02:17:53 info: Samples.Job1[0] Acquired cancelable lock.
02:17:53 info: Samples.Job1[0] Doing stuff.
02:18:23 info: Samples.Job1[0] Cancellation is not requested.
02:18:23 info: Samples.Job1[0] Doing stuff.
02:18:23 info: Samples.Job1[0] Cancellation is not requested.
02:18:23 fail: Samples.Job2[0] Exception Microsoft.Data.SqlClient.SqlException thrown.
02:18:23 fail: Samples.Job1[0] Exception System.InvalidOperationException thrown.
02:18:23 info: Samples.Job1[0] Will try to acquire lock.
02:18:23 info: Samples.Job2[0] Will try to acquire lock.
02:18:31 info: Samples.Job2[0] Acquired cancelable lock.
02:18:31 info: Samples.Job2[0] Doing stuff.
02:18:41 info: Samples.Job2[0] Cancellation is not requested.
02:18:41 fail: Samples.Job1[0] Exception Microsoft.Data.SqlClient.SqlException thrown.
02:18:42 info: Samples.Job1[0] Will try to acquire lock.
02:18:42 fail: Samples.Job2[0] Exception System.InvalidOperationException thrown.
02:18:42 info: Samples.Job2[0] Will try to acquire lock.
02:18:50 info: Samples.Job2[0] Acquired cancelable lock.
02:18:50 info: Samples.Job2[0] Doing stuff.
02:18:59 info: Samples.Job2[0] Cancellation is not requested.
02:18:59 fail: Samples.Job1[0] Exception Microsoft.Data.SqlClient.SqlException thrown.
02:18:59 info: Samples.Job1[0] Will try to acquire lock.
02:18:59 fail: Samples.Job2[0] Exception System.InvalidOperationException thrown.
02:18:59 info: Samples.Job2[0] Will try to acquire lock.
02:19:07 info: Samples.Job2[0] Acquired cancelable lock.
02:19:07 info: Samples.Job2[0] Doing stuff.
02:19:18 info: Samples.Job2[0] Cancellation is not requested.
02:19:18 fail: Samples.Job1[0] Exception Microsoft.Data.SqlClient.SqlException thrown.
02:19:18 info: Samples.Job1[0] Will try to acquire lock.
02:19:18 fail: Samples.Job2[0] Exception System.InvalidOperationException thrown.
02:19:18 info: Samples.Job2[0] Will try to acquire lock.
02:19:26 info: Samples.Job1[0] Acquired cancelable lock.
02:19:26 info: Samples.Job1[0] Doing stuff.
02:19:39 fail: Samples.Job2[0] Exception Microsoft.Data.SqlClient.SqlException thrown.
02:19:39 info: Samples.Job2[0] Will try to acquire lock.
02:19:48 info: Samples.Job2[0] Acquired cancelable lock.
02:19:48 info: Samples.Job2[0] Doing stuff.
02:19:56 info: Samples.Job1[0] Cancellation is not requested.
02:19:56 info: Samples.Job1[0] Doing stuff.
02:20:18 info: Samples.Job2[0] Cancellation is not requested.
02:20:18 info: Samples.Job2[0] Doing stuff.

Thanks for the updates @OskarKlintrot !

Just a few more questions:

  1. Can you drop a similar log output snapshot for a restart where everything goes right? I assume you see Lost lock in that case or no?
  2. I'd love to see the full ToString() of the exceptions thrown in the log above (SQL and InvalidOperation). Can you include?
  3. Can you confirm the version of DistributedLock.SqlServer you're using?
  4. Can you confirm the version of Microsoft.Data.SqlClient that you're using?
  5. Can you confirm the version of SQL Server that you're using?
  6. Can you confirm the version of .NET you're targeting (e.g. net6.0)?
  7. Can you test to see whether disabling multiplexing in the lock provider options fixes the issue?
  1. I usually see Lost lock, not sure why but it disappeared when I added the using. But the exceptions are the right ones:

(screenshot of the logged exceptions)

SqlException is when it loses the lock and InvalidOperationException is when the connection closes when it tries to acquire the lock.

03:00:18 fail: Sample.App.Job2[0] Exception Microsoft.Data.SqlClient.SqlException thrown:
Microsoft.Data.SqlClient.SqlException (0x80131904): Cannot continue the execution because the session is in the kill state. A severe error occurred on the current command.  The results, if any, should be discarded.
    at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
    at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
    at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
    at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
    at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted)
    at Microsoft.Data.SqlClient.SqlCommand.CompleteAsyncExecuteReader(Boolean isInternal, Boolean forDescribeParameterEncryption)
    at Microsoft.Data.SqlClient.SqlCommand.InternalEndExecuteNonQuery(IAsyncResult asyncResult, Boolean isInternal, String endMethod)
    at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
    at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryAsync(IAsyncResult asyncResult)
    at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization) --- End of stack trace from previous location ---
    at Medallion.Threading.Internal.Data.DatabaseCommand.InternalExecuteAndPropagateCancellationAsync[TState,TResult](TState state, Func`3 executeAsync, CancellationToken cancellationToken, Boolean isConnectionMonitoringQuery) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\DatabaseCommand.cs:line 165
    at Medallion.Threading.Internal.Data.DatabaseCommand.ExecuteAsync[TResult](Func`3 executeAsync, Func`2 executeSync, CancellationToken cancellationToken, Boolean disallowAsyncCancellation, Boolean isConnectionMonitoringQuery) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\DatabaseCommand.cs:line 92
    at Medallion.Threading.SqlServer.SqlApplicationLock.ExecuteAcquireCommandAsync(DatabaseConnection connection, String lockName, TimeoutValue timeout, CancellationToken cancellationToken) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.SqlServer\SqlApplicationLock.cs:line 64
    at Medallion.Threading.SqlServer.SqlApplicationLock.Medallion.Threading.Internal.Data.IDbSynchronizationStrategy<System.Object>.TryAcquireAsync(DatabaseConnection connection, String resourceName, TimeoutValue timeout, CancellationToken cancellationToken) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.SqlServer\SqlApplicationLock.cs:line 47
    at Medallion.Threading.Internal.Data.MultiplexedConnectionLock.TryAcquireAsync[TLockCookie](String name, TimeoutValue timeout, IDbSynchronizationStrategy`1 strategy, TimeoutValue keepaliveCadence, CancellationToken cancellationToken, Boolean opportunistic)
    at Medallion.Threading.Internal.Data.MultiplexedConnectionLock.TryAcquireAsync[TLockCookie](String name, TimeoutValue timeout, IDbSynchronizationStrategy`1 strategy, TimeoutValue keepaliveCadence, CancellationToken cancellationToken, Boolean opportunistic) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\MultiplexedConnectionLock.cs:line 94
    at Medallion.Threading.Internal.Data.MultiplexedConnectionLockPool.TryAcquireAsync[TLockCookie](String connectionString, String name, TimeoutValue timeout, IDbSynchronizationStrategy`1 strategy, TimeoutValue keepaliveCadence, CancellationToken cancellationToken) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\MultiplexedConnectionLockPool.cs:line 91
    at Medallion.Threading.Internal.Data.MultiplexedConnectionLockPool.TryAcquireAsync[TLockCookie](String connectionString, String name, TimeoutValue timeout, IDbSynchronizationStrategy`1 strategy, TimeoutValue keepaliveCadence, CancellationToken cancellationToken) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\MultiplexedConnectionLockPool.cs:line 97
    at Medallion.Threading.Internal.DistributedLockHelpers.Wrap[THandle](ValueTask`1 handleTask, Func`2 factory) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\DistributedLockHelpers.cs:line 41
    at Medallion.Threading.Internal.DistributedLockHelpers.ThrowTimeoutIfNull[T](ValueTask`1 task, String object) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\DistributedLockHelpers.cs:line 142
    at Medallion.Threading.Internal.Helpers.Convert[TDerived,TBase](ValueTask`1 task, ValueTaskConversion _) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Helpers.cs:line 24
    at Sample.App.JobBase.ExecuteAsync(CancellationToken stoppingToken) in C:\Users\OskarKlintrot\source\repos\Sample\Sample.App\Workers.cs:line 41
    at Sample.App.JobBase.ExecuteAsync(CancellationToken stoppingToken) in C:\Users\OskarKlintrot\source\repos\Sample\Sample.App\Workers.cs:line 65 ClientConnectionId:7b765ec6-eaef-4a74-b8ee-5214c20ed456 Error Number:596,State:1,Class:21
03:00:18 fail: Sample.App.Job1[0] Exception System.InvalidOperationException thrown:
System.InvalidOperationException: BeginExecuteNonQuery requires an open and available Connection. The connection's current state is closed.
    at Microsoft.Data.SqlClient.SqlCommand.ValidateCommand(Boolean isAsync, String method)
    at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource`1 completion, Boolean sendToPipe, Int32 timeout, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String methodName)
    at Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryInternal(CommandBehavior behavior, AsyncCallback callback, Object stateObject, Int32 timeout, Boolean inRetry, Boolean asyncWrite)
    at Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryAsync(AsyncCallback callback, Object stateObject)
    at System.Threading.Tasks.TaskFactory`1.FromAsyncImpl(Func`3 beginMethod, Func`2 endFunction, Action`1 endAction, Object state, TaskCreationOptions creationOptions)
    at System.Threading.Tasks.TaskFactory`1.FromAsync(Func`3 beginMethod, Func`2 endMethod, Object state)
    at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQueryAsync(CancellationToken cancellationToken) --- End of stack trace from previous location ---
    at Medallion.Threading.Internal.Data.DatabaseCommand.ExecuteAsync[TResult](Func`3 executeAsync, Func`2 executeSync, CancellationToken cancellationToken, Boolean disallowAsyncCancellation, Boolean isConnectionMonitoringQuery) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\DatabaseCommand.cs:line 88
    at Medallion.Threading.SqlServer.SqlApplicationLock.ExecuteReleaseCommandAsync(DatabaseConnection connection, String lockName, Boolean isTry) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.SqlServer\SqlApplicationLock.cs:line 71
    at Medallion.Threading.Internal.Data.MultiplexedConnectionLock.ReleaseAsync[TLockCookie](IDbSynchronizationStrategy`1 strategy, String name, TLockCookie lockCookie) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\MultiplexedConnectionLock.cs:line 159
    at Medallion.Threading.Internal.Data.MultiplexedConnectionLock.ReleaseAsync[TLockCookie](IDbSynchronizationStrategy`1 strategy, String name, TLockCookie lockCookie) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\MultiplexedConnectionLock.cs:line 173
    at Sample.App.JobBase.ExecuteAsync(CancellationToken stoppingToken) in C:\Users\OskarKlintrot\source\repos\Sample\Sample.App\Workers.cs:line 65
    at Sample.App.JobBase.ExecuteAsync(CancellationToken stoppingToken) in C:\Users\OskarKlintrot\source\repos\Sample\Sample.App\Workers.cs:line 65
  3. 1.0.1
  4. 4.1.0
  5. @@VERSION:

Microsoft SQL Server 2019 (RTM) - 15.0.2000.5 (X64)
Sep 24 2019 13:48:23
Copyright (C) 2019 Microsoft Corporation
Developer Edition (64-bit) on Windows 10 Pro 10.0 (Build 22000: ) (Hypervisor)

  6. net6.0
  7. That seems to fix the issue. I restarted 10 times, and every single time both jobs threw exceptions, tried to acquire a new lock, and never just kept going. I stopped the app, changed .UseMultiplexing(false) to .UseMultiplexing(true), and then it only took 3 restarts until the issue came back (see the registration sketch below).
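
For reference, here is a minimal sketch of what toggling that option can look like in the provider registration (an illustration only, assuming the constructor overload that accepts an options delegate; "services" stands for the IServiceCollection):

// Hypothetical sketch: same registration as earlier in the thread, but with
// connection multiplexing disabled via the options builder.
services.AddSingleton<IDistributedLockProvider>(provider =>
    new SqlDistributedSynchronizationProvider(
        provider.GetRequiredService<IConfiguration>().GetConnectionString(connectionStringName),
        options => options.UseMultiplexing(false) // flip to true to re-enable multiplexing
    )
);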

I made a repo with a fully working sample: https://github.com/OskarKlintrot/DistributedLock-bug

Ok, using your example code I was able to repro locally, and I have a fix which seems to resolve the issue for me.

I've published a prerelease version with my fix. Can you install it locally and let me know whether or not this resolves the issue for you?

As for what I changed:

Basically in ConnectionMonitor.cs we were mistakenly unsubscribing from state change events when the connection was merely closed but not yet disposed. Then, when the connection would re-open we wouldn't be able to track closing any more. The fix I made was this:

                // If disposing, unsubscribe from state change tracking.
                if (isDispose // PREVIOUSLY MISSING
                    && this._stateChangedHandler != null
                    && this._weakConnection.TryGetTarget(out var connection))
                {
                    ((DbConnection)connection.InnerConnection).StateChange -= this._stateChangedHandler;
                }
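
To illustrate the failure mode in isolation (a toy sketch, not the library's actual ConnectionMonitor code): if the StateChange handler is removed whenever the connection merely transitions to Closed, the monitor no longer sees anything after the connection re-opens, so a later drop of the same connection goes unnoticed.

using System.Data;
using Microsoft.Data.SqlClient;

// Toy illustration only: a monitor that detaches its handler on Close (the old,
// buggy behavior) misses every state change after the connection re-opens.
var connection = new SqlConnection("<connection string>");

void OnStateChange(object sender, StateChangeEventArgs e)
{
    Console.WriteLine($"State changed: {e.OriginalState} -> {e.CurrentState}");

    // Old behavior (bug): unsubscribing as soon as the state becomes Closed.
    // if (e.CurrentState == ConnectionState.Closed) connection.StateChange -= OnStateChange;
}

connection.StateChange += OnStateChange;

// Fixed behavior: stay subscribed across close/re-open cycles and only
// unsubscribe when the connection is actually being disposed.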

I usually see Lost lock, not sure why but it disappeared when I added the using.

I ran into this too. The reason you don't see it is that when the cancellation triggers, it completes the Task.Delay task, which synchronously runs continuations until the next await. This causes us to exit the scope for the registration and clean it up before it gets a chance to fire. Kind of like if we'd written:

var cts = new CancellationTokenSource();
var registration = cts.Token.Register(() => Console.WriteLine("here!"));
cts.Token.Register(() => registration.Dispose());
cts.Cancel(); // prints nothing!

I still think you'll generally want usings with your registrations; just have to be careful depending on the particular pattern!
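
One way to keep the using and still get a log entry (just a sketch building on the code above, under the assumption that the loss only needs to be logged once the work loop exits) is to also check the token after the loop, since the callback itself may already have been disposed by the time it would run:

// Sketch: keep the using for cleanup, but don't rely solely on the callback;
// check HandleLostToken again once the work loop has exited.
using var registration = handle.HandleLostToken.Register(() => _logger.LogError("Lost lock."));
using var stoppingCts = CancellationTokenSource.CreateLinkedTokenSource(stoppingToken, handle.HandleLostToken);

while (!stoppingCts.IsCancellationRequested)
{
    _logger.LogInformation("Doing stuff.");

    try
    {
        await Task.Delay(TimeSpan.FromSeconds(30), stoppingCts.Token);
    }
    catch (OperationCanceledException) { }
}

if (handle.HandleLostToken.IsCancellationRequested)
{
    _logger.LogError("Lost lock (detected after the loop).");
}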

I installed the prerelease version and restarted the server 10 times (for ($num = 1 ; $num -le 10 ; $num++){ net stop MSSQLSERVER; net start MSSQLSERVER; Start-Sleep -Seconds 10; Write-Host "Restarted $num times" }) and everything seems to be working just as it should now!

As for what I changed:

Oh, that explains it! Nice catch!

I ran into this too.

Thanks for the explanation, which makes sense! What I did instead is basically this:

CancellationTokenRegistration? cancellationTokenRegistration = null;

try
{
    cancellationTokenRegistration = handle.HandleLostToken.Register(() => _logger.LogError("Lost lock."));

    // ... do the work guarded by the lock here ...
}
catch (Exception)
{ }
finally
{
    if (cancellationTokenRegistration.HasValue)
    {
        await cancellationTokenRegistration.Value.DisposeAsync();
    }
}

I just finished a little experiment. I raised the time before retrying to acquire the lock to 10 sec so the jobs would take turns picking up the lock. I also raised the time between SQL Server restarts to 60 sec. Then I restarted the SQL Server 100 times and everything still looks perfectly fine:

(screenshot of the log output showing the lost-lock count)

(I used a static field to count the lost locks, hence 100 times)

What's the plan forward btw? Should I use the pre-release package or will you release a stable version soon? Or will you bump DistributedLock.SqlServer?

Awesome thanks for the rigorous testing!

What's the plan forward btw?

I plan to release this as a stable patch fix to DL.Core and to all of the other SQL-based libraries (including DL.SqlServer). I may bundle in some other small fixes as well after reviewing the backlog. While I expect to have this out relatively soon (should have some time to prioritize this weekend), I'd suggest that you use the prerelease version in the meantime so that you are unblocked.

Awesome, thanks for a very quick fix! Looking forward to the stable patch fix! I'll use the prerelease until then.