Losing connection to the SQL Server database does not always cause the lock to be lost
OskarKlintrot opened this issue · 11 comments
If I run my code (simplified below) with two different instances locally (dotnet run or VS for the first and dotnet run --no-build for the second) and restart my db a couple of times, the odds are quite high that eventually an instance will just keep going without the lock (in the while-loop). It can take as few as 3-4 restarts. My understanding was that when you access handle.HandleLostToken.IsCancellationRequested it will make sure that the lock still exists? Or is that not working together with CancellationTokenSource.CreateLinkedTokenSource? I just tried without CancellationTokenSource.CreateLinkedTokenSource and get the same result. Not sure what is going on...
using Medallion.Threading;
namespace Sample;
public class TestBackgroundService : BackgroundService
{
private readonly IServiceScopeFactory _serviceScopeFactory;
private readonly ILogger<TestBackgroundService> _logger;
public TestBackgroundService(IServiceScopeFactory serviceScopeFactory, ILogger<TestBackgroundService> logger)
{
_serviceScopeFactory = serviceScopeFactory;
_logger = logger;
}
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
while (!stoppingToken.IsCancellationRequested)
{
try
{
await using var scope = _serviceScopeFactory.CreateAsyncScope();
var distributedLockProvider = scope.ServiceProvider.GetRequiredService<IDistributedLockProvider>();
_logger.LogInformation("Will try to acquire lock.");
await using var handle = await distributedLockProvider.AcquireLockAsync(
"a1416b2940b34bbb9189caaa13f11b1a",
cancellationToken: stoppingToken
);
_logger.LogInformation("Acquired lock.");
handle.HandleLostToken.Register(
() => _logger.LogError("Lost lock for job {Job Name}.", nameof(TestBackgroundService))
);
var stoppingCts = CancellationTokenSource.CreateLinkedTokenSource(
stoppingToken,
handle.HandleLostToken
);
if (stoppingCts.Token.IsCancellationRequested)
{
return;
}
while (!stoppingCts.IsCancellationRequested) // This evaluates to true sometimes even if the database has been restarted
{
_logger.LogInformation("Doing stuff.");
try
{
await Task.Delay(TimeSpan.FromSeconds(30), stoppingCts.Token);
}
catch (TaskCanceledException)
{ }
}
}
catch (Exception ex)
{
_logger.LogError(ex, "Something went wrong.");
}
}
}
}
Update: I can replicate it with just one instance (even if the log gets a bit trickier to follow then):
using Medallion.Threading;
namespace Samples;
public sealed class Job1 : JobBase
{
public Job1(IServiceScopeFactory serviceScopeFactory, ILogger<Job1> logger) : base(serviceScopeFactory, logger) { }
}
public sealed class Job2 : JobBase
{
public Job2(IServiceScopeFactory serviceScopeFactory, ILogger<Job2> logger) : base(serviceScopeFactory, logger) { }
}
public abstract class JobBase : BackgroundService
{
private readonly IServiceScopeFactory _serviceScopeFactory;
private readonly ILogger _logger;
public JobBase(IServiceScopeFactory serviceScopeFactory, ILogger logger)
{
_serviceScopeFactory = serviceScopeFactory;
_logger = logger;
}
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
while (!stoppingToken.IsCancellationRequested)
{
try
{
await using var scope = _serviceScopeFactory.CreateAsyncScope();
var distributedLockProvider = scope.ServiceProvider.GetRequiredService<IDistributedLockProvider>();
_logger.LogInformation("Will try to acquire lock.");
await using var handle = await distributedLockProvider.AcquireLockAsync(
"a1416b2940b34bbb9189caaa13f11b1a",
cancellationToken: stoppingToken
);
_logger.LogInformation(
"Acquired {CancelableDescription} lock.",
handle.HandleLostToken.CanBeCanceled ? "cancelable" : "uncancelable"
);
await using var _ = handle.HandleLostToken.Register(() => _logger.LogError("Lost lock."));
using var stoppingCts = CancellationTokenSource.CreateLinkedTokenSource(
stoppingToken,
handle.HandleLostToken
);
if (stoppingCts.Token.IsCancellationRequested)
{
return;
}
while (!stoppingCts.IsCancellationRequested) // This evaluates to true sometimes even if the database has been restarted
{
_logger.LogInformation("Doing stuff.");
try
{
await Task.Delay(TimeSpan.FromSeconds(30), stoppingCts.Token);
}
catch (TaskCanceledException) { }
if (!stoppingToken.IsCancellationRequested)
{
_logger.LogInformation("Cancellation is not requested.");
}
}
}
catch (Exception exception)
{
_logger.LogError("Exception {Exception} thrown.", exception.GetType());
}
}
}
}
@OskarKlintrot thanks for your interest in the library. A few things to look into:
- The HandleLostToken is supposed to become canceled when it detects loss of connection with the database. This isn't always detected instantaneously. When you say that sometimes you aren't seeing the cancellation signal, how long are you waiting to see if the signal fires?
- Just to validate, can you confirm that if you log handle.HandleLostToken.CanBeCanceled in the problem scenario it gives true?
- Can you post the code for how you are constructing/registering your IDistributedLockProvider?
- Just a note for the production code (may not be relevant): you should have usings for the call to handle.HandleLostToken.Register and the var stoppingCts, since these are both disposable (see the sketch after this list).
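That last suggestion amounts to something like the following (a minimal sketch reusing the names from your repro; both the registration and the linked token source are disposable):
// Dispose the lost-handle registration and the linked source when the enclosing scope ends.
using var registration = handle.HandleLostToken.Register(() => _logger.LogError("Lost lock."));
using var stoppingCts = CancellationTokenSource.CreateLinkedTokenSource(
    stoppingToken,
    handle.HandleLostToken
);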
- In the while-loop I have a Task.Delay that waits 30 sec, and since I pass the cancellation token to it I would assume that sometime during those 30 sec it would cancel, but it just loops back to the beginning of the loop. It seems to take 3-5 restarts before this happens.
- Yep, it logs as true.
- Here's how the provider is registered:
.AddSingleton<IDistributedLockProvider>(provider =>
new SqlDistributedSynchronizationProvider(provider.GetRequiredService<IConfiguration>().GetConnectionString(connectionStringName))
)
- Oh, good catch!
I'll update my original post with your feedback.
Here's the log after 5-ish restarts:
02:17:52 info: Microsoft.Hosting.Lifetime[0] Application started. Press Ctrl+C to shut down.
02:17:52 info: Microsoft.Hosting.Lifetime[0] Hosting environment: Development
02:17:52 info: Microsoft.Hosting.Lifetime[0] Content root path: --------------------
02:17:53 info: Samples.Job1[0] Acquired cancelable lock.
02:17:53 info: Samples.Job1[0] Doing stuff.
02:18:23 info: Samples.Job1[0] Cancellation is not requested.
02:18:23 info: Samples.Job1[0] Doing stuff.
02:18:23 info: Samples.Job1[0] Cancellation is not requested.
02:18:23 fail: Samples.Job2[0] Exception Microsoft.Data.SqlClient.SqlException thrown.
02:18:23 fail: Samples.Job1[0] Exception System.InvalidOperationException thrown.
02:18:23 info: Samples.Job1[0] Will try to acquire lock.
02:18:23 info: Samples.Job2[0] Will try to acquire lock.
02:18:31 info: Samples.Job2[0] Acquired cancelable lock.
02:18:31 info: Samples.Job2[0] Doing stuff.
02:18:41 info: Samples.Job2[0] Cancellation is not requested.
02:18:41 fail: Samples.Job1[0] Exception Microsoft.Data.SqlClient.SqlException thrown.
02:18:42 info: Samples.Job1[0] Will try to acquire lock.
02:18:42 fail: Samples.Job2[0] Exception System.InvalidOperationException thrown.
02:18:42 info: Samples.Job2[0] Will try to acquire lock.
02:18:50 info: Samples.Job2[0] Acquired cancelable lock.
02:18:50 info: Samples.Job2[0] Doing stuff.
02:18:59 info: Samples.Job2[0] Cancellation is not requested.
02:18:59 fail: Samples.Job1[0] Exception Microsoft.Data.SqlClient.SqlException thrown.
02:18:59 info: Samples.Job1[0] Will try to acquire lock.
02:18:59 fail: Samples.Job2[0] Exception System.InvalidOperationException thrown.
02:18:59 info: Samples.Job2[0] Will try to acquire lock.
02:19:07 info: Samples.Job2[0] Acquired cancelable lock.
02:19:07 info: Samples.Job2[0] Doing stuff.
02:19:18 info: Samples.Job2[0] Cancellation is not requested.
02:19:18 fail: Samples.Job1[0] Exception Microsoft.Data.SqlClient.SqlException thrown.
02:19:18 info: Samples.Job1[0] Will try to acquire lock.
02:19:18 fail: Samples.Job2[0] Exception System.InvalidOperationException thrown.
02:19:18 info: Samples.Job2[0] Will try to acquire lock.
02:19:26 info: Samples.Job1[0] Acquired cancelable lock.
02:19:26 info: Samples.Job1[0] Doing stuff.
02:19:39 fail: Samples.Job2[0] Exception Microsoft.Data.SqlClient.SqlException thrown.
02:19:39 info: Samples.Job2[0] Will try to acquire lock.
02:19:48 info: Samples.Job2[0] Acquired cancelable lock.
02:19:48 info: Samples.Job2[0] Doing stuff.
02:19:56 info: Samples.Job1[0] Cancellation is not requested.
02:19:56 info: Samples.Job1[0] Doing stuff.
02:20:18 info: Samples.Job2[0] Cancellation is not requested.
02:20:18 info: Samples.Job2[0] Doing stuff.
Thanks for the updates @OskarKlintrot!
Just a few more questions:
- Can you drop a similar log output snapshot for a restart where everything goes right? I assume you see Lost lock in that case, or no?
- I'd love to see the full ToString() of the exceptions thrown in the log above (SQL and InvalidOperation). Can you include them?
- Can you confirm the version of DistributedLock.SqlServer you're using?
- Can you confirm the version of Microsoft.Data.SqlClient that you're using?
- Can you confirm the version of SQL Server that you're using?
- Can you confirm the version of .NET you're targeting (e.g. net6.0)?
- Can you test to see whether disabling multiplexing in the lock provider options fixes the issue? (See the sketch after this list.)
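For reference, disabling multiplexing would go where the provider is registered, along these lines (a sketch based on the registration you posted; it assumes the options-callback overload of the SqlDistributedSynchronizationProvider constructor):
.AddSingleton<IDistributedLockProvider>(provider =>
    new SqlDistributedSynchronizationProvider(
        provider.GetRequiredService<IConfiguration>().GetConnectionString(connectionStringName),
        options => options.UseMultiplexing(false) // turn multiplexing off for this test
    )
)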
- I usually see Lost lock; not sure why, but they disappeared when I added the using. But the exceptions are the right ones: SqlException is when it loses the lock and InvalidOperationException is when the connection closes while it tries to acquire the lock.
03:00:18 fail: Sample.App.Job2[0] Exception Microsoft.Data.SqlClient.SqlException thrown:
Microsoft.Data.SqlClient.SqlException (0x80131904): Cannot continue the execution because the session is in the kill state. A severe error occurred on the current command. The results, if any, should be discarded.
at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted)
at Microsoft.Data.SqlClient.SqlCommand.CompleteAsyncExecuteReader(Boolean isInternal, Boolean forDescribeParameterEncryption)
at Microsoft.Data.SqlClient.SqlCommand.InternalEndExecuteNonQuery(IAsyncResult asyncResult, Boolean isInternal, String endMethod)
at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryAsync(IAsyncResult asyncResult)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization) --- End of stack trace from previous location ---
at Medallion.Threading.Internal.Data.DatabaseCommand.InternalExecuteAndPropagateCancellationAsync[TState,TResult](TState state, Func`3 executeAsync, CancellationToken cancellationToken, Boolean isConnectionMonitoringQuery) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\DatabaseCommand.cs:line 165
at Medallion.Threading.Internal.Data.DatabaseCommand.ExecuteAsync[TResult](Func`3 executeAsync, Func`2 executeSync, CancellationToken cancellationToken, Boolean disallowAsyncCancellation, Boolean isConnectionMonitoringQuery) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\DatabaseCommand.cs:line 92
at Medallion.Threading.SqlServer.SqlApplicationLock.ExecuteAcquireCommandAsync(DatabaseConnection connection, String lockName, TimeoutValue timeout, CancellationToken cancellationToken) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.SqlServer\SqlApplicationLock.cs:line 64
at Medallion.Threading.SqlServer.SqlApplicationLock.Medallion.Threading.Internal.Data.IDbSynchronizationStrategy<System.Object>.TryAcquireAsync(DatabaseConnection connection, String resourceName, TimeoutValue timeout, CancellationToken cancellationToken) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.SqlServer\SqlApplicationLock.cs:line 47
at Medallion.Threading.Internal.Data.MultiplexedConnectionLock.TryAcquireAsync[TLockCookie](String name, TimeoutValue timeout, IDbSynchronizationStrategy`1 strategy, TimeoutValue keepaliveCadence, CancellationToken cancellationToken, Boolean opportunistic)
at Medallion.Threading.Internal.Data.MultiplexedConnectionLock.TryAcquireAsync[TLockCookie](String name, TimeoutValue timeout, IDbSynchronizationStrategy`1 strategy, TimeoutValue keepaliveCadence, CancellationToken cancellationToken, Boolean opportunistic) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\MultiplexedConnectionLock.cs:line 94
at Medallion.Threading.Internal.Data.MultiplexedConnectionLockPool.TryAcquireAsync[TLockCookie](String connectionString, String name, TimeoutValue timeout, IDbSynchronizationStrategy`1 strategy, TimeoutValue keepaliveCadence, CancellationToken cancellationToken) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\MultiplexedConnectionLockPool.cs:line 91
at Medallion.Threading.Internal.Data.MultiplexedConnectionLockPool.TryAcquireAsync[TLockCookie](String connectionString, String name, TimeoutValue timeout, IDbSynchronizationStrategy`1 strategy, TimeoutValue keepaliveCadence, CancellationToken cancellationToken) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\MultiplexedConnectionLockPool.cs:line 97
at Medallion.Threading.Internal.DistributedLockHelpers.Wrap[THandle](ValueTask`1 handleTask, Func`2 factory) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\DistributedLockHelpers.cs:line 41
at Medallion.Threading.Internal.DistributedLockHelpers.ThrowTimeoutIfNull[T](ValueTask`1 task, String object) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\DistributedLockHelpers.cs:line 142
at Medallion.Threading.Internal.Helpers.Convert[TDerived,TBase](ValueTask`1 task, ValueTaskConversion _) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Helpers.cs:line 24
at Sample.App.JobBase.ExecuteAsync(CancellationToken stoppingToken) in C:\Users\OskarKlintrot\source\repos\Sample\Sample.App\Workers.cs:line 41
at Sample.App.JobBase.ExecuteAsync(CancellationToken stoppingToken) in C:\Users\OskarKlintrot\source\repos\Sample\Sample.App\Workers.cs:line 65 ClientConnectionId:7b765ec6-eaef-4a74-b8ee-5214c20ed456 Error Number:596,State:1,Class:21
03:00:18 fail: Sample.App.Job1[0] Exception System.InvalidOperationException thrown:
System.InvalidOperationException: BeginExecuteNonQuery requires an open and available Connection. The connection's current state is closed.
at Microsoft.Data.SqlClient.SqlCommand.ValidateCommand(Boolean isAsync, String method)
at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource`1 completion, Boolean sendToPipe, Int32 timeout, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String methodName)
at Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryInternal(CommandBehavior behavior, AsyncCallback callback, Object stateObject, Int32 timeout, Boolean inRetry, Boolean asyncWrite)
at Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryAsync(AsyncCallback callback, Object stateObject)
at System.Threading.Tasks.TaskFactory`1.FromAsyncImpl(Func`3 beginMethod, Func`2 endFunction, Action`1 endAction, Object state, TaskCreationOptions creationOptions)
at System.Threading.Tasks.TaskFactory`1.FromAsync(Func`3 beginMethod, Func`2 endMethod, Object state)
at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQueryAsync(CancellationToken cancellationToken) --- End of stack trace from previous location ---
at Medallion.Threading.Internal.Data.DatabaseCommand.ExecuteAsync[TResult](Func`3 executeAsync, Func`2 executeSync, CancellationToken cancellationToken, Boolean disallowAsyncCancellation, Boolean isConnectionMonitoringQuery) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\DatabaseCommand.cs:line 88
at Medallion.Threading.SqlServer.SqlApplicationLock.ExecuteReleaseCommandAsync(DatabaseConnection connection, String lockName, Boolean isTry) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.SqlServer\SqlApplicationLock.cs:line 71
at Medallion.Threading.Internal.Data.MultiplexedConnectionLock.ReleaseAsync[TLockCookie](IDbSynchronizationStrategy`1 strategy, String name, TLockCookie lockCookie) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\MultiplexedConnectionLock.cs:line 159
at Medallion.Threading.Internal.Data.MultiplexedConnectionLock.ReleaseAsync[TLockCookie](IDbSynchronizationStrategy`1 strategy, String name, TLockCookie lockCookie) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.Core\Internal\Data\MultiplexedConnectionLock.cs:line 173
at Sample.App.JobBase.ExecuteAsync(CancellationToken stoppingToken) in C:\Users\OskarKlintrot\source\repos\Sample\Sample.App\Workers.cs:line 65
at Sample.App.JobBase.ExecuteAsync(CancellationToken stoppingToken) in C:\Users\OskarKlintrot\source\repos\Sample\Sample.App\Workers.cs:line 65
- 1.0.1
- 4.1.0
- @@VERSION:
Microsoft SQL Server 2019 (RTM) - 15.0.2000.5 (X64)
Sep 24 2019 13:48:23
Copyright (C) 2019 Microsoft Corporation
Developer Edition (64-bit) on Windows 10 Pro 10.0 (Build 22000: ) (Hypervisor)
- net6.0
- That seems to fix the issue. I restarted 10 times, and every single time both jobs threw exceptions, tried to get a new lock, and never just kept going. I stopped the app, changed .UseMultiplexing(false) to .UseMultiplexing(true), and then it took just 3 restarts until the issue came back.
I made a repo with a fully working sample: https://github.com/OskarKlintrot/DistributedLock-bug
Ok, using your example code I was able to repro locally, and I have a fix which seems to resolve the issue for me.
I've published a prerelease version with my fix. Can you install it locally and let me know whether or not this resolves the issue for you?
As for what I changed:
Basically, in ConnectionMonitor.cs we were mistakenly unsubscribing from state change events when the connection was merely closed but not yet disposed. Then, when the connection would re-open, we wouldn't be able to track closing any more. The fix I made was this:
// If disposing, unsubscribe from state change tracking.
if (isDispose // PREVIOUSLY MISSING
&& this._stateChangedHandler != null
&& this._weakConnection.TryGetTarget(out var connection))
{
((DbConnection)connection.InnerConnection).StateChange -= this._stateChangedHandler;
}
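To illustrate why closed and disposed need different treatment here, a standalone sketch using Microsoft.Data.SqlClient directly (not the library's internals; connectionString is assumed to be defined):
using Microsoft.Data.SqlClient;
// A closed connection can be re-opened, so its StateChange handler must stay attached;
// only a disposed connection will never change state again.
var connection = new SqlConnection(connectionString); // connectionString: assumed
connection.StateChange += (_, e) => Console.WriteLine($"{e.OriginalState} -> {e.CurrentState}");
connection.Open();    // Closed -> Open
connection.Close();   // Open -> Closed
connection.Open();    // Closed -> Open again; a handler removed on Close would have missed this
connection.Dispose(); // only now is it safe to unsubscribe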
I usually see Lost lock, not sure why but they disappeared when I added the using.
I ran into this too. The reason you don't see it is that when the cancellation triggers, it completes the Task.Delay task, which synchronously runs continuations until the next await. This causes us to exit the scope for the registration and clean it up before it gets to fire. Kind of like if we'd written:
var cts = new CancellationTokenSource();
var registration = cts.Token.Register(() => Console.WriteLine("here!"));
cts.Token.Register(() => registration.Dispose());
cts.Cancel(); // prints nothing!
I still think you'll generally want usings with your registrations; just have to be careful depending on the particular pattern!
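As a side illustration (my own sketch, not part of the library): because registrations on a token fire in last-in, first-out order, registering the logging callback after the cleanup-style callback lets it run before it gets disposed:
var cts = new CancellationTokenSource();
CancellationTokenRegistration logging = default;
cts.Token.Register(() => logging.Dispose()); // stands in for the scope cleanup above
logging = cts.Token.Register(() => Console.WriteLine("here!"));
cts.Cancel(); // prints "here!"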
I installed the prerelease version and restarted the server 10 times (for ($num = 1 ; $num -le 10 ; $num++){ net stop MSSQLSERVER; net start MSSQLSERVER; Start-Sleep -Seconds 10; Write-Host "Restarted $num times" }) and everything seems to be working just as it should now!
As for what I changed:
Oh, that explains it! Nice catch!
I ran into this too.
Thanks for the explanation, which makes sense! What I did instead is basically this:
CancellationTokenRegistration? cancellationTokenRegistration = null;
try
{
cancellationTokenRegistration = handle.HandleLostToken.Register(() => _logger.LogError("Lost lock."));
}
catch (Exception exception)
{ }
finally
{
if (cancellationTokenRegistration.HasValue)
{
await cancellationTokenRegistration.Value.DisposeAsync();
}
}
I just finished a little experiment. I raised the time before retrying to acquire a lock to 10 sec so the jobs would take turns picking up the lock. I also raised the time between SQL Server restarts to 60 sec. Then I restarted the SQL Server 100 times and everything still looks perfectly fine:
(I used a static field to count the lost locks, hence 100 times)
What's the plan forward, btw? Should I use the pre-release package, or will you release a stable version soon? Or will you bump DistributedLock.SqlServer?
Awesome, thanks for the rigorous testing!
What's the plan forward btw?
I plan to release this as a stable patch fix to DL.Core and to all of the other SQL-based libraries (including DL.SqlServer). I may bundle in some other small fixes as well after reviewing the backlog. While I expect to have this out relatively soon (should have some time to prioritize this weekend), I'd suggest that you use the prerelease version in the meantime so that you are unblocked.
Awesome, thanks for a very quick fix! Looking forward to the stable patch fix! I'll use the prerelease until then.