fiskaltrust/product-de-bring-your-own-datacenter

Failed to acquire lock after 00:01:00. Timeout - RedisLockService

dhuesmann89 opened this issue ยท 12 comments

Describe the bug
I know that here have already been issues with this exact error message but solutions mentioned there didn't help unfortunately:
The byodc pod logs the following error after a long time: "Failed to acquire lock after 00:01:00. Timeout".
It is not caused by any memory constraints nor are there any exports running (that we can see).
What else could cause that problem? The pod starts correctly and is in the state ready so it can connect to the redis cluster.

STDOUT/STDERR

Failed method: fiskaltrust.SignatureCloud.DE.Services.Locking.RedisLockService+<ProcessLockiningAsync>d__3`1.MoveNext

System.TimeoutException:
   at fiskaltrust.SignatureCloud.DE.Services.Locking.RedisLockService+<ProcessLockiningAsync>d__3`1.MoveNext (fiskaltrust.SignatureCloud.DE, Version=1.3.42.0, Culture=neutral, PublicKeyToken=null: /src/src/fiskaltrust.SignatureCloud.DE/Services/Locking/RedisLockService.cs:33)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at fiskaltrust.SignatureCloud.DE.Services.MiddlewareInstanceFactory+<CreateAsync>d__10.MoveNext (fiskaltrust.SignatureCloud.DE, Version=1.3.42.0, Culture=neutral, PublicKeyToken=null: /src/src/fiskaltrust.SignatureCloud.DE/Services/MiddlewareInstanceFactory.cs:85)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at fiskaltrust.SignatureCloud.DE.Controllers.SignController+<Post>d__6.MoveNext (fiskaltrust.SignatureCloud.DE, Version=1.3.42.0, Culture=neutral, PublicKeyToken=null: /src/src/fiskaltrust.SignatureCloud.DE/Controllers/SignController.cs:90)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at lambda_method12 (Anonymously Hosted DynamicMethods Assembly, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor+AwaitableObjectResultExecutor+<Execute>d__0.MoveNext (Microsoft.AspNetCore.Mvc.Core, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Threading.Tasks.ValueTask`1.get_Result (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker+<<InvokeActionMethodAsync>g__Awaited|12_0>d.MoveNext (Microsoft.AspNetCore.Mvc.Core, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker+<<InvokeNextActionFilterAsync>g__Awaited|10_0>d.MoveNext (Microsoft.AspNetCore.Mvc.Core, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow (Microsoft.AspNetCore.Mvc.Core, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next (Microsoft.AspNetCore.Mvc.Core, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker+<<InvokeInnerFilterAsync>g__Awaited|13_0>d.MoveNext (Microsoft.AspNetCore.Mvc.Core, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker+<<InvokeFilterPipelineAsync>g__Awaited|19_0>d.MoveNext (Microsoft.AspNetCore.Mvc.Core, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker+<<InvokeAsync>g__Awaited|17_0>d.MoveNext (Microsoft.AspNetCore.Mvc.Core, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.AspNetCore.Routing.EndpointMiddleware+<<Invoke>g__AwaitRequestTask|6_0>d.MoveNext (Microsoft.AspNetCore.Routing, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.AspNetCore.Authorization.AuthorizationMiddleware+<Invoke>d__6.MoveNext (Microsoft.AspNetCore.Authorization.Policy, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.AspNetCore.Builder.Extensions.MapWhenMiddleware+<Invoke>d__3.MoveNext (Microsoft.AspNetCore.Http.Abstractions, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.AspNetCore.Builder.Extensions.MapWhenMiddleware+<Invoke>d__3.MoveNext (Microsoft.AspNetCore.Http.Abstractions, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol+<ProcessRequests>d__223`1.MoveNext (Microsoft.AspNetCore.Server.Kestrel.Core, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)

Cashbox Information (please complete the following information):

  • 1ede12e0-19d9-4d0c-b2be-91a6965eabe1

Additional context
Sandbox

Hey, this can occur when a pod crashes. It's part of the mechanism that makes sure no two pods are running the same cashbox worker.

This error is nothing problematic in itsself.

The interesting thing is if and why one of the pods crashed. Can you verify if that happened?

If no pod crashed this can also happen if the header based routing of the loadbalancer is not configured/working correctly. Can you share the loadbalancer and emissary-ingress section of your helm config.yaml?

How often does this error occur?

Hi, thanks for your quick response!
There have been no pod restarts (neither byodc nor redis pods).
The issue currently happens everytime we make a request against the middleware on our test system.
We use the default values for the loadbalancer and emissary-ingress sections. All services from the chart are only available inside the cluster. Requests are only coming from a different application running in Kubernetes which in turn can be reached via nginx ingress controller.

Hey,

Does this happen only with one cashbox or all?

Can you try sending an echo null to reset the cashbox.

Hi @volllly,

We're experiencing the same issue in our test environment at the moment - it started after the upgrade to newer chart/image versions (1.3.38/1.3.44-buster respectively).

Since we have our own instance of Redis, I tried upgrading that as well (to 6.2) and I also tried redeploying the stack using the built-in Redis service rather than our own. I also tried manually rebuilding cashbox configuration, restarting & clearing Redis, and sending empty echo requests. None of this seemed to help.

Since you mentioned the loadbalancer, I tried investigating that a little but it seems the requests I send always go to the same pod for a given cashbox id.

Our config.yml looks like this at the moment:

                byodc:
                  image:
                    tag: 1.3.44-buster
                  config:
                    replicas:  2
                    logLevel: Debug
                    helipad:
                      baseUrl: https://helipad-sandbox.fiskaltrust.cloud
                    limits:
                      cpu: 225m
                      memory: 150Mi

                loadbalancer:
                  config:
                    hostname: app.byodc.fiskaltrust.internal.dev

                emissary-ingress:
                  replicaCount: 2
                  service:
                    annotations:
                      service.beta.kubernetes.io/aws-load-balancer-internal: "true"
                  adminService:
                    loadBalancerSourceRanges:
                        - 10.3.0.0/16
                        - 192.168.0.0/17

                  livenessProbe:
                    initialDelaySeconds : 60
                    periodSeconds: 10
                  readinessProbe:
                    initialDelaySeconds : 120
                    periodSeconds: 10

IDK if it's related but apart from the exception mentioned above by @dhuesmann89, we also see this message in our logs:

\u001b[40m\u001b[1m\u001b[33mwarn\u001b[39m\u001b[22m\u001b[49m: fiskaltrust.SignatureCloud.DE.Services.HelipadUploader[0]\n      Failed to remove queue upload for cashbox cashbox. cashboxid: 9913c9f9-c2e8-46e0-9c12-4d26129ce684\n"

This one actually seems to show up on a different pod than the requests are processed at.

Small update: I simplified our config a bit to make sure we're not messing anything up and recreated the entire namespace and it's still not working (same errors, all on same pod again).

                byodc:
                  image:
                    tag: 1.3.44-buster
                  config:
                    replicas:  2
                    logLevel: Debug
                    helipad:
                      baseUrl: https://helipad-sandbox.fiskaltrust.cloud
                    limits:
                      cpu: 225m
                      memory: 150Mi

                emissary-ingress:
                  replicaCount: 2

Edit: I now tried this with 1.3.45-rc1-bullseye too and it doesn't work as well.

Edit 2:
I also tried increasing the resources for the byodc pod (1500m, 2048Mi) and it didn't help either.
I'm currently thinking that the error message is misleading as I can see the lock key appearing and disappearing during the processing of the request -> IDK how it's coded on your side but could it be that the lock is acquired correctly but there's something else that fails to be processed during that time?

Edit 3:
IDK if this is due to using different versions or it's actually 2 different issues but I noticed that while the exception message is the same, our requests get blocked on a different method than in @dhuesmann89 's case:

Microsoft.AspNetCore.Server.Kestrel[13]
Connection id \"0HMP9BV6V7E0G\", Request id \"0HMP9BV6V7E0G:00000002\": An unhandled exception was thrown by the application.
System.TimeoutException: Failed to acquire lock after 00:01:00. Timeout
at fiskaltrust.SignatureCloud.DE.Services.Locking.RedisLockService.ProcessLockiningAsync[T](String lockName, Func`1 executeMethod) in /src/src/fiskaltrust.SignatureCloud.DE/Services/Locking/RedisLockService.cs:line 33
at fiskaltrust.SignatureCloud.DE.Services.MiddlewareInstanceFactory.CreateAsync(Guid cashboxid, String accesstoken, ftCashBoxConfiguration configuration) in /src/src/fiskaltrust.SignatureCloud.DE/Services/MiddlewareInstanceFactory.cs:line 85
at fiskaltrust.SignatureCloud.DE.Controllers.SignController.Post(ReceiptRequest request) in /src/src/fiskaltrust.SignatureCloud.DE/Controllers/SignController.cs:line 90

Hey, thanks for the detailed report ๐Ÿ™

I have a few followups:

What was the latest version which did not experience this issue?

Please set the byodc.config.logLevel key to Trace in your config and send us the logs from all pyodc pods when this error occurs (including the logs ca 5 minutes before and after the error occurs).

Does this error occur every time when sending a receipt or only sometimes and does it happen with all cashboxes or only some?

Hey!

We didn't check all of the possible combinations. We started from 1.3.24 (this works), went to 1.3.44, then a few versions below (maybe 3 or 4), and then also tried 1.3.45.

Here are the trace logs. I only sent 2 requests after redeploying, so there's not a whole lot more in there:
logs-insights-results.csv

This is the other instance for reference:
logs-insights-results (1).csv

The error occurs every time for all cashboxes.

Hey, can you try using 1.3.45-rc1-bullseye scaling down to one replica, sending an echo null and then a check if the error still occurs?

byodc:
  image:
    tag: 1.3.45-rc1-bullseye
  config:
    replicas: 1

Yup, the error is still there:
log-events-viewer-result.csv

@volllly I had a bit more time to investigate:

  • I went back to image tag 1.3.24-buster which works fine with helm chart 1.3.38
  • I then started upgrading 1 minor version at time (only 1.3.xx-buster tags)
  • the problem started occurring when upgrading from 1.3.33-buster to 1.3.37-buster
    • clearing the cache or sending an echo null didn't help
  • after rolling back to 1.3.33-buster it didn't work immediately, I had to clear the cache and rebuild the cashbox with echo null

Hi again @volllly ,

I was able to find the issue and resolve it.

In this release you upgraded the MySQL connector version to 2.1.0:
https://github.com/fiskaltrust/middleware/releases/tag/queue%2Fmysql%2Fv1.3.36

This version is not compatible out-of-the-box with Aurora MySQL 5.7 that we were using:
https://mysqlconnector.net/troubleshooting/aurora-freeze/

Since adapting the proposed workaround would mean having to change every single connection string, we decided to upgrade to Aurora MySQL 8 which works fine without extra connection parameters.

We seem to be fine for now but I think there are some worthwhile improvements for the future:

  • put information about this in some guide or troubleshooting page
  • improve your connection mechanism and logs to make it clear a database connection is failing (especially that it now shows an error about Redis locks)
  • consider officially supporting Aurora and thus taking things like this into consideration (or making it clear you're not supporting it)

Cheers!
Grzegorz

Hey @gregzip, great that you could find the issue ๐Ÿ˜

Thanks for the valuable feedback.

Our takeaways are that:

  • we'll look into why this error was not handled correctly and improve the situation
  • we'll better document software and version requirements of byodc and add changes to the release notes.