dotnet/orleans

Multiple grain exceptions lead to deadlock

ins4n333 opened this issue · 6 comments

Hi.
We occasionally see a silo hang: it stops responding to connections from other silos and from clients.

During the analysis of a memory dump taken from the hung process, we observe something resembling a deadlock (Orleans version 8.2). When handling and processing multiple simultaneously occurring exceptions (via ex.ToString()) in different grains, multiple calls are made to SR.InternalGetResourceString (as far as I understand, this is a feature of .NET), which in turn invokes Monitor.Enter. This results in multiple grains being blocked by a shared resource.

I am attaching a task graph that shows this hanging behavior. The process hangs indefinitely, and only a manual restart helps. Could you please comment on this?
[screenshot: full]

This is new. From my initial look, it's not clear that this is related to Orleans, but I want to help you figure it out anyway. Could you share a screenshot of the Threads view from Parallel Stacks? That may make the situation clearer.

You can find the parallel threads in the screenshot below: [screenshot: pthreads]

@ins4n333 the second screenshot does not seem to show a deadlock.
EDIT: to clarify, what I'm looking for is the Parallel Stacks > Threads view
[screenshot]

Yes, this is exactly what I sent you. The Tasks pane in Visual Studio shows a kind of deadlock when multiple tasks access "SR Monitor". However, this is not visible in the Threads view.

I'll try to explain our problem in more detail. We have been using the actor model in this service based on Service Fabric for quite some time. We are now planning to transition to Kubernetes and replace our actors with Orleans grains. In our system, there is only one actor class performing several different operations. These operations are quite heavy, typically involving 5-10 IO calls to external systems/databases per operation (avg duration of each call ~ 300-500 ms). However, grains do not call each other. We have not had any problems with Service Fabric actors.

Upon switching to Orleans, we encountered many strange issues:

  1. IamAlive probes: These probes stop being sent for active silos. We tried Azure Tables and Cassandra for clustering (both using your package and our own implementation).

  2. Sometimes nodes get stuck in DEAD status without any restart and continue listening on the Silo/Proxy ports (I checked this by sending a TCP ping to the Orleans ports).

  3. Node crashes under load: On our test stands, nodes occasionally crash with errors like:

    Description: The process was terminated due to an unhandled exception.
    Exception Info: System.ApplicationException: ERROR_ABANDONED_WAIT_0 (0x800702DF)
    at System.Threading.PortableThreadPool.IOCompletionPoller.Poll()

    Application: CoreActorService.exe
    Description: The process was terminated due to an unhandled exception.
    Exception Info: System.IO.IOException: The handle is invalid.
    at System.Threading.EventWaitHandle.Set()
    at System.Threading.TimerQueue.SetTimer(UInt32 actualDuration)
    at System.Threading.TimerQueue.EnsureTimerFiresBy(UInt32 requestedDuration)
    at System.Threading.TimerQueue.FireNextTimers()
    at System.Threading.ThreadPoolWorkQueue.Dispatch()
    at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()

  4. Strange network errors (System.Net.Http.HttpRequestException: An operation was attempted on something that is not a socket. ) within grains: Errors like these do not occur under the same load when using Service Fabric actors, as if the Orleans context imposes restrictions on network calls:

    ...Ucp.Gateways.DeviceInventory.Contracts.DeviceInventoryRequestException: Error communicating with the service
    ---> System.Net.Http.HttpRequestException: An operation was attempted on something that is not a socket. (stab.di.services....:443)
    ---> System.Net.Sockets.SocketException (10038): An operation was attempted on something that is not a socket.
    at System.Net.Sockets.Socket..ctor(AddressFamily addressFamily, SocketType socketType, ProtocolType protocolType)
    at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
    --- End of inner exception stack trace ---
    at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
    at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.HttpConnectionWaiter`1.WaitForConnectionAsync(Boolean async, CancellationToken requestCancellationToken)
    at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
    at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpClient.g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, Cance

We have tried various approaches, for instance:

  1. Changing frameworks (.NET 7, .NET 8) and Orleans versions (Orleans 3, 7, 8).
  2. Attempting to use the SyncWork package or even moving all grain logic to the default thread pool using Task.Run.
  3. Modifying Orleans settings. Regarding settings, for some reason, even the default connection timeout of 5 seconds was not sufficient in some cases.

All these efforts yielded no results.

I kindly ask for your guidance on what direction we could take next to continue our analysis.

@ins4n333 I will help you investigate. If you'd like, we could take a look together over a Teams call. DM me on Discord and we can investigate together. The discord server invite address is https://aka.ms/orleans/discord.

The deadlock should show up in the Parallel Stacks > Threads view, since deadlocked threads stay active (they cannot yield back to the scheduler). Could you double-check that the screenshot was taken while there was an active deadlock? Do you have a memory dump of the system during a deadlock?

Yes, we have plenty of memory dumps captured with the dotnet-dump tool during such anomalies.
I will send you a DM on Discord.
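
For reference, dumps captured with dotnet-dump can also be checked for a real monitor deadlock from the command line, without Visual Studio. The following is only a rough sketch: the dump path `./silo-hang.dmp` and the helper name `analyze_hang` are placeholders I made up, and it assumes the dotnet-dump global tool is on the PATH. `syncblk` lists the monitors that are currently held together with the owning thread, and `clrstack -all` prints the managed stack of every thread, so a thread that holds the lock while other threads sit in Monitor.Enter should be visible if the deadlock is real.

```shell
# Hypothetical dump path; replace with a dump taken while the silo is hung.
DUMP="${DUMP:-./silo-hang.dmp}"

# Run a fixed set of SOS commands against a dump and exit:
#   syncblk       - held monitors and the thread that owns each one
#   clrthreads    - list of managed threads, to match against the owner
#   clrstack -all - managed stack of every thread
analyze_hang() {
  dotnet-dump analyze "$1" -c "syncblk" -c "clrthreads" -c "clrstack -all" -c "exit"
}

# Only run when both the tool and the dump file are actually present.
if command -v dotnet-dump >/dev/null 2>&1 && [ -f "$DUMP" ]; then
  analyze_hang "$DUMP"
else
  echo "dotnet-dump or $DUMP not found; nothing analyzed"
fi
```

If `syncblk` shows no held monitor, or the owning thread is not itself blocked, the picture in the Tasks pane is more likely contention or pending async work than a deadlock.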