dotnet/orleans

Calls to grains never complete on cluster change

krasin-ga opened this issue · 1 comments

We have encountered an issue where calls to grains never complete, and the response timeout does not take effect. In our scenario, the grain method includes a call to another grain and some non-blocking code. And the fact that timeouts did not take effect leads to grain being blocked forever.

The start of the blocking issue aligns with the moment when one of the silos joins the cluster and the other leaves it(gracefully). The response timeout is set to 150ms.

Logs (redacted):

Current request has been active for 00:54:10.1140000 for grain "[Activation: S10.125.156.43:11111:66744139/user/3A0454C+rNZWnDcZS36ZUcLoASCbUA@52197045f48c4a07b9765de50b142569 #GrainType=_.UserGrain,_ Placement=UserPlacementStrategy State=Valid NonReentrancyQueueSize=120 NumRunning=1 IdlenessTimeSpan=00:54:19.8550000 CollectionAgeLimit=00:02:00]". 
Currently executing "Request [S10.125.156.43:11111:66744139 sys.client/hosted-10.125.156.43:11111@66744139]->[S10.125.156.43:11111:66744139 user/3A0454C+rNZWnDcZS36ZUcLoASCbUA] _.IUserGrain[(_.IUserGrain)_.UserGrain].GetProfile(Orleans.Concurrency.Immutable`1[System.Collections.Generic.Dictionary`2[_,_._[]]], ) #185620326". 
Trying to enqueue "Request [S10.133.237.230:11111:66744137 sys.client/hosted-10.133.237.230:11111@66744137]->[S10.125.156.43:11111:66744139 user/3A0454C+rNZWnDcZS36ZUcLoASCbUA] _.IUserGrain.GetProfile(Orleans.Concurrency.Immutable`1[System.Collections.Generic.Dictionary`2[_,_._[]]], ) #213484691".

Sample code:

class UserGrain
{
    ISomeOtherGrain _otherGrain;

    public UserGrain(IGrainFactory grainFactory)
    {
        _otherGrain = grainFactory.GetGrain<ISomeOtherGrain >(...)
    }

    //non-interleaving method
    GetProfile()
    {
        var state = await _otherGrain.GetSomeState(); // <- never completes

        //some non-blocking code
        return ProcessState(state);
    }
}

Orleans version: 7.2.4

So far, this issue has not been reproduced again, and I have not been able to capture a trace to identify the cause.

Only one scenario I can think of is that the call hangs in one of the outgoing call filters before sending the request, so it will never go into the callbacks field of the runtime client, making it impossible to complete by timeout.

As an idea: maybe it's worth accounting for time spent in filters and adding cancellation support to them.