dotnet/orleans

Awaiting this.AsReference call causes deadlock in Orleans 7+

cbgrasshopper opened this issue · 5 comments

I created a minimal reproducible case for a significant change in behavior I'm seeing between Orleans 3.x and 7+; namely, a grain that awaits a call to itself via this.AsReference works fine in Orleans 3.x and deadlocks in 7+.

The following code (adapted from the minimal Orleans sample app) works in Orleans 3.x but deadlocks in Orleans 7+ when the client calls SayHello:

    public async ValueTask<string> SayHello(string greeting)
    {
        _logger.LogInformation("""
                               SayHello message received: greeting = "{Greeting}"
                               """,
            greeting);

        var name = await this.AsReference<IHelloGrain>().GetName();

        // Direct in-process call, which does not go through the grain reference:
        // var name = await GetName();

        return $"""
                {name} said: "{greeting}", so HelloGrain says: Hello!
                """;
    }

    public ValueTask<string> GetName()
    {
        return ValueTask.FromResult("Name");
    }

Using AsReference gives you a proxy class, the same one external callers use to access the grain.
So the single-threaded access rule applies.
You are inside the grain and you try to call a method on the same grain by going through the outside => DEADLOCK

It's like entering your house, locking the door, leaving through the window, and then trying to re-enter through the locked door.

Try tagging "GetName" with AlwaysInterleaveAttribute ([AlwaysInterleave]).

This will allow the recursive call.
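A minimal sketch of that suggestion, assuming the grain interface from the sample is named IHelloGrain and uses a string key (the key type is an assumption, it is not shown in the issue). The attribute goes on the interface method, not on the grain class:

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Concurrency;

// Assumed interface shape; only the two method signatures come from the issue.
public interface IHelloGrain : IGrainWithStringKey
{
    ValueTask<string> SayHello(string greeting);

    // Requests to GetName may interleave with any other request running on
    // the same activation, so the self-call from SayHello is not queued
    // behind the SayHello request that is awaiting it.
    [AlwaysInterleave]
    ValueTask<string> GetName();
}
```

Note that this lets GetName interleave with every request, not just the recursive one, so it is probably only appropriate when the method is safe to run while another request is suspended at an await.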

So I guess the bug was in Orleans 3.x, which did not deadlock?

I understand the principle and I'm pretty sure the intentions were the same in Orleans 3.x. It just so happened that we had a few places in our Orleans 3.x implementation that used code similar to my example without any problem. We are currently migrating to Orleans 8.x and encountered the deadlock and were surprised by the difference in behavior.

In 3.x, there was a global option, enabled by default, that tried to allow call-chain reentrancy. The implementation was neither effective nor correct. In 7.x+, the option still exists, but you opt in on a per-call-site basis. See the docs on call-chain reentrancy here, which include a demonstration of how to accomplish this: https://learn.microsoft.com/en-us/dotnet/orleans/grains/request-scheduling#call-chain-reentrancy
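As a rough sketch of the per-call-site option applied to the example from this issue, following the pattern in the linked docs: RequestContext.AllowCallChainReentrancy() returns a disposable scope, and calls made while the scope is open may re-enter the calling grain. The interface shape and the string key are assumptions, and logging is omitted for brevity:

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Runtime;

// Assumed interface shape; only the two method signatures come from the issue.
public interface IHelloGrain : IGrainWithStringKey
{
    ValueTask<string> SayHello(string greeting);
    ValueTask<string> GetName();
}

public sealed class HelloGrain : Grain, IHelloGrain
{
    public async ValueTask<string> SayHello(string greeting)
    {
        // Opt in to call-chain reentrancy for calls made until the scope
        // is disposed at the end of this method.
        using var scope = RequestContext.AllowCallChainReentrancy();

        // The call still goes through the grain reference, but it is now
        // allowed to re-enter this activation while SayHello is awaiting.
        var name = await this.AsReference<IHelloGrain>().GetName();

        return $"{name} said: \"{greeting}\", so HelloGrain says: Hello!";
    }

    public ValueTask<string> GetName() => ValueTask.FromResult("Name");
}
```

Unlike marking a method as always interleaving, this only affects requests that are part of the same call chain, so unrelated callers are still processed one at a time.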