scientistproject/Scientist.net

Async ResultPublishers could deadlock sync experiments

thematthopkins opened this issue · 21 comments

Thanks for the great library!

One issue we ran into is when using an async ResultPublisher for synchronous experiments within an MVC app.

The controller below will deadlock on every request, because of the awaiting on Task.Delay() within the synchronous web request.

This article has a more in-depth explanation on why:
http://blog.stephencleary.com/2012/07/dont-block-on-async-code.html

It looks like ConfigureAwait could potentially be used for a fix, or possibly adding a separate ResultPublisher method that gets invoked on synchronous calls. Our current workaround is to use sync implementations of Publish without any awaits.

using System.Threading.Tasks;
using System.Web.Mvc;
using GitHub;

namespace ScientistDeadlockingResultPublisher.Controllers
{
    public class AsyncResultPublisher : IResultPublisher
    {
        public async Task Publish<T, TClean>(Result<T, TClean> result)
        {
            await Task.Delay(5000);
        }
    }

    public class DeadlockController : Controller
    {
        // GET: Deadlock
        public ActionResult Index()
        {
            Scientist.ResultPublisher = new AsyncResultPublisher();
            var result = Scientist.Science<int>("myExperiment", e =>
            {
                e.Use(() => 2);
                e.Try("candidate", () => 1);
            });

            return Content("done");
        }
    }

It looks like ConfigureAwait could potentially be used for a fix

To make sure I understand, you're suggesting that when Scientist calls IResultPublisher.Publish, it should append a ConfigureAwait(false) call to that?

Could you try it out in your example here by putting Task.Delay(5000).ConfigureAwait(false); and make sure that doesn't deadlock?

I tested out a couple approaches. The problem with ConfigureAwait(false) is that I had to add it on every point up the call stack. Meaning:
`Task.Delay(5000).ConfigureAwait(false);``

But also changing ExperimentInstance to use
await Scientist.ResultPublisher.Publish(result).ConfigureAwait(false);

As well as doing the same for the top level invocation from Scientist.

It would also require the implementation of the ResultPublisher to include it in all descendant async calls.

It doesn't feel like a great solution, since the ResultPublisher may need to invoke some async libraries that likely don't come pre-wired with ConfigureAwait(false).

Another solution I tested was to switch the invocation of the ResultPublisher.Publish to fire-and-forget as mentioned in the comments:

#pragma warning disable 4014
                Task.Run(async () =>
                {
                    await Scientist.ResultPublisher.Publish(result);
                });
#pragma warning restore 4014

That actually works beautifully and avoids the deadlock scenario completely. The problems with that approach would be:

  • It would break a number of unit tests that verify the ResultPublisher call. We'd have to create something like an IResultPublisherRunner that could be faked out with a non-fire-and-forget implementation.
  • If anyone's accessing something like a static HttpContext.Current in their ResultPublisher, it would become inaccessible after the switch-over to fire and forget.

A good solution for now that avoids these issues seems to be to use fire-and-forget within our ResultPublisher itself:

    public class FireAndForgetResultPublisher : IResultPublisher
    {
        public Task Publish<T, TClean>(Result<T, TClean> result)
        {
#pragma warning disable 4014
            Task.Run(async () =>
            {
                await Task.Delay(5000);
            });
#pragma warning restore 4014
            return Task.FromResult((object)null);
        }
    }

This would still allow us grab info from something like HttpContext.Current before we enter the fire-and-forget bit.

@thematthopkins I really like the idea of the FireAndForgetResultPublisher. It should probably be a wrapper publisher that allows you to pass in your real publisher. So something like:

public class FireAndForgetResultPublisher : IResultPublisher
{
    IResultPublisher _publisher;
    public FireAndForgetResultPublisher(IResultPublisher publisher)
    {
        _publisher = publisher;
    }

    public Task Publish<T, TClean>(Result<T, TClean> result)
    {
        #pragma warning disable 4014
        Task.Run(async () =>
        {
            await _publisher.Publish(result);
        });
        #pragma warning restore 4014
        return Task.FromResult((object)null);
    }
}

@thematthopkins interested in contributing this?

@haacked I made #68 that implements something similar to this, but it is handled at the ExperimentInstance level instead of the IResultPublisher implementation. Is this an option that we'd want to provide consumers of the library to choose to do?

@joncloud what I like about @thematthopkins's suggestion is we won't need to change our unit tests. They'll still use the InMemoryResultPublisher. Instead, we'll just make this FireAndForgetResultPublisher available to users, but it won't be set by default.

Got it - So the application developer can choose to opt-in for synchronous vs asynchronous publishing.

Got it - So the application developer can choose to opt-in for synchronous vs asynchronous publishing.

Exactly. In the example that @thematthopkins pointed out, the app developer might need to access HttpContext.Current during publishing. If we always publish async, their result publisher won't have a chance to store that value before publishing.

This approach gives the developer control, while making the async scenario very easy.

Just curious: on the implementation above is there a reason that Task.FromResult((object)null) is used instead of Task.CompletedTask?

Just curious: on the implementation above is there a reason that Task.FromResult((object)null) is used instead of Task.CompletedTask?

We probably should use that. Is that a new property?

Yeah fairly recently. I just double checked and it was added in .NET 4.6. That'd mean we'd have to change our framework targeting (core would have to go to at least 1.3). Probably not worth the effort. Though if we wanted to we could create the task once and cache it like the actual implementation of Task.CompletedTask does.

Also @thematthopkins let me know and I can help on this contribution if necessary.

@joncloud: I looked through your implementation and it uncovered an issue with the FireAndForgetResultPublisher approach. It's difficult to ensure that the registered exception handler is called. Your implementation handles this nicely, where the FireAndForgetResultPublisher would require IResultPublisher be modified to take an Action<Exception> onThrown.

        public Task Publish<T, TClean>(Result<T, TClean> result, Action<Exception> onThrown)
        {
            Task task = _innerResultPublisher.Publish(result);
#pragma warning disable 4014
            task.ContinueWith(_ =>
            {
                if (task.IsFaulted)
                {
                    onThrown(task.Exception.GetBaseException());
                }
            });
#pragma warning restore 4014
            return Task.FromResult((object)null);
        }

Making the IResultPublisher responsible for its own exception handling doesn't seem right.

With this in mind, I think it's probably best to go with your approach. If context needs to be added from something like HttpContext.Current, it could be done manually within the experiments, or there could be something like a static ContextProvider. I could put together the pull request for an ContextProvider if that seems like a good direction.

It's difficult to ensure that the registered exception handler is called

Why is that?

I could put together the pull request for an ContextProvider if that seems like a good direction.

Let's avoid adding anything until someone has a real need for it.

In the current implementation, the ResultPublisher gets invoked from ExperimentInstance like:

    try
    {
        await Scientist.ResultPublisher.Publish(result);
    }
    catch (Exception ex)
    {
        Thrown(Operation.Publish, ex);
    }

If this invokes an ExceptionThrowingFireAndForgetResultPublisher:

    public class ExceptionThrowingFireAndForgetResultPublisher : IResultPublisher
    {
        public Task Publish<T, TClean>(Result<T, TClean> result)
        {
#pragma warning disable 4014
            Task.Run(async () =>
            {
                await Task.Delay(5000);
                throw new Exception();
            });
#pragma warning restore 4014
            return Task.FromResult((object)null);
        }
    }

Then the Exception never gets caught by the ExperimentInstance, and the registered exception handler is never called (I ran this to confirm that the exception is not caught). @joncloud's solution handles this case correctly by calling the registered exception handler from within the async block.

Oh I see. That leads to a potential different problem though. Right now, we can expect that the Thrown callback is run on the current thread. If someone uses the FireAndForgetPublisher, it could mean Thrown is called on a different thread.

That might not matter. But in some cases it might.

Another option we could do is not change the Publish signature, but instead pass in the Thrown callback to the constructor of the FireAndForgetResultPublisher. Wouldn't that also solve this problem?

Just to be clear are you proposing something like:

Scientist.ResultPublisher = new FireAndForgetResultPublisher(..., ExceptionMonitor.Observe);
Scientist.Science("...", experiment => {
  experiment.Thrown(ExceptionMonitor.Observe);
});

class ExceptionMonitor {
  public static void Observe(Operation operation, Exception exception);
}

The problem with putting it in the constructor is that the FireAndForgetResultPublisher would be static and reused across experiments. Thrown is set per-experiment, so would have to be passed in per Publish call.

Making a ResultPublisher per-experiment would allow us to move it to the constructor. Making ResultPublisher per-experiment would also be nice because each experiment would then be self-contained and not have to depend on separate setup of the static ResultPublisher.

Another potential solution may be to remove support for Operation.Publish from the enum, and say Publishers are responsible for handling their own errors. That would also get rid of the ambiguity about the thread Thrown gets run on.

If i'm reading the ruby implementation correctly it looks like the actual publishing happens in the experiment instance itself, so that would align the closer if the result publisher wasn't a singleton.

In regards to Operation.Publish I think you're right. It could definitely provide focus about where exception handling should take place. The question though is how closely should the APIs mimic the ruby implementation. .NET definitely is different because of our support for async/await, and the Task based result publishing implementation.

I don't think we need to adhere strictly to the Ruby implementation given we have the benefit of tasks and they don't.

However, I have a thought. When you consider the "Forget" part of FireAndForget, it would be fair to interpret that as even exceptions are forgotten. And that the exception handling around the Publish method in this case means that it'll catch exceptions when initiating the fire and forget, but after that, nothing.

In other words, I'm fine that the FireAndForgetPublisher as suggested would not marshal exceptions to the Thrown handler.

However, some users may still want to log exceptions in the publisher so they know that experiments are going well or not. We could do that via a basic onThrown constructor argument.

Right now, I do want to keep the publisher as a singleton because it's unlikely that people would want per experiment publishers and having to set one up each time you set up an experiment seems like unnecessary work.

Thoughts?

That makes a lot of sense. Going back to your original statement I think approaching the simplest solution would be very appropriate. If there are additional needs, then we can always revisit and provide more functionality.

Yeah, I like the idea a lot as well. Added a pull request that borrows heavily from @joncloud's implementation for the unit tests.

When onPublisherException is not supplied, the exception thrown from the resultpublisher left unhandled. Couldn't figure out a way to unit test that bit though, since the exception would be thrown from wherever the fire-and-forget thread is running, and hard to catch.

I think #83 fixed this issue. Pro tip: if you have a commit or PR that fixes an issue, put the text "fixes #80" in the commit message or PR body and when the PR is merged, it'll close the issue.