Azure/azure-functions-durable-extension

Activities fail with timeouts when the .NET worker process crashes

Opened this issue · 11 comments

Description

Hello team, we have an activity function which runs fairly slow (we understand that's not the best practice and we are working on a rework of the related code). During the execution of the activity function, another function timeout triggered a restart of the function process and terminated the execution of the said activity function. After the restart, the activity function never seemed to be picked up again and eventually lead to a timeout of the parent orchestration function - the behavior results into cascaded timeout & restarts.

We want to understand what might have caused the activity function to not be picked up? We have set the maxConcurrentActivityFunctions to 4000 and according to our telemetry the threshold didn't get hit. After the restart, the function process on the particular worker seemed to be working fine as we observed telemetry from other functions, just not the particular one that got terminated. Thread count / memory / CPU all look normal for the function app on the worker instance.

Expected behavior

unfinished activity function to be picked up again after the function process restarts

Actual behavior

unfinished activity never get picked up

App Details

  • Durable Functions extension version (e.g. v1.8.3): Microsoft.Azure.Functions.Worker.Extensions.DurableTask (isolated mode) 1.1.2
  • Azure Functions runtime version (1.0 or 2.0): v4
  • Programming language used: C#

Hi @dxynnez. I wonder if you might be running into a variant of #2788, where a worker process crash causes the orchestration to fail rather than being automatically retried. If so, we recently released a fix in the v1.1.7 release. Can you try this updated nuget package and let us know if it resolves your issue?

/cc @davidmrdavid @andystaples

Hi @cgillum , thanks for helping on this. But I am not entirely sure if the root cause is the same as:
the Durable Functions extension running in the host, detects this crash but misinterprets the error it sees as an application failure rather than an infrastructure failure

So our activity function would emit logs at different stages of execution. However, we did not see any log from the particular activity function after the restart of the worker process (that’s why we suspect that the unfinished activity function didnt get picked up again after the restart) - we only see logs from the particular activity function before the restart. But the activity function itself did not timeout as per our logs it only ran for like ~20mins (not finished because of the worker process restart). The restart was caused by other function which ran more than 1hr and timeout.

@dxynnez thanks for the additional context. I recognize that the symptoms you are seeing are different from what others have reported, but I still think there's a chance that this might be resolved in the v1.1.7 release, which addresses several possible problems related to worker process restarts. If you can, please upgrade to this version of the extension to see if it improves the situation. Otherwise, please open an Azure support request so that we can look more closely at our system telemetry to see if we can further root cause what happened.

Thanks @cgillum . we will try that.

One question though - would the orchestration function itself timeout? Also when an orchestration function replays and invokes an already-started but not-yet-finished activity function, if it's already more than the 'functionTimeout' time past the original activity started time, would the orchestration function deem that the activity function already timeout and fail directly, without waiting for an explicit timeout signal from the activity function?

The orchestration function was throwing:

Microsoft.DurableTask.TaskFailedException: Task '#_TheActivityFunction_#' (#1) failed with an unhandled exception: Timeout value of 01:00:00 was exceeded by function: Functions.#_TheActivityFunction_#
 ---> DurableTask.Core.Exceptions.TaskFailedException: Exception of type 'DurableTask.Core.Exceptions.TaskFailedException' was thrown.

Would the orchestration function itself timeout?

No, orchestrator function execution is considered "completed" by the Functions runtime whenever they await an activity or some other task (even though the orchestration itself remains in a "Running" state). This is why orchestrator functions can run for infinite periods of time, and also why a long activity execution will not cause the orchestrator function to timeout.

Also when an orchestration function replays and invokes an already-started but not-yet-finished activity function, if it's already more than the 'functionTimeout' time past the original activity started time, would the orchestration function deem that the activity function already timeout and fail directly, without waiting for an explicit timeout signal from the activity function?

If an orchestration successfully schedules an activity function to run, it will never schedule it to run again during a replay. It doesn't matter whether the activity actually started, completed, or is still running.

Task '#TheActivityFunction#' (#1) failed with an unhandled exception: Timeout value of 01:00:00 was exceeded by function: Functions.#TheActivityFunction#

Aha, this tells me that the activity failed with an exception, so it was considered "complete" by the framework and therefore doesn't need to be retried. I think this is fundamentally what the issue is - the timeout caused the activity to fail, which is why it doesn't get retried. The fact that the worker process was also restarted wouldn't matter, since it restarted after the failure event was sent to and received by the orchestration. To work around this, you can simply introduce a retry policy for your slow activity function call. That way, the orchestration knows that it should retry the activity function if it sees an exception like this.

Documentation on retry policies can be found here: https://learn.microsoft.com/azure/azure-functions/durable/durable-functions-error-handling?tabs=csharp-inproc#automatic-retry-on-failure

Thanks @cgillum !

the timeout caused the activity to fail, which is why it doesn't get retried

the case is more complicated than that: so the particular invocation of the activity function didn't timeout, it executed for around 20mins (we set the 'functionTimeout' to 1hr) before being terminated as a result of the worker process restart, which was triggered by a timeout from another function. After the worker restarted, we did not see any sign of the particular activity function being picked up again (no logs) - we expect that after the restart the worker should somehow rerun the activity function as it's 'incomplete', but we only observe that the above exception being thrown from the owning orchestration function, a few minutes past the worker restart as at that point it was indeed 1hrs past when the activity function was scheduled.

it was considered "complete" by the framework and therefore doesn't need to be retried

I think the confusing part is how would the framework consider the activity function to be complete - the only execution of the activity function got terminated half way through and never got picked up again.

Retry policy probably wouldn't help in our case - since this is a cascaded timeout events, the retry would most likely get terminated again when the worker process get restarted.

I hope that makes things clearer.

@dxynnez I understand the nuance you're describing, but I'm trying to rationalize it with the error message you shared earlier:

Microsoft.DurableTask.TaskFailedException: Task '#TheActivityFunction#' (# 1) failed with an unhandled exception: Timeout value of 01:00:00 was exceeded by function: Functions.#TheActivityFunction#
---> DurableTask.Core.Exceptions.TaskFailedException: Exception of type 'DurableTask.Core.Exceptions.TaskFailedException' was thrown.

Is the orchestration that encountered this error the parent of the activity call which was killed after 20 minutes (in which case the error message is wrong) and isn't retrying? Or is this error message associated with some other orchestration?

@cgillum

Is the orchestration that encountered this error the parent of the activity call which was killed after 20 minutes (in which case the error message is wrong) and isn't retrying?

Yes, the orchestrator is the parent of the mentioned activity function. The error is thrown when the orchestrator function got replayed, after the worker process restart. I am not sure how many times the orchestrator got replayed, but we did not see any sign of the re-execution of the activity function - again the activity function did not finish in its 1st execution (got terminated). We expect after the worker process restart, the framework should notice its unfinished state and restart its execution, which never seemed to happen.

Thanks @dxynnez. I'm consulting with a few other engineers on the team, and one theory we have is that the functions host is not being properly notified when there is a worker process recycle. As a result, your activity function may have been killed after just 20 minutes (due to the recycle) but the host doesn't know this, and thinks the activity is still running. As a result, it eventually hist the 1-hour timeout limit and is reported as failed (which is why it doesn't get retried).

We're still working through the details, but the bug you're encountering might be in the Azure Functions host. We'll share more updates when we have them.

Thank @cgillum ! That sounds like a reasonable guess, please keep us updated!

Before this can be fixed in the extension, the Functions Host should fix Azure/azure-functions-host#10580