Azure/azure-webjobs-sdk

Provide a configurable retry policy for function execution failures

Closed this issue · 31 comments

The issue was created from discussion here.

The current design of how event hub triggered functions work is optimized around making sure that bad messages or incorrect function code do not prevent the system from processing new messages and do not create infinite processing loops that generate out-of-control bills in the Consumption plan. However, this behavior has a downside: it makes it difficult for the application developer to ensure that their function successfully processes ALL messages written to the event hub.

In particular, even if the application developer implements their function with general error handling that catches a processing failure and attempts to store the message for later processing, this error handling won't help them if their function failed to run in the first place, perhaps due to some sort of assembly loading issue (this can happen due to user error or due to platform issues).

This issue tracks adding a configurable retry. If dispatching an execution to the function fails (either within the function or before the user code even runs), then the execution is retried up to the specified number of times, with some backoff. These retries would only occur within the currently executing process - they are not persisted anywhere, and if the process crashes or shuts down then the retry count is essentially reset. A checkpoint would not be written until all executions for a given batch have completed successfully or the retry count is reached. The retry count can be set to a special "infinite" value to indicate that a batch should ONLY be checkpointed when all function invocations for the batch complete successfully.
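
For illustration only, here is a rough C# sketch of the semantics described above. The helpers ExecuteFunctionAsync and CheckpointAsync are placeholders, not SDK APIs, and a negative retry count stands in for the "infinite" setting:

```csharp
// Illustrative sketch of the proposed semantics; not actual WebJobs SDK code.
async Task ProcessBatchAsync(IReadOnlyList<EventData> batch, int maxRetries, TimeSpan backoff)
{
    for (int attempt = 0; ; attempt++)
    {
        try
        {
            await ExecuteFunctionAsync(batch); // dispatch the batch to the user function (placeholder)
            break;                             // success: checkpoint below
        }
        catch (Exception)
        {
            if (maxRetries >= 0 && attempt >= maxRetries)
                break;                         // finite policy exhausted: checkpoint and move on
            await Task.Delay(backoff);         // retries stay in-process; nothing is persisted
        }
    }

    // The checkpoint is only written once the batch has succeeded or a finite retry
    // count has been reached. With the "infinite" setting (maxRetries < 0) the loop
    // never exits on failure, so the checkpoint only follows a fully successful batch.
    await CheckpointAsync(batch); // placeholder for the Event Hubs checkpoint call
}
```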

Turning this feature on has some implications that are worth spelling out. Say you have function code that does not handle all errors and there is a message on the event hub that causes the function to throw an exception and fail. If this function app was configured with an "infinite" retry policy then processing of the partition that this message was on would essentially get "stuck". Any data on that partition that was written after the bad message would not be processed until the bad message falls off the event hub due to its retention policy or the function code is updated by the developer to handle the bad message. In the meantime, the other events in that batch might be reprocessed many times. Every time that partition gets a new owner, it would resume from the "stuck" batch and reprocess all the messages in that batch. This could theoretically lead to a given message on the event hub being processed by the function code hundreds or thousands of times.

This feature has significant parallels with a current PR open against the cosmosdb binding extension:
Azure/azure-webjobs-sdk-extensions#349

More review is required to determine whether this feature would be useful as described, whether it has other side effects that need to be discussed, and whether this feature can land in functions V2 only or if it would also have to be backported to V1.

/cc @tobiasb

๐Ÿ‘ for an "infinite" retry value. Please consider making the backoff strategy configurable as well. For us, whenever we have external dependencies we will implement the backoff strategy in application code so I would be interested in an immediate retry.

Assuming this is something you consider worth implementing, I am very interested in the timeline, as I am the one responsible for finding an alternative for us if this won't be resolved by a certain date.

Thank you for working with us on this!

There are some patterns you can use with Event Hubs today to handle some retries in the function itself. I recently posted a blog about it that you can check out. I imagine many of these patterns could become host or trigger settings and actually run in the event hub listener for the host.
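
As a point of reference, the in-function pattern typically looks something like the sketch below (a minimal example, not taken from the blog; the hub name, connection setting, retry count, and ProcessEventAsync helper are all placeholders):

```csharp
[FunctionName("EventHubRetryInFunction")]
public static async Task Run(
    [EventHubTrigger("my-hub", Connection = "EventHubConnection")] EventData[] events,
    ILogger log)
{
    foreach (var eventData in events)
    {
        // Retry each event a few times inside the function so a transient failure
        // does not cause the listener to checkpoint past an unprocessed event.
        for (int attempt = 1; attempt <= 3; attempt++)
        {
            try
            {
                await ProcessEventAsync(eventData); // your processing logic (placeholder)
                break;
            }
            catch (Exception ex) when (attempt < 3)
            {
                log.LogWarning(ex, "Attempt {Attempt} failed, retrying", attempt);
                await Task.Delay(TimeSpan.FromSeconds(attempt));
            }
            // On the final failed attempt the exception propagates; today the trigger
            // still checkpoints in that case, which is the gap this issue is about.
        }
    }
}
```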

Quick fix: since a retry policy would not help if a downstream system is down for a few hours, you can call Process.GetCurrentProcess().Kill(); in your exception handling. This stops the checkpoint from moving forward. I have tested this with a Consumption-based function app. You will not see anything in the logs, but I added an email notification so I know something went wrong, and to avoid data loss I killed the function instance.
Hope this helps.
I will put up a blog about it and the other part of the workflow, where I stop the function with a Logic App in case of continuous failures on the downstream system.
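
A rough sketch of this workaround (treat it as the commenter's approach rather than a supported pattern; NotifyByEmailAsync and ProcessEventAsync are placeholders, and System.Diagnostics is required for Process):

```csharp
try
{
    await ProcessEventAsync(eventData); // your processing logic (placeholder)
}
catch (Exception ex)
{
    // Send an out-of-band alert first, since nothing will show up in the logs
    // once the process is killed.
    await NotifyByEmailAsync(ex);

    // Kill the host process so the listener never gets the chance to advance
    // the checkpoint past the failed event.
    Process.GetCurrentProcess().Kill();
}
```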

I agree that more control over the retry policy would be really helpful to combine the throughput of event hubs with a sufficient level of reliability, somewhat similar to what queues provide.

It may also be worth looking at the specific behaviour in the case of function timeouts (Microsoft.Azure.WebJobs.Host.FunctionTimeoutException). According to my tests, a timeout results in the same behaviour as any exception during function execution (the checkpoint is still performed). I understand that this makes sense as a default behaviour, to avoid getting stuck on an execution that would always exceed the specified timeout, but in some instances, like transient issues with slow external resources, it would be useful to be able to retry.

Adding P1 so we can discuss this in upcoming sprint planning and see when we can start work on it.

// @fabiocav FYI

@alrod my thinking would be to start fleshing out the design. Would be great if you could set up a short sync / brain dump with myself, @paulbatum, and even @ealsur (from CosmosDB) to look at options and share scenarios.

@alrod has an initial design for review. Once we settle on that, we'll move forward with implementation.

Hey all - we met today and @alrod shared his proposal. I tried to put together one potential option in a sample here. It has a README that lays out the behavior, and a sample of what an Event Hub trigger function could look like that allows users to explicitly decide whether to checkpoint or not. Feel free to check it out and add feedback to this issue. https://github.com/jeffhollan/retry-design

@jeffhollan we are excited about this development. 👍

One outstanding concern we have is with how checkpointing is handled for errors that occur in the Function/binding middleware. When we first started trying out Event Hub for a big project in 2018, we quickly discovered that unhandled exceptions occurring before our Function method body (i.e. application code) would not stop the event hub trigger from checkpointing. This forced us to write a custom IEventProcessor#ProcessEventsAsync() implementation that wraps the await TryExecuteAsync() in a try-catch, retries on exceptions 5 times, and finally stores a lock for the failing partition key (not partition id) in table storage (we require ordered events).
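
For readers hitting the same gap, that wrapper has roughly the following shape (a condensed sketch; the executor wiring and the table-storage lock are elided or placeholder, and the types come from Microsoft.Azure.EventHubs.Processor and Microsoft.Azure.WebJobs.Host.Executors):

```csharp
public class RetryingEventProcessor : IEventProcessor
{
    private readonly ITriggeredFunctionExecutor _executor; // supplied by host wiring (elided)
    private const int MaxAttempts = 5;

    public Task OpenAsync(PartitionContext context) => Task.CompletedTask;
    public Task CloseAsync(PartitionContext context, CloseReason reason) => Task.CompletedTask;
    public Task ProcessErrorAsync(PartitionContext context, Exception error) => Task.CompletedTask;

    public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
    {
        foreach (var message in messages)
        {
            var input = new TriggeredFunctionData { TriggerValue = message };

            FunctionResult result = null;
            for (int attempt = 1; attempt <= MaxAttempts; attempt++)
            {
                result = await _executor.TryExecuteAsync(input, CancellationToken.None);
                if (result.Succeeded) break;
            }

            if (result != null && !result.Succeeded)
            {
                // Placeholder: persist a lock for the failing partition key in table
                // storage so ordered processing can be repaired/resumed manually.
                await StorePartitionKeyLockAsync(message.SystemProperties.PartitionKey);
                return; // stop before checkpointing so the failed event is not skipped
            }
        }

        await context.CheckpointAsync(); // only checkpoint once everything succeeded
    }
}
```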

We run this custom IEventProcessor in a continuous WebJob in an App Service, which means we cannot take advantage of scaling out like we could with the Function trigger. We are at the point of really needing to scale out in production today.

My question is: can this new retry proposal also handle unhandled exceptions that happen in Microsoft.Azure.WebJobs.EventHubs.EventProcessor#ProcessErrorAsync()? Unless I am misunderstanding that method, it will continue checkpointing if there is an error in there, yes?

@jeffhollan We at my company are also excited about the additional configuration for event hub triggered functions. As @gabrieljoelc mentioned, using Event Hub triggered functions often requires an almost paranoid amount of exception checking and error handling, and even that is not always enough. Having a configurable retry policy will go a long way toward improving the resilience and reliability of our Function Apps.

Perhaps this issue is not the place, but it would be nice if we could configure the retry policy to put any events/batches that failed the retry policy into a failed/poison event hub so that they can be reprocessed as needed. Is this something that can be included?
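
In the meantime, a manual version of that idea can be approximated in the function itself by forwarding events that exhaust your own retries to a second event hub (the "PoisonEventHubConnection" setting below is a placeholder and is assumed to include the EntityPath of the poison hub):

```csharp
private static readonly EventHubClient PoisonClient =
    EventHubClient.CreateFromConnectionString(
        Environment.GetEnvironmentVariable("PoisonEventHubConnection")); // placeholder setting

private static async Task DeadLetterAsync(EventData failedEvent)
{
    // Copy the body (and any properties you care about) to the poison hub so the
    // original batch can still be checkpointed and the event reprocessed later.
    await PoisonClient.SendAsync(new EventData(failedEvent.Body));
}
```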

My question is: can this new retry proposal also handle unhandled exceptions that happen in Microsoft.Azure.WebJobs.EventHubs.EventProcessor#ProcessErrorAsync()? Unless I am misunderstanding that method, it will continue checkpointing if there is an error in there, yes?

@gabrieljoelc I'll have to have @alrod clarify that as I don't know what the implementation would mean for this level

Perhaps this issue is not the place, but it would be nice if we could configure the retry policy to put any events/batches that failed the retry policy into a failed/poison event hub so that they can be reprocessed as needed. Is this something that can be included?

@TeraInferno great idea, likely makes sense to create a separate issue in this repo specifically for that feature. It's just a little larger as we'd also need the ability for the user to define / declare a 'deadletter' event hub name, but I love the idea.

@jeffhollan, is this (the retry design) already available for use? We're suffering from missing events when we put a lot of pressure on an event hub. We've wrapped almost every part of our code with a try-catch but it's not enough; we're still seeing missing events.

@jeffhollan looks pretty good. It's a shame that throwing an exception won't result in the batch being retried, but I guess that ship has sailed.

Would it be possible to add a circuit breaker into this design? Today we had a scenario where a downstream dependency was throwing a non-transient exception and we needed to stop while we fixed it. A deadletter would also help in this scenario, as @TeraInferno mentioned.

@jelther not ready yet - @alrod is working on it.

@DanielLarsenNZ a circuit breaker adds an additional element of complexity, as it requires keeping some "state" for the circuit. This could be in memory, but that wouldn't get you very far if an instance was recycled and the state was lost. I've written this blog post about a pattern I've used. https://dev.to/azure/serverless-circuit-breakers-with-durable-entities-3l2f

We have a design for this item, work is currently in progress. Tracking this for completion in sprint 72.

@jeffhollan @fabiocav
Which Functions versions will this be available on?
Does Durable Functions suffer from the same problem, and will it also receive this feature?

@fabiocav will this design you're working on also be usable for CosmosDB triggered functions?

alrod commented

PR: #2463

alrod commented

@fabiocav will this design you're working on also be usable for CosmosDB triggered functions?

@ThomasVandenbon, we have plans to use it in EventHub, CosmosDB and Kafka. First trigger we are going to support is EventHub.

PR is open. Moving this to sprint 73 for completion.

@alrod from my comment above:

My question is: can this new retry proposal also handle unhandled exceptions that happen in Microsoft.Azure.WebJobs.EventHubs.EventProcessor#ProcessErrorAsync()? Unless I am misunderstanding that method, it will continue checkpointing if there is an error in there, yes?

will this PR handle retries for this?

Sorry to be late to the party. I am rather new to EventHub and quite surprised to learn about the existing behavior. I landed here after trying to find a thorough API definition document declaring the behavioral consequences of the possible interactions between my code and the function infrastructure (thank you for the blogs and example code Jeff!). I am now quite worried about data loss and inconsistency in my materializations of the log.

The retry proposal offered seems like a great and sufficiently workable start. Thank you for that and the associated PR. I am particularly glad to see the "infinite" retry option. I expect processing for the associated partition to cease if a function instance continuously fails to process its messages, especially if it throws an exception back into the function infrastructure context (i.e. the user code has lost control by passing execution back to the infrastructure due to an unexpected and unhandled failure; perhaps this could also be made configurable). I can control my flow and would be perfectly happy to explicitly indicate to the infrastructure that a retry should occur by returning an object/value or calling a provided method (as indicated in the proposal; still, I'd prefer a solid default behavior over feeding that with logic). I was a little surprised that the proposal suggests that retry behavior would be defined at the scope of the host as opposed to that of the function (i.e. the unit of failure).

Ideally, under continued processing failures, the retry frequency would be reduced with a dithered exponential backoff until some reasonable ceiling is hit (maybe single- or double-digit minutes?). Also ideally, the frequency would be reset if a new deployment of the function occurs (i.e. a fix is deployed), and of course if the previously failing message(s) are at any point successfully processed.
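
The dithered (jittered) exponential backoff being asked for here is a standard calculation; a minimal sketch, assuming a 10-minute ceiling:

```csharp
private static readonly Random Jitter = new Random();

// Exponential backoff with full jitter, capped at an assumed 10-minute ceiling.
private static TimeSpan NextDelay(int attempt)
{
    double cappedSeconds = Math.Min(Math.Pow(2, attempt), TimeSpan.FromMinutes(10).TotalSeconds);
    return TimeSpan.FromSeconds(Jitter.NextDouble() * cappedSeconds);
}
```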

Basically, I expect the presses to be stopped for a partition if its messages cannot be processed. I want to wake up (or take a look in the morning, depending on business process criticality) if this ever happens (and I do a lot to make sure it never does). To support the developer workflow around the configuration I am describing, I would very much like to be able to run a function, send an email, or have some other mechanism for setting off my mitigation processes. Metrics that would support this might include "how many repeated failures have been observed?", "how far behind the write offset is a function?", et cetera.

Is this retry feature available to use now? I can see the PR is still open.

Is there any ETA for getting this feature? We badly need it; it's very painful that it is taking such a long time to implement.

@harishbhattbhatt no ETA at this point, but this is being prioritized, so we hope to have this land soon.

Thank you for your patience!

I'm still interested in whether the retries will cover exceptions in the function extension, as I've previously posted. I added another comment on PR #2463 for more context.

@gabrieljoelc - regarding your comment on #2463:

  • The PR adds generic support for function execution retries in the WebJobs SDK
  • The delay between retries can be customized by the user
  • In the case of event hubs, using retry means that on function invocation failure, checkpointing will only happen after all the retry attempts have been exhausted.
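
Assuming the attribute-based surface that the WebJobs SDK retry support eventually exposed, a fixed-delay retry on an Event Hub triggered function looks roughly like this (the hub name, connection setting, and ProcessEventAsync helper are placeholders; check the PR for the authoritative API):

```csharp
[FunctionName("EventHubWithRetry")]
[FixedDelayRetry(5, "00:00:10")] // retry up to 5 times with 10 seconds between attempts
public static async Task Run(
    [EventHubTrigger("my-hub", Connection = "EventHubConnection")] EventData[] events,
    ILogger log)
{
    foreach (var eventData in events)
    {
        await ProcessEventAsync(eventData); // placeholder processing logic
    }
    // With retries enabled, the Event Hubs listener only checkpoints after the
    // invocation succeeds or every retry attempt has been exhausted.
}
```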

I'm also interested in this. Just throwing it out there, but something like what AWS Lambda does, called Bisect on Function Error, would be useful. That could potentially cut down on duplicate data when an invocation with a large batch size fails.
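
For reference, the Lambda behavior mentioned above splits a failed batch in half and retries each half independently; a minimal recursive sketch of the idea (ProcessSingleAsync is a placeholder):

```csharp
// Bisect-on-error: if a batch fails, split it and retry the halves, so the offending
// event(s) end up isolated and less of the batch is reprocessed or duplicated.
private static async Task ProcessWithBisectAsync(IReadOnlyList<EventData> batch)
{
    try
    {
        foreach (var eventData in batch)
            await ProcessSingleAsync(eventData); // placeholder processing logic
    }
    catch (Exception) when (batch.Count > 1)
    {
        int mid = batch.Count / 2;
        await ProcessWithBisectAsync(batch.Take(mid).ToList());
        await ProcessWithBisectAsync(batch.Skip(mid).ToList());
    }
    // A batch of one failing event still throws here, which is where a retry
    // policy or a dead-letter event hub would take over.
}
```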

@pragnagopa thank you for the feature summary.

However, this doesn't answer my question about errors happening in the function event hub trigger middleware (before it runs the function body).

@gabrieljoelc - Checkpointing should not happen if exceptions are hit before the function code is even executed. Please open a separate issue for that. It would be great if you can attach a simple repro.

Addressed in PR #2463

Unfortunately, this only works for in-process functions, as the retry policy does not appear to be supported in isolated functions, so in the case of cancellation/shutdown etc. the checkpoint would still advance.