dotnet/runtime

`Microsoft.Extensions.Hosting.HostOptions.ShutdownTimeout` default value of 5s is too short

ReubenBond opened this issue · 3 comments

The HostOptions.ShutdownTimeout value determines how long the host waits for services to gracefully terminate before initiating ungraceful termination.

The default value is currently 5 seconds, which is too short for many production applications. Processes in such applications often have state that needs to be stored in a database, or responsibilities that need to be offloaded onto their surviving peers. For example, consider a process that is the leader in a quorum of processes. In that case, it may take longer than 5s to reliably transfer leadership to a new leader before shutdown.
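To make the scenario concrete, here is a minimal sketch of a hosted service whose cleanup outlasts the default timeout. `LeaderService` and the 10 second delay are hypothetical stand-ins for a real leadership hand-off. As I understand it, the token passed to `StopAsync` is cancelled once `ShutdownTimeout` elapses.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Hypothetical service: the name and the hand-off logic are illustrative only.
public sealed class LeaderService : IHostedService
{
    public Task StartAsync(CancellationToken cancellationToken) => Task.CompletedTask;

    public async Task StopAsync(CancellationToken cancellationToken)
    {
        try
        {
            // Stand-in for transferring leadership to a surviving peer.
            // Assumed to take about 10 seconds, longer than the 5 second default.
            await Task.Delay(TimeSpan.FromSeconds(10), cancellationToken);
        }
        catch (OperationCanceledException)
        {
            // The shutdown timeout elapsed before the hand-off finished,
            // so the service stops without completing its cleanup.
        }
    }
}
```

With the current default, this service would normally hit the `catch` branch. With a 30 second timeout, the hand-off would have time to complete.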

Applications can never rely on graceful shutdown for continued correct operation, but this issue concerns the cases where graceful shutdown is possible.

In principle, the default HostOptions.ShutdownTimeout value should be long enough that functioning applications can shut down gracefully and short enough that misbehaving processes do not prevent the application from functioning for too long. Both are subjective bounds, but I suggest we change the default value to 30 seconds, in line with Kubernetes (see terminationGracePeriodSeconds in the reference docs):

> Optional duration in seconds the pod needs to terminate gracefully. May be decreased in delete request. Value must be non-negative integer. The value zero indicates delete immediately. If this value is nil, the default grace period will be used instead. The grace period is the duration in seconds after the processes running in the pod are sent a termination signal and the time when the processes are forcibly halted with a kill signal. Set this value longer than the expected cleanup time for your process. Defaults to 30 seconds.

Of course, this value is configurable, but we don't want developers to need to configure it for typical applications. It's better to err towards the value being too high than too low (i.e., we shouldn't optimize for malfunctioning applications).
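For reference, this is roughly the configuration each application has to add today to get a longer timeout. It is a minimal sketch assuming the generic host with top-level statements. The 30 second value is the one proposed above, not the current default.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

IHost host = Host.CreateDefaultBuilder(args)
    .ConfigureServices(services =>
    {
        // Override the host's shutdown timeout (the default is currently 5 seconds).
        services.Configure<HostOptions>(options =>
        {
            options.ShutdownTimeout = TimeSpan.FromSeconds(30);
        });
    })
    .Build();

await host.RunAsync();
```

Changing the default would make this override unnecessary for typical applications.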

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

Tagging subscribers to this area: @dotnet/area-extensions-hosting
See info in area-owners.md if you want to be subscribed.

Author: ReubenBond
Assignees: -
Labels: untriaged, area-Extensions-Hosting, in pr

Milestone: -

@eerhardt assigned this to you since you're reviewing the PR.