Wrong app instance state returned by process stats endpoint during graceful shutdown
stephanme opened this issue · 2 comments
Issue
When stopping an application (POST /v3/apps/:guid/actions/stop), CC sets the desired app state to STOPPED, triggers the LRP deletion at Diego for all application process instances, and returns 200 (i.e. the API request is synchronous).
However, the actual LRPs (= app instances) may continue to run after the stop request finished with 200 because of the graceful_shutdown_interval_in_seconds that Diego grants to running processes.
There is no way for users to find out when the app process instances have really stopped (besides waiting for the graceful shutdown interval plus some extra time). GET /v3/apps/:guid/processes/:type/stats immediately returns the state DOWN after stopping the app, even though the instances are still running.
This can lead to issues during graceful shutdown, e.g. when a deployment procedure unbinds service instances directly after stopping the application. Depending on the service, app instances can lose access to the service instances immediately, which leads to unintended failures during graceful shutdown.
Context
Observed on foundations that use a longer graceful shutdown interval than the default 10s.
Steps to Reproduce
- configure graceful_shutdown_interval_in_seconds to a higher value for easier reproduction, e.g. 5 min
- push an application that ignores SIGINT and SIGTERM, e.g. this python example
import os
import http.server
import socketserver
import signal


def ignore_signal(signum, frame):
    print(f"Signal handler called with signal {signal.strsignal(signum)}. Ignoring.")


signal.signal(signal.SIGINT, ignore_signal)
signal.signal(signal.SIGTERM, ignore_signal)

if __name__ == "__main__":
    port = int(os.getenv("PORT", 8080))
    # port = 8001
    with socketserver.TCPServer(("", port), http.server.SimpleHTTPRequestHandler) as httpd:
        print("serving at port", port)
        httpd.serve_forever()
- stop the running app: cf stop
- observe that cf stop returns immediately and that the process stats return state DOWN:
cf curl /v3/apps/db84b476-a386-4308-b517-f609b586c8af/processes/web/stats | jq
{
  "resources": [
    {
      "type": "web",
      "index": 0,
      "state": "DOWN",
      "routable": null,
      "uptime": 0,
      "isolation_segment": null,
      "details": null
    }
  ]
}
- check the application logs to confirm that the app instance containers were destroyed only after the process stats had already reported DOWN, i.e. once graceful_shutdown_interval_in_seconds expired
Expected result
GET /v3/apps/:guid/processes/:type/stats returns DOWN only when the app process instances are no longer running. During graceful shutdown, a state of RUNNING or maybe STOPPING should be reported.
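With the expected behavior, a deployment script could poll the stats endpoint and only unbind service instances once every instance reports DOWN. The following is a minimal sketch, assuming a logged-in and targeted cf CLI. The helper names (fetch_stats, all_instances_down, wait_until_down) and the timeout and polling values are hypothetical and not part of CC or the cf CLI. With the current behavior, this loop would return right away, which is the problem described in this issue.

```python
import json
import subprocess
import time


def fetch_stats(app_guid, process_type="web"):
    """Fetch process stats via `cf curl` (assumes a logged-in, targeted cf CLI)."""
    out = subprocess.run(
        ["cf", "curl", f"/v3/apps/{app_guid}/processes/{process_type}/stats"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)


def all_instances_down(stats):
    """Return True if every instance in a stats response reports state DOWN."""
    return all(r.get("state") == "DOWN" for r in stats.get("resources", []))


def wait_until_down(fetch, timeout=600.0, poll_interval=5.0, sleep=time.sleep):
    """Poll fetch() until all instances are DOWN; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if all_instances_down(fetch()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval)
```

A caller would pass a closure over the app GUID, for example wait_until_down(lambda: fetch_stats(app_guid)), and proceed with unbinding only if it returns True.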
Current result
GET /v3/apps/:guid/processes/:type/stats returns DOWN immediately after stopping the app, even though the app process instances are still running during the graceful shutdown period.
Possible Fix
instances_stats_reporter.rb should not simply report DOWN for an app instance when the desired LRP is not found. Instead, it should additionally request the actual LRP to determine the instance state.
When the desired LRP doesn't exist but an actual LRP still exists, the state could be set to STOPPING.
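The proposed decision can be sketched as a small function. This is a Python illustration of the logic described above, not the Ruby code in instances_stats_reporter.rb. The function name and its parameters are hypothetical, STOPPING is the state name suggested in this issue, and reporting DOWN when a desired LRP exists without an actual LRP is an assumption made for the sketch.

```python
def reported_state(desired_lrp_exists, actual_lrp_exists, actual_state="RUNNING"):
    """Return the state to report for one app instance.

    actual_state is a hypothetical stand-in for the state that would be
    derived from the actual LRP while the desired LRP still exists.
    """
    if desired_lrp_exists:
        # Assumption: with a desired LRP, report the actual LRP's state,
        # or DOWN if no actual LRP exists.
        return actual_state if actual_lrp_exists else "DOWN"
    if actual_lrp_exists:
        # Desired LRP is gone but the instance still runs: graceful shutdown.
        return "STOPPING"
    # Neither desired nor actual LRP exists: the instance has really stopped.
    return "DOWN"
```

The key change compared to the current behavior is the middle branch: a missing desired LRP alone no longer implies DOWN.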
Will be shipped with https://github.com/cloudfoundry/capi-release/releases/tag/1.184.0.
Closing the issue.