Wrong app instance state returned by process stats endpoint during graceful shutdown

Question

stephanme opened this issue 8 months ago · 2 comments

Issue

When stopping an application (POST /v3/apps/:guid/actions/stop), CC sets the desired app state to STOPPED, triggers the LRP deletion at Diego for all application process instances, and returns 200 (i.e. it is a synchronous API request).
However, the actual LRPs (= app instances) may continue to run after the stop request has returned 200, because of the graceful_shutdown_interval_in_seconds that Diego grants to running processes.

There is no way for users to find out when the app process instances have really stopped (besides waiting for the graceful shutdown time plus some extra time). GET /v3/apps/:guid/processes/:type/stats reports state DOWN immediately after the app is stopped, even though the instances are still running.

This can lead to issues during graceful shutdown, e.g. when a deployment procedure unbinds service instances directly after stopping the application. Depending on the service, app instances can lose access to the service instances immediately, which leads to unintended failures during graceful shutdown.
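For illustration, a deployment script could wait between stop and unbind along the following lines. This is only a sketch: wait_until_stopped and its parameters are hypothetical names, fetch_states stands for whatever call returns the current instance states (e.g. a wrapper around the process stats endpoint), and the approach only helps if DOWN is reported once the instances are really gone, which is not the behavior described in this issue. Until then, the only option is to sleep for the graceful shutdown interval.

```python
import time


def wait_until_stopped(fetch_states, timeout_s=330.0, poll_interval_s=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until every instance reports DOWN or the timeout expires.

    fetch_states is a hypothetical callable returning an iterable of state
    strings such as "RUNNING", "STOPPING" or "DOWN".
    Returns True if all instances were DOWN, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        states = list(fetch_states())
        # An empty list means there are no instances left, which also counts as stopped.
        if all(state == "DOWN" for state in states):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

The timeout should be chosen somewhat larger than the configured graceful_shutdown_interval_in_seconds; the default of 330 seconds above is an assumption matching the 5 min interval used in the reproduction steps.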

Context

Observed on foundations that use a longer graceful shutdown interval than the default 10s.

Steps to Reproduce

configure graceful_shutdown_interval_in_seconds to a higher value for easier reproduction, e.g. 5 min
push an application that ignores SIGINT and SIGTERM, e.g. this Python example:

import os
import http.server
import socketserver
import signal

def ignore_signal(signum, frame):
    print(f"Signal handler called with signal {signal.strsignal(signum)}. Ignoring.")

signal.signal(signal.SIGINT, ignore_signal)
signal.signal(signal.SIGTERM, ignore_signal)

if __name__ == "__main__":
    port = int(os.getenv("PORT", 8080))
    # port = 8001
    with socketserver.TCPServer(("", port), http.server.SimpleHTTPRequestHandler) as httpd:
        print("serving at port", port)
        httpd.serve_forever()

stop the running app: cf stop
observe that cf stop returns immediately and that the process stats return state DOWN

cf curl /v3/apps/db84b476-a386-4308-b517-f609b586c8af/processes/web/stats | jq
{
  "resources": [
    {
      "type": "web",
      "index": 0,
      "state": "DOWN",
      "routable": null,
      "uptime": 0,
      "isolation_segment": null,
      "details": null
    }
  ]
}

check the application logs to verify that the app instance containers were destroyed only later, when graceful_shutdown_interval_in_seconds expired, i.e. after the process stats had already reported DOWN
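The state check from the steps above can also be scripted. The following minimal sketch extracts the per-instance states from a stats response; instance_states is a hypothetical helper name, and the sample is a shortened version of the response shown above.

```python
import json
from typing import List


def instance_states(stats_json: str) -> List[str]:
    """Return the "state" of every entry in the "resources" list."""
    doc = json.loads(stats_json)
    return [resource["state"] for resource in doc.get("resources", [])]


# Shortened copy of the response from the reproduction steps.
SAMPLE = """
{
  "resources": [
    {"type": "web", "index": 0, "state": "DOWN", "uptime": 0, "details": null}
  ]
}
"""

if __name__ == "__main__":
    print(instance_states(SAMPLE))
```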

Expected result

GET /v3/apps/:guid/processes/:type/stats returns DOWN only when the app process instances are not running anymore. During graceful shutdown, a state of RUNNING or maybe STOPPING should be reported.

Current result

GET /v3/apps/:guid/processes/:type/stats returns DOWN immediately after stopping the app, even though the app process instances are still running during the graceful shutdown period.

Possible Fix

instances_stats_reporter.rb should not simply report DOWN for an app instance when the desired LRP is not found. Instead, it should additionally request the actual LRP to determine the instance state.

When the desired LRP doesn't exist but an actual LRP still exists, the state could be set to STOPPING.
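The proposed decision can be sketched as follows. This is a simplified illustration in Python, not the actual Ruby code in instances_stats_reporter.rb; the function name and its two inputs are assumptions made for the example, and the handling of the case where the desired LRP still exists is reduced to a pass-through.

```python
from typing import Optional


def instance_state(desired_lrp_exists: bool, actual_lrp_state: Optional[str]) -> str:
    """Derive the reported state of one app instance.

    actual_lrp_state is None when no actual LRP exists for the instance.
    """
    if desired_lrp_exists:
        # Normal case (simplified): report the actual state, DOWN if there is none.
        return actual_lrp_state or "DOWN"
    if actual_lrp_state is not None:
        # Desired LRP already deleted, but the instance is still running:
        # this is the graceful shutdown period.
        return "STOPPING"
    # Neither desired nor actual LRP exists: the instance is really gone.
    return "DOWN"
```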

Answer 1 · 2024-06-20T13:00:30.000Z

The possible fix was applied with #3834.

Answer 2 · 2024-06-26T08:34:54.000Z

Will be shipped with https://github.com/cloudfoundry/capi-release/releases/tag/1.184.0.
Closing the issue.