inadarei/rfc-healthcheck

Add structured "impacts" field for graceful degradation

smoyer64 opened this issue · 0 comments

Many micro-services have only a few dependencies, and it's pretty obvious that when the back-end's database connection has failed, the micro-service is completely down. Things aren't nearly so clear when the hierarchy of micro-services is more than one layer deep. I've added #51 in an effort to allow components to incorporate the top-level health response object as a check, but this hierarchy obviously also affects how you calculate the top-level status. The failure of some components might only result in a top-level warn status. Or maybe a full-text indexer's failure wouldn't change a pass status, since you could conceivably catch up later (and operate without it until then).

This is even more obvious when you consider the recommended UI practice of graceful degradation. For instance, our user account management system relies on about 20 different back-end systems but the UI only outright fails for a few of them. In such cases, we need a way for the UI's healthcheck end-point to report on the impact that each degraded or failed component has on the UI's back-end. We'd propose something like:

  ...
  "impacts": [
    {
      "impactId": <uuid string>,
      "checkKey": <string>,
      "impactDetail": <string>,
      "recommendsStatus": <status string>
    },
    ...
  ],
  ...

Three important notes about the format above:

  1. The impactId field is primarily for debugging and log analysis. While a random UUID is suggested, the important aspect of this ID is that it's a unique, constant string.

  2. The impactDetail field is a human-readable description that details what is currently non-functional.

  3. The recommendsStatus field is NOT the status returned by the component but is rather the status that this impact contributes to the calculation of the top-level status. This is a subtle difference, but in many cases it allows the top-level status to be calculated as the most severe of the impacts' recommendsStatus values.
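
The calculation described in note 3 could be sketched roughly as follows. This is only an illustration: the severity ordering (pass, then warn, then fail), the function name top_level_status, and the choice to return pass when there are no impacts are assumptions, not part of the proposal.

```python
# Assumed severity ordering of status strings, least to most severe.
SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


def top_level_status(impacts):
    """Return the most severe recommendsStatus among the impacts.

    Returns "pass" when the impacts list is empty (an assumption).
    Raises KeyError if an impact carries an unknown status string.
    """
    return max(
        (impact["recommendsStatus"] for impact in impacts),
        key=SEVERITY.__getitem__,
        default="pass",
    )
```

An implementation might still override this result in special cases, since the note only says the approach works "in many cases".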

Using the account UI healthcheck as an example, a simultaneous Kerberos outage and SMS outage would produce the following impacts section:

  ...
  "impacts": [
    {
      "impactId": "47619208-2556-41a4-a72c-801209b8ed9e",
      "checkKey": "kerberos:connection",
      "impactDetail": "The user will be unable to change their password",
      "recommendsStatus": "warn"
    },
    {
      "impactId": "85ad165d-9edf-4da5-8d95-93d299673680",
      "checkKey": "sms:connection",
      "impactDetail": "The user will be unable to perform self-service account recovery",
      "recommendsStatus": "warn"
    }
  ],
  ...
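
One way a service might produce such a section is to keep a fixed table of impact definitions keyed by checkKey, so that each impactId stays constant across responses. The sketch below is hypothetical: IMPACT_REGISTRY, build_impacts, and the shape of the check_statuses argument are illustrative names and assumptions, and only the IDs and messages are taken from the example above.

```python
# Hypothetical registry of impact definitions, keyed by checkKey.
# Keeping the impactId here (rather than generating it per response)
# is what makes it a unique, constant string.
IMPACT_REGISTRY = {
    "kerberos:connection": {
        "impactId": "47619208-2556-41a4-a72c-801209b8ed9e",
        "impactDetail": "The user will be unable to change their password",
        "recommendsStatus": "warn",
    },
    "sms:connection": {
        "impactId": "85ad165d-9edf-4da5-8d95-93d299673680",
        "impactDetail": "The user will be unable to perform self-service account recovery",
        "recommendsStatus": "warn",
    },
}


def build_impacts(check_statuses):
    """Build the impacts list from a mapping of checkKey -> component status.

    Checks that pass, or that have no registered impact, contribute nothing.
    """
    impacts = []
    for check_key, status in check_statuses.items():
        entry = IMPACT_REGISTRY.get(check_key)
        if status == "pass" or entry is None:
            continue
        impacts.append({
            "impactId": entry["impactId"],
            "checkKey": check_key,
            "impactDetail": entry["impactDetail"],
            # The recommended status, not the component's own status.
            "recommendsStatus": entry["recommendsStatus"],
        })
    return impacts
```

Note that a component reporting fail still yields an impact recommending warn here, which is the distinction the recommendsStatus field is meant to capture.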

Receiving these impacts allows the UI to adopt a couple of very useful behaviors:

  1. The UI can use the impactDetail information to tell the user precisely which functions are not available (as simple as putting a toast at the top of a screen).

  2. The UI can use the unique, constant impactId value to conditionally disable or hide the control elements for those functions.
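
As a rough sketch of those two behaviors, a client could derive both the user-facing notices and the set of controls to disable from the impacts list. The control names, the CONTROLS_BY_IMPACT table, and the degradation_plan function are all hypothetical; a real UI would do this in its own framework.

```python
# Hypothetical mapping from constant impactId values to UI control names.
CONTROLS_BY_IMPACT = {
    "47619208-2556-41a4-a72c-801209b8ed9e": {"change-password-button"},
    "85ad165d-9edf-4da5-8d95-93d299673680": {"account-recovery-link"},
}


def degradation_plan(health_response, controls_by_impact):
    """Return (notices, disabled_controls) for a parsed health response.

    notices: impactDetail strings to show the user, e.g. in a toast.
    disabled_controls: names of controls to disable or hide.
    """
    impacts = health_response.get("impacts", [])
    notices = [impact["impactDetail"] for impact in impacts]
    disabled = set()
    for impact in impacts:
        # Impacts with no mapped controls only produce a notice.
        disabled |= controls_by_impact.get(impact["impactId"], set())
    return notices, disabled
```

Because the lookup is keyed on impactId rather than on the message text, the impactDetail wording can change without breaking the UI.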

Two more benefits of this format are:

  1. The top-level status field can be calculated as warn, the most severe recommendsStatus among the impacts of the two underlying failures.

  2. A human looking at the health response object can determine which checks contributed to the calculation of the top-level status.