alexliesenfeld/health

Make health check runnable by name

Closed this issue · 10 comments

PV99 commented

Currently, the Checker interface exposes a method Check(ctx context.Context) CheckerResult which runs all synchronous health checks. It should be possible to add a new method CheckByName(ctx context.Context, name string) (CheckResult, CheckerResult) which allows execution of a single synchronous check by name.

PV99 commented

I'm happy to take this on, can you assign this to me?

Hi @PV99 , thanks for creating this issue. Could you please elaborate a little bit what your use case is?

PV99 commented

Hey @alexliesenfeld. Thanks for building out such an awesome framework! It's been a pleasure to use.

Here's some more about the use cases. A first use case is for checks that we don't want run by default. Sometimes, permissions to use the api's defined in a check change from environment to environment (e.g. permissions to access the redis-configuration api to execute the redis check-aof function may change from one customer to another). In these cases, these checks should not be run by default, but should be possible to trigger if desired.

A second use case is if we have expensive checks which we do not want running periodically. In these cases, these checks should not be executed by default, but it should be possible to trigger if desired.

PV99 commented

Hey @alexliesenfeld, here are some rough thoughts on what the design would look like. Hopefully, the design doesn't break the API in any way.

Base Design

  1. Update the Checker Interface with a new method:
    CheckByName(ctx context.Context, name string) (CheckResult, CheckerResult, error)

  2. Implement this method in defaultChecker:

// CheckByName implements Checker.CheckByName. Please refer to Checker.CheckByName for more information.
func (ck *defaultChecker) CheckByName(ctx context.Context, name string) (CheckResult, CheckerResult, error) {
        var (
		resChan            = make(chan checkResult, 1)
	)
        ck.mtx.Lock()
	defer ck.mtx.Unlock()

	ctx, cancel := context.WithTimeout(ctx, ck.cfg.timeout)
	defer cancel()

        check, ok := ck.cfg.checks[name]
        if not ok { 
                return checkResult{}, ck.mapStateToCheckerResult(), fmt.Errorf("Check with name: '%s' is not registered in Checker")
        }
        checkState := ck.state.CheckState[name]
        if !isPeriodicCheck(check) && isCacheExpired(ck.cfg.cacheTTL, &checkState) {
                
                withCheckContext(ctx, check, func(ctx context.Context) {
			_, checkState := executeCheck(ctx, &ck.cfg, check, checkState)
                         resChan <- checkResult{check.Name, checkState}
		})
        }
        checkResult := <- resChan 
        ck.updateState(ctx, checkResult)
	return checkResult, ck.mapStateToCheckerResult(), nil
}

This feature alone would be a huge win. If we wanted, we could run most checks via Checker.Check. But if there are checks we don't want to run by default, we could maintain a second instance of a Checker with manual checks and run each check as desired via Checker.CheckByName.

Extensions

Maintaining a second instance of a Checker is tedious for users of this API. To take this design even further, we could also introduce a notion of "manual" checks which don't get triggered by default with Checker.Check, but can be triggered by Checker.CheckByName. This would allow users of the api to only maintain one instance of a Checker.

  1. Update the Check struct with a manual: bool field
Check struct {
		// The Name must be unique among all checks. Name is a required attribute.
		Name string // Required

	         ... 
	         
		initialDelay   time.Duration
		
		isManual      bool
	}
  1. Introduce a new WithManualCheck method: ``
// WithManualCheck adds a new health check that contributes to the overall service availability status.
 // This check will NOT be triggered when Checker.Check is called, but may be invoked manually via 
// Checker.CheckByName. 
func WithManualCheck(check Check) CheckerOption {
	return func(cfg *checkerConfig) {
                 check.isManual = true
		cfg.checks[check.Name] = &check
	}
}
  1. Add an IsManualCheck method:
func isManualCheck(check *Check) bool {
	return check.isManual
}
  1. Update runSynchronousChecks to run all synchronous checks sans manual checks
func (ck *defaultChecker) runSynchronousChecks(ctx context.Context) {
	...
	for _, c := range ck.cfg.checks {
		checkState := ck.state.CheckState[c.Name]
		if !isPeriodicCheck(c) && !isManualCheck(c) && isCacheExpired(ck.cfg.cacheTTL, &checkState) {
			...
		}
	}
        ...
}
  1. Finally, update the interface so that Checker.Check reflects that it runs synchronous checks sans manual checks.
// Check runs all synchronous (i.e., non-periodic) check functions sans
// manual check functions. It returns the aggregated health status 
// (combined from the results of this executions synchronous checks and 
// the previously reported results of asynchronous/periodic checks. This 
// function expects a context, that may contain deadlines to which will be 
// adhered to. The context will be passed to all downstream calls (such as 
// listeners, component check functions, and interceptors).
Check(ctx context.Context) CheckerResult

Let me know of the base design and the extension make sense (even just implementing the base design would be helpful).

What would you think about adding a Skipper to a Check config instead (similar how it's done in Echo, see here: https://echo.labstack.com/middleware/#skipping-middleware). The skipper could evaluate authorization information from an HTTP request and see if a specific check allowed be returned in the aggregated result.

Realistically, the user would probably need a custom middleware that extracts user specific authorization details from the HTTP request, put it into the context, where the skipper could read from to make a decision:

Example:

 // this middleware puts user details into the context, say a user role
 health.WithMiddleware(extractUserRoleFromRequestMiddleware),
//...

health.WithCheck(health.Check{
    Name: "maybeSkippedCheck",
    Check: func(ctx context.Context) error {
        // do the check
    },
    Skipper: func(context.Context ctx) bool {
       // if a role is in the context and role equals admin, return false (= do not skip)
       // else return true (=skip) 
    }
}),
PV99 commented

Hey Alex, a Skipper seems like a great solution. I'd be happy to implement that feature. If we merge this feature in, would you be willing to publish a new version of this package with the feature?

@PV99 Yes, bu I think it will require a bit of refactoring. A few things that come into my mind are:

  • For both, synchronous and periodic checks, a Skipper should probably skip the result in the aggregated result. For synchronous checks, a skipper probably should also skip the execution of the synchronous check.
  • At the moment, this library keeps a global state with an aggregated result that is computed based of all individually configured check functions. If the health endpoint would now get requests that only include a subset of the checks, a global aggregated state may not be representative for the response (imagine a check A is "down", but the HTTP request skips A, the cached overall aggregated result is still "down"). Hence, we would need to exclude skipped checks from the aggregated result, but this means that (1) caching the overall result is not useful anymore and (2) the overall result needs to be reevaluated on every HTTP request.
  • This library supports Interceptors, that may listen to global status changes (e.g., to log global system health status changes). When we would have a Skipper, it's unclear at the moment what the best behaviour would be when one request evaluates checks A,B,C and another request D,E,F with different aggregated results.

It may be an alternative to use different health endpoints that expose different checks (e.g., one "admin" endpoint, and one generic endpoint). This would mean that the checks need to be executed twice though, but that may not be a problem.

PV99 commented

@alexliesenfeld thanks for the thorough thoughts, these are all great points.

  1. "For both, synchronous and periodic checks, a Skipper should probably skip the result in the aggregated result. For synchronous checks, a skipper probably should also skip the execution of the synchronous check."

    • Mostly agree, though wouldn't we also skip the execution of the periodic check if the skipper on the period check is firing? As a separate thought, maybe the package should only offers Skippers on synchronous checks (e.g. WithSynchronousCheckSkipper). Skips on period checks may be useful, but the implications on the API will be messy.
  2. At the moment, this library keeps a global state with an aggregated result that is computed based of all individually configured check functions. If the health endpoint would now get requests that only include a subset of the checks, a global aggregated state may not be representative for the response (imagine a check A is "down", but the HTTP request skips A, the cached overall aggregated result is still "down"). Hence, we would need to exclude skipped checks from the aggregated result, but this means that (1) caching the overall result is not useful anymore and (2) the overall result needs to be reevaluated on every HTTP request.

    • Yeah, this one is a little more unclear. I think having an aggregated state is still useful, developers may want to have global status of all checks (regardless of if they were skipped) and then have the flexibility to post process / filter as necessary. In this scenario, I think introducing a field "SkippedOnLastRun: bool" to CheckResult and another field "AvailabilityStatusUnskippedChecks: bool" to CheckerResult would generate clear expectations in the API. If a developer did skip some checks, they'd probably be conscious enough to know whether they want to use "AvailabilityStatus" or "AvailabilityStatusUnskippedChecks"
  3. This library supports Interceptors, that may listen to global status changes (e.g., to log global system health status changes). When we would have a Skipper, it's unclear at the moment what the best behaviour would be when one request evaluates checks A,B,C and another request D,E,F with different aggregated results.

    • I think the best behavior here is to still wait for global status changes (e.g. if checks A, B, C, E, F, and D are all down, then one request to checks D, E, F passes, we don't run the global status interceptor). We could create another interceptor for GlobalStatusUnskippedChecks. Preferably, the developer should just be able to handle these cases creatively through the individual check interceptors.
  4. It may be an alternative to use different health endpoints that expose different checks (e.g., one "admin" endpoint, and one generic endpoint). This would mean that the checks need to be executed twice though, but that may not be a problem.

    • Thanks for suggesting this alternative. Sadly, it won't work for the use case. The use case requires fine grain control to only run one check at a time (e.g if there are many admin checks and expensive synchronous checks, we can't run them all together at once). This can only be accomplished by being able to run one check at a time (which can also be implemented via skippers).

This issue is stale because it has been open for 30 days with no activity.

This issue was closed because it has been inactive for 14 days since being marked as stale.