terraform-google-modules/terraform-google-vault

Add auto-healing support

jeffmccune opened this issue · 1 comments

This ticket tracks comments and design decisions regarding auto-healing support for the Vault cluster managed instance group (MIG). I'm planning on submitting targeted patches extracted from https://github.com/openinfrastructure/terraform-google-vault/tree/ois/1.0.0/modules/55_vault_cluster

Proposal

  1. Add an auto-heal-specific health check.
  2. Add an input variable, autoheal_health_check_provided, which defaults to false, so that callers who need to tune the health check behavior can supply their own. If true, the module uses var.autoheal_health_check_self_link instead of managing a health check.
  3. Modify the instance group to use the health check.
  4. Verify rolling updates work as expected.
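
Steps 2 and 3 could be sketched roughly as follows. The two variable names come from the proposal above. Everything else is an assumption for illustration and not taken from the module: the types and descriptions, the local.autoheal_health_check helper, the resource name google_compute_region_instance_group_manager.vault, and the initial_delay_sec value. The instance group block is a fragment showing only the new policy, not a complete resource.

```hcl
# Step 2: inputs that let a caller supply a pre-tuned health check.
# Types and descriptions are assumptions; only the names come from the proposal.
variable "autoheal_health_check_provided" {
  type        = bool
  default     = false
  description = "If true, use autoheal_health_check_self_link instead of a module-managed health check."
}

variable "autoheal_health_check_self_link" {
  type        = string
  default     = ""
  description = "Self link of an externally managed health check used for auto-healing."
}

locals {
  # Hypothetical helper. Assumes the managed health check is declared with
  #   count = var.autoheal_health_check_provided ? 0 : 1
  # one() needs Terraform 0.15 or later and returns null for an empty list.
  autoheal_health_check = (
    var.autoheal_health_check_provided
    ? var.autoheal_health_check_self_link
    : one(google_compute_health_check.autoheal[*].self_link)
  )
}

# Step 3: attach the check to the MIG (fragment; resource name is assumed).
resource "google_compute_region_instance_group_manager" "vault" {
  # ... existing arguments unchanged ...

  auto_healing_policies {
    health_check      = local.autoheal_health_check
    initial_delay_sec = 300 # assumed grace period for Vault to start
  }
}
```

Older releases of the Google provider may expose auto_healing_policies only through google-beta, so the provider version is worth checking before relying on this block.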

Two health checks will be used. The existing check is intended for signaling the LB. The new check is intended for signaling the MIG resource to auto-heal.

Using two health checks lets the LB direct traffic away from a Vault server without the MIG auto-healing that server. This is useful to avoid sending traffic to a standby node, which forwards requests to the active node anyway (ref).

Add the following health check to vault-cluster/main.tf:

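# Health check consumed by the MIG for auto-healing, separate from the LB check.
resource "google_compute_health_check" "autoheal" {
  project = var.project_id
  name    = "vault-health-autoheal${var.name_suffix}"

  # Probe every 10s; two consecutive failures mark the instance unhealthy.
  check_interval_sec  = 10
  timeout_sec         = 5
  healthy_threshold   = 1
  unhealthy_threshold = 2

  https_health_check {
    port         = var.vault_port
    request_path = var.hc_autoheal_request_path
  }
}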

var.hc_autoheal_request_path defaults to "/v1/sys/health?uninitcode=200&standbyok=true" so that standby nodes are not marked as unhealthy. Without standbyok=true, standby nodes return 429 rather than 200, and the MIG would treat them as unhealthy and recreate them.
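
A minimal sketch of how that variable might be declared in the module's variables file. The name and default come from the text above; the type and description are assumptions. The comments list the default status codes of Vault's /v1/sys/health endpoint to show what each query parameter changes.

```hcl
# Default /v1/sys/health responses, before query parameters are applied:
#   200 active, 429 unsealed standby, 501 uninitialized, 503 sealed.
# uninitcode=200 makes an uninitialized node return 200.
# standbyok=true makes a standby return the active code (200).
# A sealed node still returns 503 and would be treated as unhealthy.
variable "hc_autoheal_request_path" {
  type        = string
  default     = "/v1/sys/health?uninitcode=200&standbyok=true"
  description = "Request path used by the auto-healing health check."
}
```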
