fgouteroux/terraform-provider-mimir

mimir_alertmanager_config doesn't support non-configured Alertmanager

shybbko opened this issue · 14 comments

If the mimir_alertmanager_config resource is deployed against a new Alertmanager instance that has never received an initial config (which is the default case e.g. for the Mimir Helm chart), an error is reported: "the Alertmanager is not configured". One workaround is to manually inject some placeholder config into Alertmanager via Mimirtool before running Terraform, but that is a bit counterproductive: why configure something before the configuration can be deployed? :D Ideally mimir_alertmanager_config would work fine against a fresh, unconfigured Alertmanager instance.

The "old" Cortex provider's (https://registry.terraform.io/providers/inuits/cortex/latest/docs) resource cortex_alertmanager works fine against both unconfigured Cortex and Mimir in the scenario described.

One more thing: when I first hit this issue I worked around it by manually injecting some placeholder config into Alertmanager via Mimirtool, but then ran into the error "invalid character 'e' in literal true (expecting 'r')", which I didn't debug. It might be something in my config's syntax (although I copied it from the provider's example). This part is just FYI.

@shybbko I never had this issue. Which version of Mimir are you using?

I never tested with Helm, but it should work. The acceptance tests didn't report this issue, so I need more details to be able to reproduce it.

@shybbko could you please share your Terraform version, the resource declaration and the error log?

I expect (although I haven't verified) that the non-Helm Mimir might come with Alertmanager already bootstrapped with some default config. The Helm one does not: https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml#L128

I'll be able to share the other details tomorrow. I encountered this issue last week while migrating from Cortex to Mimir, and the success of the migration itself took priority, so I can only hope I'll be able to dig out all the details required. Fingers crossed ;)

Okay, so I've got stable reproduction:

Step 1: Deploy Mimir via Helm (no configuration provided, everything is at defaults):

resource "kubernetes_namespace" "mimir-namespace2" {
  metadata {
    name = "mimirtestns"
  }
}

resource "helm_release" "mimir2" {
  name       = "mimirtest"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "mimir-distributed"
  version    = "3.3"
  timeout    = 1800
  namespace  = "mimirtestns"

  depends_on = [
    kubernetes_namespace.mimir-namespace2
  ]
}
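
Not part of the original steps, but a quick way to confirm the release actually came up before continuing is to list the pods in the namespace created above:

kubectl get pods -n mimirtestns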

Step 2: port-forward alertmanager (to provide network connectivity for API calls, as we're skipping ingress bits):

kubectl port-forward mimirtest-alertmanager-0 80:8080
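
As a sanity check of the port-forward (my addition, not part of the original report), the API endpoint the provider talks to can be hit directly; against a fresh instance this is expected to return the same 412 seen in Step 4 (the X-Scope-OrgID header carries the tenant, matching org_id = "fake" below):

curl -i -H "X-Scope-OrgID: fake" http://127.0.0.1/api/v1/alerts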

Step 3: try to deploy some basic alertmanager config:

resource "mimir_alertmanager_config" "testconfig" {
  route {
    group_by = ["..."]
    group_wait = "30s"
    group_interval = "5m"
    repeat_interval = "1h"
    receiver = "pagerduty"
  }
  receiver {
    name = "pagerduty"
    pagerduty_configs {
      routing_key = "secret"
      details = {
        environment = "dev"
      }
    }
  }
}

terraform {
  required_providers {
    mimir = {
      source  = "fgouteroux/mimir"
      version = "0.0.6"
    }
  }
  required_version = ">= 0.14"
}

provider "mimir" {
  ruler_uri        = "http://127.0.0.1/prometheus"
  alertmanager_uri = "http://127.0.0.1/alertmanager"
  org_id           = "fake"
}

Step 4: observe the error:

│ Error: Cannot create alertmanager config Unexpected response code '412': the Alertmanager is not configured
│
│
│   with module.observability-mimir.mimir_alertmanager_config.testconfig,
│   on ../../modules/observability-mimir/mimir_test.tf line 22, in resource "mimir_alertmanager_config" "testconfig":
│   22: resource "mimir_alertmanager_config" "testconfig" {
│

What I did next was to initialize Alertmanager with some placeholder config using Mimirtool (https://grafana.com/docs/mimir/latest/operators-guide/tools/mimirtool/#alertmanager):

Step 5: download Mimirtool

wget https://github.com/grafana/mimir/releases/download/mimir-2.5.0/mimirtool-linux-amd64
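
The binary downloaded this way usually needs the execute bit set before it can be run in Step 7 (a small step I'm adding, assuming a plain wget download):

chmod +x mimirtool-linux-amd64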

Step 6: create some basic config file:

# cat alertmanager_initial_config.yaml
route:
  receiver: "example_receiver"
  group_by: ["example_groupby"]
receivers:
  - name: "example_receiver"

Step 7: deploy the config:

./mimirtool-linux-amd64 alertmanager load alertmanager_initial_config.yaml --address http://127.0.0.1 --id fake
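
To double-check that the config actually landed for the tenant (my addition; the get subcommand is documented alongside load in the Mimirtool docs linked above):

./mimirtool-linux-amd64 alertmanager get --address http://127.0.0.1 --id fake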

Step 8: try to redeploy the Terraform configs:

│ Error: Cannot create alertmanager config Unexpected response code '400': {"status":"error","errorType":"bad_data","error":"invalid character 'e' in literal true (expecting 'r')"}
│
│   with module.observability-mimir.mimir_alertmanager_config.testconfig,
│   on ../../modules/observability-mimir/mimir_test.tf line 22, in resource "mimir_alertmanager_config" "testconfig":
│   22: resource "mimir_alertmanager_config" "testconfig" {

My Terraform version is the following (the issue still occurs if I use TF 1.3.6, which I've tried; I'm just using an older version for compatibility reasons with the rest of my stack):

# terraform --version
Terraform v1.0.10
on linux_amd64
+ provider registry.terraform.io/cloudflare/cloudflare v3.16.0
+ provider registry.terraform.io/fgouteroux/mimir v0.0.6
+ provider registry.terraform.io/hashicorp/archive v2.2.0
+ provider registry.terraform.io/hashicorp/external v2.2.3
+ provider registry.terraform.io/hashicorp/google v3.90.1
+ provider registry.terraform.io/hashicorp/google-beta v4.46.0
+ provider registry.terraform.io/hashicorp/helm v2.8.0
+ provider registry.terraform.io/hashicorp/http v2.2.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.16.1
+ provider registry.terraform.io/hashicorp/null v3.2.1
+ provider registry.terraform.io/hashicorp/random v3.4.3
+ provider registry.terraform.io/hashicorp/time v0.9.1
+ provider registry.terraform.io/hashicorp/tls v4.0.4
+ provider registry.terraform.io/inuits/cortex v0.0.3

Your version of Terraform is out of date! The latest version
is 1.3.6. You can update by downloading from https://www.terraform.io/downloads.html

@shybbko grafana/mimir#3360 added a fallback Alertmanager config, but I don't see it in the 3.3 Helm chart version.

That's correct, that's not included in Helm 3.3 / Mimir 2.4.

It might be that the issue described cannot occur with Helm chart 4.0 / Mimir 2.5 (I didn't verify that though), but in my personal opinion the provider / Terraform resource that is supposed to configure Alertmanager shouldn't require Alertmanager to already be configured before it can be configured.

The legacy Cortex provider (https://registry.terraform.io/providers/inuits/cortex/latest/docs) works fine with an unconfigured Mimir Alertmanager.

OK, I'm able to reproduce it without a Helm deployment; I will investigate.

@shybbko the alertmanager_uri is not right: remove the /alertmanager suffix (the default prefix for the Alertmanager UI) and it will work fine.

Like this:

provider "mimir" {
  ruler_uri        = "http://127.0.0.1/prometheus"
  alertmanager_uri = "http://127.0.0.1"
  org_id           = "fake"
}
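
For reference (not part of the original reply), with this base URI the stored config can also be fetched by hand, assuming the port-forward from Step 2 is still running:

curl -s -H "X-Scope-OrgID: fake" http://127.0.0.1/api/v1/alerts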

Thanks for verifying this!

So I think the logic of the provider config might be misleading: considering that http://127.0.0.1 is the Mimir URI, either the ruler URI should not require the /prometheus suffix or the Alertmanager URI should accept /alertmanager, as the two now require different logic?

Maybe it's worth separating suffixes from URIs?

This is related to the Grafana Mimir HTTP API.
For the ruler there is a prometheus-http-prefix, while for the Alertmanager there is no HTTP prefix for the API.

Ruler API (default prefix: /prometheus):

  • GET /<prometheus-http-prefix>/config/v1/rules/{namespace}/{groupName}
  • POST /<prometheus-http-prefix>/config/v1/rules/{namespace}
  • DELETE /<prometheus-http-prefix>/config/v1/rules/{namespace}/{groupName}

Alertmanager Configuration API:

  • GET /api/v1/alerts
  • POST /api/v1/alerts
  • DELETE /api/v1/alerts

But you can decide to override the ruler prefix path to / to unify the API paths; you just have to take care of which Grafana Mimir components are enabled in the same process.
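
As an illustration (my sketch against the port-forward used earlier in this thread, not an official example), with the default prefixes the two configuration APIs end up at different paths under the same base URI:

# Alertmanager configuration API sits at the root
curl -s -H "X-Scope-OrgID: fake" http://127.0.0.1/api/v1/alerts

# Ruler configuration API sits under the prometheus-http-prefix
curl -s -H "X-Scope-OrgID: fake" http://127.0.0.1/prometheus/config/v1/rules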

I think that overriding paths on Mimir's end might lead to unforeseen results. Let's not go there.

For the Alertmanager there is no HTTP prefix for the API.

Correct, but this might be confusing: the default Alertmanager prefix for the API is /, while for the Alertmanager UI it's /alertmanager. Actually it's the nginx living under (in this scenario) 127.0.0.1 that proxies everything to the proper pods / services:

          # Alertmanager endpoints
          location {{ template "mimir.alertmanagerHttpPrefix" . }} {
            proxy_pass      http://{{ template "mimir.fullname" . }}-alertmanager-headless.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.serverHttpListenPort" . }}$request_uri;
          }
          location = /multitenant_alertmanager/status {
            proxy_pass      http://{{ template "mimir.fullname" . }}-alertmanager-headless.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.serverHttpListenPort" . }}$request_uri;
          }
          location = /api/v1/alerts {
            proxy_pass      http://{{ template "mimir.fullname" . }}-alertmanager-headless.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.serverHttpListenPort" . }}$request_uri;
          }
          # Ruler endpoints
          location {{ template "mimir.prometheusHttpPrefix" . }}/config/v1/rules {
            proxy_pass      http://{{ template "mimir.fullname" . }}-ruler.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.serverHttpListenPort" . }}$request_uri;
          }
          location {{ template "mimir.prometheusHttpPrefix" . }}/api/v1/rules {
            proxy_pass      http://{{ template "mimir.fullname" . }}-ruler.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.serverHttpListenPort" . }}$request_uri;
          }
          location {{ template "mimir.prometheusHttpPrefix" . }}/api/v1/alerts {
            proxy_pass      http://{{ template "mimir.fullname" . }}-ruler.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.serverHttpListenPort" . }}$request_uri;
          }
          location = /ruler/ring {
            proxy_pass      http://{{ template "mimir.fullname" . }}-ruler.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.serverHttpListenPort" . }}$request_uri;
          }
          # Rest of {{ template "mimir.prometheusHttpPrefix" . }} goes to the query frontend
          location {{ template "mimir.prometheusHttpPrefix" . }} {
            proxy_pass      http://{{ template "mimir.fullname" . }}-query-frontend.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.serverHttpListenPort" . }}$request_uri;
          }
          }

So I think the provider might benefit from this tiny redesign (or from updated docs emphasising the importance of the prefixes and which default ones to use).
My three cents anyway. Won't push you to make changes though ;) Thanks for your time!

Yes, I will add a warning about the provider URI paths. If you agree I will close this issue.

Certainly. Thanks again!