fluxcd/flagger

Issue with Canary Deployment: Metric Not Reporting

Closed this issue · 4 comments

I'm implementing a canary deployment using Flagger to monitor my application. The goal is to monitor the success rate of HTTP requests to a health endpoint (/ping). However, despite configuring the request-success-rate metric, Flagger isn't sending any metrics or requests to the endpoint. I am using the Traefik provider.

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: test-service
  namespace: test
spec:
  provider: traefik
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-service
  progressDeadlineSeconds: 300
  service:
    port: 3000
    targetPort: 3000
  analysis:
    interval: 10s
    threshold: 10
    maxWeight: 50
    stepWeight: 5
    metrics:
      - name: request-success-rate
        interval: 30s
        thresholdRange:
          min: 99
        failureThreshold: 5
        query: "http://test-service:3000/ping"
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 10s
        metadata:
          type: bash
          cmd: "curl -X GET http://test-service:3000/ping"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          type: cmd
          cmd: "hey -z 10s -q 10 -c 2 http://test-service:3000/ping"
          logCmdOutput: "true"
{{- end }}

I tested the curl and hey commands from inside the load tester pod and they work fine. But when I check my canary, it goes into a failed status after initialization:

Events:
  Type     Reason  Age                    From     Message
  ----     ------  ----                   ----     -------
  Warning  Synced  4m19s                  flagger  test-service-primary.test not ready: waiting for rollout to finish: observed deployment generation less than desired generation
  Warning  Synced  3m29s (x5 over 4m9s)   flagger  test-service-primary.test not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available
  Normal   Synced  3m19s (x7 over 4m19s)  flagger  all the metrics providers are available!
  Normal   Synced  3m19s                  flagger  Initialization done! test-service.test
  Normal   Synced  2m49s                  flagger  New revision detected! Scaling up test-service.test
  Warning  Synced  119s (x5 over 2m39s)   flagger  canary deployment test-service.test not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available
  Normal   Synced  109s                   flagger  Starting canary analysis for test-service.test
  Normal   Synced  109s                   flagger  Pre-rollout check acceptance-test passed
  Normal   Synced  109s                   flagger  Advance test-service.test canary weight 5
  Warning  Synced  89s (x2 over 99s)      flagger  Halt advancement no values found for traefik metric request-success-rate probably test-service.test is not receiving traffic: running query failed: no values found

I am not sure if I am missing something.

Could you check whether the required metrics are showing up in your Prometheus server?

@aryan9600 I do not have a Prometheus server; I am using metrics-server. After reading more about canary analysis, I think Prometheus is a requirement for this setup. However, I am running the podinfo canary (https://github.com/stefanprodan/podinfo) on the same setup and it works fine even without Prometheus. I am not sure why that works and my custom service does not.

The goal is to monitor the success rate of HTTP requests to a health endpoint (/ping).

The query field is for specifying a PromQL query; see the docs here: https://docs.flagger.app/usage/metrics#prometheus

If you don't use Prometheus, delete the metrics field; the webhooks are enough to test the ping endpoint.
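
For contrast, here is a rough sketch of what a PromQL-based check could look like if a Prometheus server were available. It assumes Flagger's MetricTemplate resource and Traefik's `traefik_service_requests_total` metric. The template name, the Prometheus address, and the `service` label pattern are hypothetical and would need to be adapted to the actual cluster.

```yaml
# Hypothetical MetricTemplate: the name, address and label selectors are assumptions.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: ping-success-rate
  namespace: test
spec:
  provider:
    type: prometheus
    # Assumed in-cluster Prometheus address; replace with the real one.
    address: http://prometheus.monitoring:9090
  # PromQL expression: percentage of non-5xx responses seen by Traefik for the canary service.
  query: |
    sum(rate(traefik_service_requests_total{service=~"{{ namespace }}-{{ target }}-canary-.*", code!~"5.."}[{{ interval }}]))
    /
    sum(rate(traefik_service_requests_total{service=~"{{ namespace }}-{{ target }}-canary-.*"}[{{ interval }}]))
    * 100
---
# Fragment of the Canary's spec.analysis that references the template above.
metrics:
  - name: ping-success-rate
    templateRef:
      name: ping-success-rate
      namespace: test
    thresholdRange:
      min: 99
    interval: 30s
```

The point of the sketch is that `query` holds a PromQL expression evaluated against the metrics provider, not a URL that Flagger would call.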

@stefanprodan the podinfo canary that you created works fine with my setup (without Prometheus). I am just wondering how it works with the metrics field. And just to confirm, are you saying that I should remove the entire block below?

metrics:
  - name: request-success-rate
    interval: 30s
    thresholdRange:
      min: 99
    failureThreshold: 5
    query: "http://test-service:3000/ping"
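
Following the suggestion above to delete the metrics field, the analysis section with only the webhooks kept might look like the sketch below. It reuses the values from the manifest in the original post and has not been tested.

```yaml
# Fragment of the Canary spec: the analysis section with the metrics block removed.
analysis:
  interval: 10s
  threshold: 10
  maxWeight: 50
  stepWeight: 5
  webhooks:
    # Runs before traffic is shifted to the canary.
    - name: acceptance-test
      type: pre-rollout
      url: http://flagger-loadtester.test/
      timeout: 10s
      metadata:
        type: bash
        cmd: "curl -X GET http://test-service:3000/ping"
    # Generates traffic against /ping during the analysis.
    - name: load-test
      type: rollout
      url: http://flagger-loadtester.test/
      timeout: 5s
      metadata:
        type: cmd
        cmd: "hey -z 10s -q 10 -c 2 http://test-service:3000/ping"
        logCmdOutput: "true"
```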