improve Prometheus integration

thesuperzapper opened this issue 2 years ago · 0 comments

While we currently have limited support for Prometheus through our serviceMonitor.* and prometheusRule.* values, there is a lot of room to improve the integration.

Tasks:

Implement prometheus/statsd_exporter
1. OPTION 1: as a sidecar container for Airflow Pods
  - This option is recommended by the creators of statsd_exporter (we need to check why, but it's probably so that Prometheus associates the metrics with the actual Pod that generates them)
  - I am not sure if all Pods will need the sidecar or just the scheduler
  - We would then configure AIRFLOW__METRICS__STATSD_HOST to be localhost
  - We would annotate the Pods with the sidecar to have prometheus.io/scrape: "true" and prometheus.io/port: "xxxx"
2. OPTION 2: as a central deployment
  - This would reduce the number of containers
  - We would then configure AIRFLOW__METRICS__STATSD_HOST to be the service of this deployment (this is possibly a security risk, as other Pods could send bad data unless a NetworkPolicy restricts access)
  - We would annotate the Deployment Pods to have prometheus.io/scrape: "true" and prometheus.io/port: "xxxx"
Implement prometheus-community/pgbouncer_exporter
- This is probably best implemented as a sidecar of our PgBouncer Deployment
- We would annotate the Deployment Pods to have prometheus.io/scrape: "true" and prometheus.io/port: "xxxx"
Consider what to do with the Prometheus Operator resource values
- The existing serviceMonitor.* and prometheusRule.* values could be automatically configured (but there is an argument that these are configs for the user's Prometheus, and should not be managed by the chart).
- For some reason, these resources are currently stored under the template/webserver/ folder (when they are not really specific to the webserver)
Update the docs about Prometheus
- Update the "How to integrate airflow with Prometheus?" page
- Add some example Grafana dashboards for airflow
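
For OPTION 1 of the statsd_exporter task above, the sidecar layout might look roughly like the Pod fragment below. This is only a sketch, not the chart's actual template: the Pod name, container names, and untagged images are hypothetical. Ports 9125 (StatsD input) and 9102 (metrics endpoint) are statsd_exporter's defaults and are assumed here in place of the "xxxx" placeholder. The prometheus.io/* annotations only take effect if the user's Prometheus scrape config honors them.

```yaml
# Hypothetical Pod fragment for OPTION 1 (sidecar); all names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: airflow-scheduler-example      # hypothetical name
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"         # assumed: statsd_exporter's default metrics port
spec:
  containers:
    - name: airflow-scheduler          # hypothetical container name
      image: apache/airflow            # pin a real tag in practice
      env:
        - name: AIRFLOW__METRICS__STATSD_ON
          value: "True"
        - name: AIRFLOW__METRICS__STATSD_HOST
          value: "localhost"           # the sidecar shares the Pod's network namespace
        - name: AIRFLOW__METRICS__STATSD_PORT
          value: "9125"                # assumed: statsd_exporter's default StatsD port
    - name: statsd-exporter            # the sidecar
      image: prom/statsd-exporter      # pin a real tag in practice
      ports:
        - name: statsd
          containerPort: 9125
          protocol: UDP                # Airflow emits StatsD metrics over UDP
        - name: metrics
          containerPort: 9102
```

For OPTION 2, the same environment variables would apply, but AIRFLOW__METRICS__STATSD_HOST would point at the Service of the central deployment, and the annotations would move to that deployment's Pods.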

Completing these tasks should remove the need for the following issues:

#326
#274