sample alerting rules?
Closed this issue · 2 comments
what's a good starter pack of alerting rules for this? i remember seeing a metric about exceptions, for example, but now that we fixed and deployed that, i can't find it anymore...
is there a list of metrics somewhere we can peruse?
wow, i must admit that ("ask $LLM") is a first on github for me.
i was hoping someone might already have had some rules in production, or an index of metrics. but since you asked, let's see what's actually out there so other people looking for this information don't have to go through what i did and waste all this power generating a fresh new answer... In a spirit of collaboration:
When using django-prometheus to monitor your Django application, an effective set of alerting rules can help you catch issues early and maintain the health of your application. django-prometheus exposes various metrics that can be utilized for setting up these alerts.
Basic Monitoring Metrics
Here are some common metrics exposed by django-prometheus that you might want to use for setting up alerting rules:
Request Latency:
- Metric: http_request_duration_seconds
- Description: Measures the duration of HTTP requests.
Request Count:
- Metric: http_requests_total
- Description: Count of total HTTP requests received by the application.
Request Exceptions:
- Metric: django_http_requests_exceptions_total
- Description: Number of exceptions raised during HTTP requests.
Database Query Count:
- Metric: django_db_queries_total
- Description: Total number of database queries executed.
Cache Operations Count:
- Metric: django_cache_get_total, django_cache_set_total, django_cache_delete_total
- Description: Count of cache get/set/delete operations.
Model Saves Count:
- Metric: django_model_inserts_total, django_model_updates_total, django_model_deletes_total
- Description: Total number of inserts, updates, and deletes on models.
Middleware Exceptions:
- Metric: django_middleware_exceptions_total
- Description: Number of exceptions raised by middleware.
Setting Up Alerting Rules
Here are some basic alert rules you might consider:
High Exception Rate:
- alert: HighExceptionRate
  expr: rate(django_http_requests_exceptions_total[5m]) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High rate of exceptions"
    description: "More than 5 exceptions per minute over the last 5 minutes."
High Request Latency:
- alert: HighRequestLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "High request latency"
    description: "99th percentile request latency is greater than 500ms over the last 10 minutes."
High Database Query Count:
- alert: HighDBQueryCount
  expr: rate(django_db_queries_total[5m]) > 100
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High database query count"
    description: "More than 100 database queries per minute over the last 10 minutes."
High Cache Miss Rate:
- alert: HighCacheMissRate
  expr: rate(django_cache_get_total{result="miss"}[5m]) / rate(django_cache_get_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High cache miss rate"
    description: "Cache miss rate is greater than 10% over the last 5 minutes."
Finding Metrics
To find the complete list of metrics exposed by django-prometheus, you can:
Explore the Metrics endpoint: Navigate to your Django application’s /metrics endpoint in your browser or use a tool like curl.
curl http://<your-django-app>/metrics
Documentation: Check the official django-prometheus documentation for more details on available metrics.
These steps should help you establish a good monitoring foundation for your Django application using django-prometheus. Adjust thresholds and metrics as needed based on your application's specific requirements and workload.
The model is GPT-4o and the prompt was:
what's a good started pack of alerting rules for django-prometheus? i remember seeing a metric about exceptions, for example, but now that we fixed and deployed that, i can't find it anymore...
is there a list of metrics somewhere we can peruse?
I would say GPT answered the first question pretty well, and correctly identified the django_http_requests_exceptions_total metric I was looking for, although I wouldn't alert on the rate there. I'd like to know if there are any exceptions percolating out of the app, at least in production...
It looks like metrics are mostly defined in the middleware.py file. I looked there because I thought (incorrectly, at first) that the exporter wouldn't show metrics (like exceptions) unless they had a value, but it turns out that, indeed, just dumping the /metrics endpoint is good enough. Here's what it looks like in our deployment:
# curl -s https://example.torproject.net/metrics | grep HELP
# HELP python_gc_objects_collected_total Objects collected during gc
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# HELP python_gc_collections_total Number of times this generation was collected
# HELP python_info Python platform information
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# HELP process_resident_memory_bytes Resident memory size in bytes.
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# HELP process_open_fds Number of open file descriptors.
# HELP process_max_fds Maximum number of open file descriptors.
# HELP django_model_inserts_total Number of insert operations by model.
# HELP django_model_updates_total Number of update operations by model.
# HELP django_model_deletes_total Number of delete operations by model.
# HELP django_migrations_unapplied_total Count of unapplied migrations by database connection
# HELP django_migrations_applied_total Count of applied migrations by database connection
# HELP django_db_new_connections_total Counter of created connections by database and by vendor.
# HELP django_db_new_connections_created Counter of created connections by database and by vendor.
# HELP django_db_new_connection_errors_total Counter of connection failures by database and by vendor.
# HELP django_db_execute_total Counter of executed statements by database and by vendor, including bulk executions.
# HELP django_db_execute_created Counter of executed statements by database and by vendor, including bulk executions.
# HELP django_db_execute_many_total Counter of executed statements in bulk operations by database and by vendor.
# HELP django_db_errors_total Counter of execution errors by database, vendor and exception type.
# HELP django_db_query_duration_seconds Histogram of query duration by database and vendor.
# HELP django_db_query_duration_seconds_created Histogram of query duration by database and vendor.
# HELP django_http_requests_before_middlewares_total Total count of requests before middlewares run.
# HELP django_http_requests_before_middlewares_created Total count of requests before middlewares run.
# HELP django_http_responses_before_middlewares_total Total count of responses before middlewares run.
# HELP django_http_responses_before_middlewares_created Total count of responses before middlewares run.
# HELP django_http_requests_latency_including_middlewares_seconds Histogram of requests processing time (including middleware processing time).
# HELP django_http_requests_latency_including_middlewares_seconds_created Histogram of requests processing time (including middleware processing time).
# HELP django_http_requests_unknown_latency_including_middlewares_total Count of requests for which the latency was unknown (when computing django_http_requests_latency_including_middlewares_seconds).
# HELP django_http_requests_unknown_latency_including_middlewares_created Count of requests for which the latency was unknown (when computing django_http_requests_latency_including_middlewares_seconds).
# HELP django_http_requests_latency_seconds_by_view_method Histogram of request processing time labelled by view.
# HELP django_http_requests_latency_seconds_by_view_method_created Histogram of request processing time labelled by view.
# HELP django_http_requests_unknown_latency_total Count of requests for which the latency was unknown.
# HELP django_http_requests_unknown_latency_created Count of requests for which the latency was unknown.
# HELP django_http_ajax_requests_total Count of AJAX requests.
# HELP django_http_ajax_requests_created Count of AJAX requests.
# HELP django_http_requests_total_by_method_total Count of requests by method.
# HELP django_http_requests_total_by_method_created Count of requests by method.
# HELP django_http_requests_total_by_transport_total Count of requests by transport.
# HELP django_http_requests_total_by_transport_created Count of requests by transport.
# HELP django_http_requests_total_by_view_transport_method_total Count of requests by view, transport, method.
# HELP django_http_requests_total_by_view_transport_method_created Count of requests by view, transport, method.
# HELP django_http_requests_body_total_bytes Histogram of requests by body size.
# HELP django_http_requests_body_total_bytes_created Histogram of requests by body size.
# HELP django_http_responses_total_by_templatename_total Count of responses by template name.
# HELP django_http_responses_total_by_templatename_created Count of responses by template name.
# HELP django_http_responses_total_by_status_total Count of responses by status.
# HELP django_http_responses_total_by_status_created Count of responses by status.
# HELP django_http_responses_total_by_status_view_method_total Count of responses by status, view, method.
# HELP django_http_responses_total_by_status_view_method_created Count of responses by status, view, method.
# HELP django_http_responses_body_total_bytes Histogram of responses by body size.
# HELP django_http_responses_body_total_bytes_created Histogram of responses by body size.
# HELP django_http_responses_total_by_charset_total Count of responses by charset.
# HELP django_http_responses_total_by_charset_created Count of responses by charset.
# HELP django_http_responses_streaming_total Count of streaming responses.
# HELP django_http_responses_streaming_created Count of streaming responses.
# HELP django_http_exceptions_total_by_type_total Count of exceptions by object type.
# HELP django_http_exceptions_total_by_view_total Count of exceptions by view.
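For anyone who would rather have that dump as an index than as grep output, here is a minimal Python sketch that turns the `# HELP` lines into a name-to-description mapping. It assumes the text follows the Prometheus exposition format shown above; `index_metrics` and `SAMPLE` are names made up for this illustration and are not part of django-prometheus.

```python
import re

# Matches exposition-format comment lines such as:
#   # HELP django_model_inserts_total Number of insert operations by model.
HELP_LINE = re.compile(r"^# HELP (?P<name>[A-Za-z_:][A-Za-z0-9_:]*) ?(?P<help>.*)$")


def index_metrics(exposition_text):
    """Return {metric_name: help_text} for every '# HELP' line in the text."""
    index = {}
    for line in exposition_text.splitlines():
        match = HELP_LINE.match(line.strip())
        if match:
            index[match.group("name")] = match.group("help")
    return index


# Three HELP lines copied from the dump above, plus a TYPE line that is skipped.
SAMPLE = """\
# HELP django_model_inserts_total Number of insert operations by model.
# TYPE django_model_inserts_total counter
# HELP django_http_exceptions_total_by_type_total Count of exceptions by object type.
# HELP django_http_exceptions_total_by_view_total Count of exceptions by view.
"""

if __name__ == "__main__":
    for name, help_text in sorted(index_metrics(SAMPLE).items()):
        print(f"{name}: {help_text}")
```

To run it against a live deployment, the text could be fetched with `urllib.request.urlopen(url).read().decode()` and passed to `index_metrics`, assuming the endpoint is reachable without authentication.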
I have finally found recording rules in the examples folder, but those cannot directly be used for alerting.
Interestingly, GPT screwed up the latency metrics: there's no such thing as a http_request_duration_seconds_bucket metric in this exporter, from what i can tell. It seems the closest would be django_http_requests_latency_seconds_by_view_method_bucket.
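A guessed metric name like the one above is easy to catch mechanically. The following is a rough Python sketch, not a PromQL parser: it assumes every metric name in an expression is followed by a `[` range or a `{` label selector, and that `_bucket`, `_sum` and `_count` series map back to the base name in the `# HELP` line. The names `exposed_names`, `unknown_metrics` and `DUMP` are hypothetical.

```python
import re

# Suffixes of histogram and summary series; the HELP line only names the
# base metric, so these are stripped before looking a name up.
SERIES_SUFFIXES = ("_bucket", "_sum", "_count")


def exposed_names(exposition_text):
    """Collect metric names from the '# HELP' lines of a /metrics dump."""
    return set(re.findall(r"^# HELP (\S+)", exposition_text, flags=re.MULTILINE))


def unknown_metrics(expr, known):
    """Return names used with a selector in expr that are not in known."""
    missing = []
    # Assumption: a metric name is an identifier directly followed by '[' or '{'.
    for name in re.findall(r"([A-Za-z_:][A-Za-z0-9_:]*)\s*[\[{]", expr):
        base = name
        for suffix in SERIES_SUFFIXES:
            if name.endswith(suffix):
                base = name[: -len(suffix)]
                break
        if name not in known and base not in known:
            missing.append(name)
    return missing


# Two HELP lines copied from the dump in this thread.
DUMP = """\
# HELP django_http_requests_latency_seconds_by_view_method Histogram of request processing time labelled by view.
# HELP django_http_exceptions_total_by_type_total Count of exceptions by object type.
"""

if __name__ == "__main__":
    known = exposed_names(DUMP)
    for expr in (
        "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5",
        "rate(django_http_requests_latency_seconds_by_view_method_bucket[5m])",
    ):
        print(expr, "->", unknown_metrics(expr, known) or "ok")
```

The same check can be run over every expr in a rules file before deploying it, which is worthwhile because a rule on a name that is not exposed never fires and never reports an error. Bare metric names without a selector are not detected by this sketch.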
So far I ended up with this:
groups:
  - name: tpa_donate
    rules:
      - alert: DjangoExceptions
        expr: sum(increase(django_http_requests_exceptions_total[5m])) by (app, namespace) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Site {{ $labels.app }} ({{ $labels.namespace }}) generating exceptions"
          description: "Django site {{ $labels.app }} (in {{ $labels.namespace }}) generated {{ $value }} exceptions in the last 5 minutes"
          runbook: TODO
      - alert: DjangoHighRequestLatency
        # threshold of 0.5 seconds, matching the 500ms in the description
        expr: sum(histogram_quantile(0.99, sum(rate(django_http_requests_latency_seconds_by_view_method_bucket[5m])) without (view, method))) by (app, namespace) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on site {{ $labels.app }} ({{ $labels.namespace }})"
          description: "99th percentile request latency is greater than 500ms ({{ $value }}) over the last 10 minutes on Django site {{ $labels.app }} ({{ $labels.namespace }})"
          runbook: TODO
... but from what I gathered, you do not seem interested in having such a sample file here, so I'll just take my toys and go play elsewhere.
Have a nice day! :)