Prometheus Workshop
Requirements
software name | recommended version |
---|---|
docker | 20.10.12 |
docker-compose | 1.29.2 |
The software versions do not have to match exactly, but the recommended versions are known to work.
Instructions
Execute `./run.sh -f` to start the Prometheus, Alertmanager, and mock service containers.
Execute `./run.sh` to do the same, but check the test results beforehand.
The mock service provides dummy metrics that you can test your PromQL expressions against.
Tasks
In the `rules` directory you will find various test files.
To make them pass, you have to provide appropriate recording rules or alerting rules that configure the expected behaviour. Each rule file must have a specific name: if the test file's path is, for example, `rules/alerts/some-alert.test.yml`, then the rule file's path should be `rules/alerts/some-alert.yml`.
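To illustrate how a test file and its rule file pair up, here is a minimal sketch of a test, assuming the tests follow the format of promtool's rule unit tests. The metric `some_metric`, the `job` label, and the alert name `Some_Alert` are hypothetical; the test files shipped in `rules` may look different.

```yaml
# Hypothetical test file: rules/alerts/some-alert.test.yml
rule_files:
  - some-alert.yml            # the rule file under test, same base name without ".test"

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Hypothetical input: the series stays at 0 for five samples
      - series: 'some_metric{job="demo"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      - eval_time: 4m
        alertname: Some_Alert  # hypothetical alert defined in some-alert.yml
        exp_alerts:
          - exp_labels:
              job: demo
```

Reading the `exp_alerts` (or `exp_samples`) section of a test is usually the quickest way to find out which labels and annotations a rule is expected to produce.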
Implement an alert based on service availability
Rule file path: `rules/alerts/instance-health-rules.yml`
Whenever a service instance has been unavailable for 2 minutes or more, an alert called `Instance_Down` should be raised.
Hints:
- use `for` to trigger the alert only if a condition lasts for a certain amount of time
- use the built-in `$labels` variable to reference labels from a given expression
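The two hints can be combined as in the following sketch. It assumes that availability is read from the `up` metric that Prometheus records for every scrape target; the group name and the annotation text are hypothetical, and the test file decides which labels and annotations are actually required.

```yaml
# Sketch of rules/alerts/instance-health-rules.yml
groups:
  - name: instance-health     # hypothetical group name
    rules:
      - alert: Instance_Down
        expr: up == 0         # assumption: a target that cannot be scraped reports up == 0
        for: 2m               # the condition must hold for 2 minutes before the alert fires
        annotations:
          # $labels exposes the labels of the series that triggered the alert
          summary: "Instance {{ $labels.instance }} of job {{ $labels.job }} is down"
```

Until the `for` duration has elapsed the alert is only pending, which is why a test evaluated too early would not see it firing.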
Implement an alert based on service response codes
Rule file path: `rules/alerts/errors-rules.yml`
Whenever the error rate exceeds 25%, an alert called `High_Error_Rate` should be raised.
Hints:
- use the built-in `$value` variable to reference the value of a given expression
- convert the value into a percentage to improve human readability
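One possible shape for such a rule is sketched below. The counter `http_requests_total`, its `code` label, the 5-minute window, and the grouping by `instance` are all assumptions made for illustration; substitute whatever the mock service actually exposes and whatever the test expects.

```yaml
# Sketch of rules/alerts/errors-rules.yml
groups:
  - name: errors              # hypothetical group name
    rules:
      - alert: High_Error_Rate
        # Hypothetical metric: share of 5xx responses among all responses,
        # multiplied by 100 so that the value is already a percentage.
        expr: |
          100 * sum by (instance) (rate(http_requests_total{code=~"5.."}[5m]))
              / sum by (instance) (rate(http_requests_total[5m]))
            > 25
        annotations:
          # $value holds the result of the expression for the firing series
          summary: 'Error rate on {{ $labels.instance }} is {{ printf "%.1f" $value }}%'
```

Scaling the ratio inside the expression keeps the threshold (`> 25`) and the annotation in the same unit, so neither has to convert between a fraction and a percentage.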
Implement a recording rule based on service response latency histogram
Rule file path: `rules/recording/ovp-metrics-rules.yml`
There should be three metrics derived from a histogram called `z_application_latency_seconds_bucket`:
- `application_latency_50` (the 50th percentile of response times)
- `application_latency_98` (the 98th percentile of response times)
- `application_response_rate` (responses produced per minute)
Hints:
- use the `histogram_quantile` function to calculate the quantiles
- use the metric with the `le="+Inf"` label to derive the response rate
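The hints above could translate into recording rules along these lines. This is a sketch only: the group name, the 5-minute rate window, and the aggregation over all labels except `le` are assumptions, and the test file may expect different windows or preserved labels.

```yaml
# Sketch of rules/recording/ovp-metrics-rules.yml
groups:
  - name: ovp-metrics         # hypothetical group name
    rules:
      - record: application_latency_50
        expr: histogram_quantile(0.50, sum by (le) (rate(z_application_latency_seconds_bucket[5m])))
      - record: application_latency_98
        expr: histogram_quantile(0.98, sum by (le) (rate(z_application_latency_seconds_bucket[5m])))
      - record: application_response_rate
        # The +Inf bucket counts every observation; rate() is per second, so multiply by 60
        expr: sum(rate(z_application_latency_seconds_bucket{le="+Inf"}[5m])) * 60
```

`histogram_quantile` needs the `le` label to survive aggregation, which is why the quantile rules group by `le` while the response-rate rule selects a single bucket and can drop it.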