This project contains the auto-heal service. It receives alert notifications from the Prometheus alert manager and executes Ansible playbooks to resolve the root cause.
Most of the configuration of the auto-heal service is kept in a YAML
configuration file. The name of the configuration file is specified using the
--config-file command line option. If this option isn't given explicitly, the
service will try to load the autoheal.yml file from the current working
directory.
In addition to the configuration file, the auto-heal service also uses command
line options to configure the connection to the Kubernetes API and the log
level. Use the -h option to get a complete list of these command line options.
The --kubeconfig command line option is used to specify the location of the
Kubernetes client configuration file. When running outside of a Kubernetes
cluster the auto-heal service will use $HOME/.kube/config by default, the same
file used by the kubectl command. When running inside a Kubernetes cluster it
will use the configuration that Kubernetes mounts automatically in the pod file
system. So in most cases this command line option doesn't need to be given
explicitly.
The --logtostderr option sends the log messages to the standard error stream
instead of to log files, which is convenient when running the auto-heal
service in both development and production environments.
Assuming that you want to use your own my.yml configuration file, a typical
command line will be the following:

    $ autoheal server --config-file=my.yml --logtostderr

See the autoheal.yml file for a complete example.
The first section of the configuration file is named awx and it contains all
the details needed to connect to the AWX or Ansible Tower server:

    awx:
      address: https://myawx.example.com/api
      proxy: http://myproxy.example.com:3128
      credentialsRef:
        namespace: my-namespace
        name: my-awx-credentials
      tlsRef:
        namespace: my-namespace
        name: my-awx-ca
      project: "Auto-heal"

The address parameter is the URL of the API of the AWX server. It should
contain the /api suffix, but not the /v1 or /v2 suffix, as the auto-heal
service will internally decide which version to use.
The proxy parameter is optional, and it indicates what HTTP proxy should be
used to connect to the AWX API. If this parameter is not specified, or if it is
empty, then the connection will be direct to the AWX server, without a proxy.
The credentialsRef parameter is a reference to the Kubernetes secret that
contains the user name and password used to connect to the AWX API. That secret
should contain the username and password keys. For example:

    apiVersion: v1
    kind: Secret
    metadata:
      namespace: my-namespace
      name: my-awx-credentials
    data:
      username: YWxlcnQtaGVhbGVy
      password: ...

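The values under the data key of a Kubernetes secret are base64 encoded; the username value in the example above decodes to alert-healer. As a minimal sketch, such a value can be produced with the Go standard library. The encodeSecretValue helper is hypothetical and exists only for this illustration.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeSecretValue is a hypothetical helper that returns the base64 form
// expected under the "data" key of a Kubernetes secret.
func encodeSecretValue(value string) string {
	return base64.StdEncoding.EncodeToString([]byte(value))
}

func main() {
	fmt.Println(encodeSecretValue("alert-healer")) // prints "YWxlcnQtaGVhbGVy"
}
```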
The tlsRef parameter is a reference to the Kubernetes secret that contains
the certificates used to connect to the AWX API. That secret should contain the
ca.crt key, for example:

    apiVersion: v1
    kind: Secret
    metadata:
      namespace: my-namespace
      name: my-awx-ca
    data:
      ca.crt: |-
        LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvVENDQWVXZ0F3SUJBZ0lKQUxNRXB6OWxa
        VkVzdzI3Sm5BYlMyejNhbUF0YTc1QmNnVGcvOUFCdDV0VVc2VTJOKzkKbXc9PQotLS0tLUVORCBD
        ...

The project parameter is the name of the AWX project that contains the job
templates that will be used to run the playbooks.
The second important section of the configuration file is rules. It contains
the list of healing rules used by the auto-heal service to decide which action
to run for each received alert. For example:

    rules:
    - metadata:
        name: start-node
      labels:
        alertname: "NodeDown"
      awxJob:
        template: "Start node"
        extraVars: |-
          {
            "node": "{{ $labels.instance }}"
          }
    - metadata:
        name: start-service
      labels:
        alertname: ".*Down"
        service: ".*"
      awxJob:
        template: "Start service"

The above example contains two healing rules. The first rule will be
executed when the received alert contains a label named alertname with
a value that matches the regular expression NodeDown.
The second rule will be executed when the received alert contains the
labels alertname and service, with values matching the regular expressions
.*Down and .* respectively.
The metadata parameter of each rule is used to specify the name of
the rule, which is used by the auto-heal service to reference it in log
messages and in metrics.
The labels and annotations parameters of a rule are maps of strings
used to specify the labels and annotations that the alerts should
contain in order to match the rule. The keys of these maps are the names
of the labels or annotations. The values of these maps are regular
expressions that the values of those labels or annotations should match.
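The matching described above can be sketched in Go as follows. This is an illustration only, not the code of the service: the matches function is hypothetical, and anchoring the pattern so that it has to match the complete value is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether every pattern of the rule has a corresponding
// value in the alert that matches it. Wrapping the pattern in ^(?:...)$ is
// an assumption of this sketch, not documented behavior of the service.
func matches(patterns, values map[string]string) (bool, error) {
	for name, pattern := range patterns {
		value, present := values[name]
		if !present {
			// The alert lacks a label or annotation required by the rule.
			return false, nil
		}
		re, err := regexp.Compile("^(?:" + pattern + ")$")
		if err != nil {
			return false, fmt.Errorf("invalid pattern for %q: %v", name, err)
		}
		if !re.MatchString(value) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	rule := map[string]string{"alertname": ".*Down", "service": ".*"}
	alert := map[string]string{"alertname": "ServiceDown", "service": "etcd"}
	ok, err := matches(rule, alert)
	fmt.Println(ok, err) // prints "true <nil>"
}
```

The same function can be applied to the annotations map of a rule, since labels and annotations are matched in the same way.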
The awxJob parameter indicates which job template should be executed
when an alert matches the rule.
The template parameter is the name of the AWX job template.
The extraVars parameter is optional, and if specified it is used to
pass additional variables to the playbook, like the --extra-vars
option of the ansible-playbook command.
Note that in order to use this extraVars mechanism the AWX job template
should have the Prompt on launch box checked, otherwise the variables
passed will be ignored.
The values of all the parameters inside awxJob are processed as Go
templates before executing the job. These templates receive the details
of the alert inside the $labels and $annotations variables. For example,
to dynamically generate the name of the job template to execute from the
value of the template annotation of the alert:

    awxJob:
      template: "{{ $annotations.template }}"

Or to pass a variable node to the playbook, calculated from the
instance label:

    awxJob:
      template: "My template"
      extraVars: |-
        {
          "node": "{{ $labels.instance }}"
        }

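To see what such a template evaluates to, the processing can be approximated with the text/template package of the Go standard library. This is a minimal sketch: the alertData type and the render function are hypothetical, and declaring $labels and $annotations through a prefix added to the template text is just one possible way to provide them, assumed here for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// alertData is a hypothetical container for the details of an alert.
type alertData struct {
	Labels      map[string]string
	Annotations map[string]string
}

// render processes text as a Go template. The $labels and $annotations
// variables are declared by a prefix added to the template text, which is
// an assumption of this sketch.
func render(text string, data alertData) (string, error) {
	prefix := `{{ $labels := .Labels }}{{ $annotations := .Annotations }}`
	tmpl, err := template.New("awxJob").Parse(prefix + text)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	data := alertData{
		Labels:      map[string]string{"instance": "node1.example.com"},
		Annotations: map[string]string{"template": "Start node"},
	}
	extraVars, err := render(`{ "node": "{{ $labels.instance }}" }`, data)
	if err != nil {
		panic(err)
	}
	fmt.Println(extraVars)
}
```

With the sample alert above, the extraVars text is rendered with node1.example.com in place of the template expression.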
To run the automated tests of the project run this command:

    $ make check

To manually test the service without a running Prometheus alert manager
generating the alert notifications, you can use the *-alert.json files that
are inside the manifests directory. For example, to simulate the NodeDown
alert, start the server and then use curl to send the alert notification:

    $ autoheal server --config-file=my.yml --logtostderr
    $ curl --data @manifests/node-down-alert.json http://localhost:9099/alerts

To build the binary run this command:

    $ make

To build the RPM and the images, run this command:

    $ make build-images