loghost-boshrelease

This is a BOSH release to gather, store and analyze syslog events forwarded by BOSH VMs. It currently uses RSyslog, which is pre-installed on the stemcell.

Only Linux stemcells are supported at the moment.

Introduction

Platform logs are usually sent to ELK stacks, which store and index events on the fly. Users can then build Kibana dashboards by extracting metrics from Elasticsearch queries.

With the rise of microservice architectures, the volume of emitted logs has exploded, making these ELK stacks very demanding in hardware and therefore costly. In addition, these stacks are often built with heavy redundancy and high availability even though most of the emitted events are not critical.

The idea here is to provide a much more lightweight architecture that offers only the most essential log-processing features:

  • mid-term storage for debugging and production incident analysis
  • hardware-efficient generation of metrics
  • redundancy and availability matching the actual criticality of the logs

This is achieved by combining proven technologies such as RSyslog with modern tools like Prometheus. The bridge between logs and metrics is provided by a brilliant tool, grok_exporter.

Components

Concentrator

The job loghost_concentrator configures the local rsyslogd to store received logs on the persistent disk.

Format

Only syslog events received in RFC5424 format with instance@47450 as the Structured Data ID are handled. The private enterprise number 47450 is the one emitted by syslog-release, which is generally used to forward VM log events to a given endpoint.

Generated files

Received logs are stored on the persistent disk under /var/vcap/store/loghost/{path}, where {path} depends on the parsed Structured Data fields of the event.

Assuming logs are forwarded by syslog-release, the parsed fields are:

  • $.director: the configured name of the BOSH director
  • $.deployment: the name of the deployment from which the event was sent
  • $.group: the name of the instance group from which the event was sent

Finally, logs are stored under /var/vcap/store/loghost/{$.director}/{$.deployment}/{$.group}.log

Rotation

The job also configures local logrotate in order to rotate and compress logs every hour. Rotated logs are stored in the same directories with the -%Y%m%d%H.gz suffix.

The number of rotations kept can be configured with the loghost_concentrator.logrotate.max-hours property, which defaults to 360 (i.e. 15 days).
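As an illustration, retention could be extended to 30 days by overriding this property in the job definition. Since logs are rotated hourly, 30 days correspond to 30 × 24 = 720 rotations. This is a minimal sketch; the value 720 is only an example.

```yaml
jobs:
  - name: loghost_concentrator
    release: loghost
    properties:
      loghost_concentrator:
        logrotate:
          # 30 days x 24 hourly rotations per day = 720
          # (example value; the default is 360, i.e. 15 days)
          max-hours: 720
```

Keep in mind that a longer retention increases the space needed on the persistent disk.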

Forwarding and clustering

The job can also re-forward received syslog events under specified conditions. This can be useful for:

  • Clustering multiple concentrators to create a form of backup across independent BOSH directors
  • Forwarding business- or security-critical events to an external log handling platform

Forwarding is configured through the loghost_concentrator.syslog.forward property by defining target objects as follows:

<target-name>:
  conditions:
  - <condition>
  - ...
  targets:
    - address: hostname
      port: port
      transport: tcp|udp|relp
    - ...

Where:

  • each <condition> is a valid RainerScript expression enclosed in parentheses. Multiple conditions can be given; all of them must be true to trigger forwarding
  • targets is a list of syslog endpoints to which matching events are forwarded. When multiple targets are defined, matching events are forwarded to all of them

Example:

jobs:
  - name: loghost_concentrator
    release: loghost
    properties:
      loghost_concentrator:
        syslog:
          forward:
            my-forward-target:
              conditions:
              - ($.director   isequal "local-director-name")
              - ($.deployment isequal "cf")
              targets:
                - address: target1.hostname.example.com
                  port: 514
                  transport: tcp
                - address: target2.hostname.example.com
                  port: 514
                  transport: tcp

DNS

Assuming that your deployment uses bosh-dns, the job loghost_dns can be used to define new aliases.

DNS aliases are configured from the loghost_dns.aliases key with the same syntax as the aliases key of bosh-dns job.

Example:

jobs:
  - name: loghost_dns
    release: loghost
    properties:
      loghost_dns:
        aliases:
          my.alias.internal:
          - 127.0.0.1
          my.other.alias.internal:
          - '*.collector-z1.default.logsearch.bosh'
          - '*.collector-z2.default.logsearch.bosh'

Exporter

The loghost_exporter job installs and configures the grok_exporter. This brilliant program processes log files and computes Prometheus metrics according to parse rules given in grok format.

Parsing rules are defined by the loghost_exporter.metrics key, using exactly the same syntax as the metrics section of the grok_exporter configuration.

In addition, the loghost_exporter.directors and loghost_exporter.deployments keys must be configured to list the log files that the exporter should watch.

Note: A limitation in the grok_exporter implementation forces watched directories to pre-exist at exporter startup. Because rsyslog files are created on the fly when events are received, the job creates required directories in its pre-start script.
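A minimal sketch of an exporter configuration is shown below. The metric entry uses the grok_exporter metrics syntax (type, name, help, match). The metric name, help text and match pattern are hypothetical, and the list form assumed for directors and deployments should be checked against the job's spec.

```yaml
jobs:
  - name: loghost_exporter
    release: loghost
    properties:
      loghost_exporter:
        # Assumption: plain lists of the director and deployment names
        # whose log directories should be watched.
        directors:
          - local-director-name
        deployments:
          - cf
        metrics:
          # Hypothetical rule: count log lines that mention an
          # out-of-memory event.
          - type: counter
            name: custom_out_of_memory_total
            help: Number of log lines reporting an out-of-memory event.
            match: 'out of memory'
```

Whether a user-supplied metrics list replaces or extends the rules added by the release's ops-files is not covered here, so it is worth verifying before deploying.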

In addition to user-defined metrics, the exporter provides builtin metrics.

Ops-files provided in the release also provide metrics, as described in the usage section.

Alerts

The job loghost_alerts defines the following alerts for your prometheus-boshrelease deployment:

  • LoghostNoLogReceived: triggers if the exporter reports no processed logs in the last 15 minutes
  • LoghostDroppedMessages: triggers when there is an increase of "failed to write to target.example.net:6067" in the logs

When the loghost_alerts.security.enabled key is set to true (default false), the job also defines the following alerts:

  • SecurityTooManySystemAuthFailures: triggers when audispd reports too many authentication failures. audispd logs are generated by all virtual machines deployed by BOSH
  • SecurityTooManyUaaClientFailures: triggers when the uaa component reports too many client authentication failures
  • SecurityTooManyUaaUserFailures: triggers when the uaa component reports too many user authentication failures
  • SecurityTooManyDiegoSshFailures: triggers when the ssh_proxy component running on the scheduler instance reports too many failed SSH authentications to containers
  • SecurityTooManyDiegoSshSuccess: triggers when the ssh_proxy component running on the scheduler instance reports too many successful SSH authentications to containers

Alert thresholds and evaluation times can be configured from the job's spec.
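For example, the security alerts could be switched on with a job definition along these lines. This is a sketch based only on the key named above; the threshold and evaluation-time properties are not listed in this document, so they are left out rather than guessed.

```yaml
jobs:
  - name: loghost_alerts
    release: loghost
    properties:
      loghost_alerts:
        security:
          # Enables the Security* alerts (the default is false).
          enabled: true
```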

Dashboards

The job loghost_dashboards adds the following Grafana dashboards to your prometheus-boshrelease deployment:

  • a global overview giving the system status and the number of processed logs per rule, deployment and instance
  • a security dashboard giving information on authentications, available when the loghost_dashboards.security.enabled key is set to true

Usage

Step 1: Deploy loghost

First, add a loghost instance to the deployment of your choice. You can use the following ops-files:

  • manifests/operations/loghost-concentrator-enable.yml
  • manifests/operations/loghost-exporter-enable.yml
  • manifests/operations/loghost-exporter-enable-security.yml

These add the loghost instance with basic features enabled:

  • received logs written to /var/vcap/store/loghost
  • grok_exporter reading and generating metrics from received logs

Step 2: Forward all logs to loghost instance

The simplest way to forward all logs at once is to create a runtime config that uses syslog-release.

With file runtime-syslog-forward.yml:

addons:
- exclude:
    instance_groups:
    - loghost
  jobs:
  - name: syslog_forwarder
    properties:
      syslog:
        address: q-s0.loghost.default.((deployment)).bosh
        director: ((director_name))
        transport: udp
    release: syslog
  name: syslog_forwarder
releases:
- name: syslog
  sha1: 658fe5d6f049ec50383c09c0b227261251bfd4eb
  url: https://artifactory/cloudfoundry/syslog/syslog-11.6.1-ubuntu-xenial-621.tgz
  version: 11.6.1

Upload to bosh director: bosh update-runtime-config --name syslog-forward runtime-syslog-forward.yml
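If the concentrator has also been configured to listen on TCP (see loghost-concentrator-enable-tcp.yml in the ops-files reference), the addon above could presumably forward over TCP instead. Only the properties block of the syslog_forwarder job is shown. This is a sketch under that assumption, and the port is left out because it depends on how the concentrator is configured.

```yaml
properties:
  syslog:
    address: q-s0.loghost.default.((deployment)).bosh
    director: ((director_name))
    # Assumption: the concentrator accepts TCP connections on the port
    # the forwarder already uses.
    transport: tcp
```

UDP keeps forwarding cheap but can silently drop events, whereas TCP trades some overhead for more reliable delivery.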

Step 3: Add alerts and dashboard to prometheus

Add the following ops-files to your prometheus deployment:

  • manifests/operations/prometheus/loghost-enable.yml
  • manifests/operations/prometheus/loghost-enable-security.yml

They will:

  • define scrape config based on bosh_exporter discovery
  • define new alerts
  • add dashboards to Grafana

Reference

Ops-files

| name | description |
| --- | --- |
| loghost-concentrator-enable.yml | add instance with loghost_concentrator job listening on udp |
| loghost-concentrator-enable-tcp.yml | configure loghost_concentrator to listen on tcp in addition to udp |
| loghost-dns-enable.yml | add loghost_dns job with empty aliases list |
| loghost-exporter-enable.yml | add loghost_exporter job, which spawns grok_exporter with a default set of metrics |
| loghost-exporter-enable-security.yml | add security metrics to loghost_exporter job, with grok rules for uaa and audispd |
| prometheus/loghost-enable.yml | add discovery scraping of grok_exporter, default alerts and dashboards |
| prometheus/loghost-enable-security.yml | add security alerts and dashboards |

Metrics

In addition to the grok_exporter built-in metrics, the release defines:

| name | dimensions | type | description |
| --- | --- | --- | --- |
| loghost_total | director, deployment, group | Counter | logs processed |
| loghost_error_total | director, deployment, group | Counter | logs detected as level error |
| loghost_auth_failures | director, deployment, group, source, ip | Counter | system authentication failures |
| loghost_auth_failures_last_5m | director, deployment, group, source, ip | Gauge | system authentication failures in the last 5 minutes (*) |
| loghost_auth_success | director, deployment, group, source, ip, username | Counter | system authentication success |
| loghost_auth_success_last_5m | director, deployment, group, source, ip, username | Gauge | system authentication success in the last 5 minutes (*) |
| loghost_uaa_client_login_success | director, deployment, group, ip, clientid | Counter | UAA client authentication success |
| loghost_uaa_client_login_success_last_5m | director, deployment, group, ip, clientid | Gauge | UAA client authentication success in the last 5 minutes (*) |
| loghost_uaa_client_login_failure | director, deployment, group, ip, clientid | Counter | UAA client authentication failures |
| loghost_uaa_client_login_failure_last_5m | director, deployment, group, ip, clientid | Gauge | UAA client authentication failures in the last 5 minutes (*) |
| loghost_uaa_user_login_success | director, deployment, group, ip, username | Counter | UAA user authentication success |
| loghost_uaa_user_login_success_last_5m | director, deployment, group, ip, username | Gauge | UAA user authentication success in the last 5 minutes (*) |
| loghost_uaa_user_login_failure | director, deployment, group, ip, username | Counter | UAA user authentication failures |
| loghost_uaa_user_login_failure_last_5m | director, deployment, group, ip, username | Gauge | UAA user authentication failures in the last 5 minutes (*) |

With dimension values:

  • director, deployment, group: BOSH director name, deployment name and instance group name from where the log was originally emitted
  • source: the exe field of type=USER.* message of audispd
  • ip: the remote address from which the authentication was attempted
  • clientid: the clientid used to authenticate a client on UAA
  • username: the username used to authenticate a user on UAA

(*) Tech note: Because the dimension values of a metric are created over time as matching logs are encountered, a newly created series has no earlier sample to compare against, so the Prometheus rate and increase functions miss its first increments. We therefore cannot rely on them to count failures over a period of time. As a workaround, we compute this metric manually with a hackish recording rule defined as:

  sum(<metric> or <metric>{} * 0) by (<dimensions...>)
  -
  sum(<metric> offset 5m or <metric>{} * 0) by (<dimensions...>)
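As an illustration, the template above could be instantiated for loghost_auth_failures in a standard Prometheus rule file as sketched below. The group name is hypothetical, and the way the release actually ships its rules to prometheus-boshrelease may differ.

```yaml
groups:
  # Hypothetical group name, used for illustration only.
  - name: loghost-security-records
    rules:
      - record: loghost_auth_failures_last_5m
        # For a series that did not exist 5 minutes ago, the "offset 5m"
        # selector returns nothing, so the "or ... * 0" branch substitutes 0
        # and the whole current counter value is reported.
        expr: |
          sum(loghost_auth_failures or loghost_auth_failures * 0)
            by (director, deployment, group, source, ip)
          -
          sum(loghost_auth_failures offset 5m or loghost_auth_failures * 0)
            by (director, deployment, group, source, ip)
```

The by clause lists the dimensions documented for loghost_auth_failures; the other *_last_5m metrics would follow the same pattern with their own dimensions.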