/cupcake

Quick and dirty HTTP(S)/TCP endpoint monitor

Primary LanguagePythonMIT LicenseMIT

Cupcake

Build Status Docker Pulls

Table of Contents

This is a very simple HTTP, HTTPS and TCP endpoint monitor intended to be the simplest thing that works.

It will work through a file that defines groups of environments and endpoints.

If an endpoint times out for connection or if an HTTP/HTTPS endpoint returns a different status code than is expected, then an alert is processed. Cupcake records previous alerts in a backing database and therefore will only alert once for a failure, but will note a return to service for an endpoint together with an approximate length of time that the outage occurred for.

Expected status codes for HTTP/HTTPS services is specified with a regular expression.

At the moment Cupcake is able to emit alerts via a webhook URL such as the one used by Slack's custom webhook integration and also by sending a JSON blob to an SNS topic.

If the environment variable SUMMARY_ENABLED is "True", Cupcake will emit a summary digest at startup and every SUMMARY_SLEEP_SECONDS afterwards to the alert group called summary ( see Alert definition file, below).

Environment variables

Name Description Default
DEBUG Whether to produce debug messages in the log False
SLEEP_SECONDS Number of seconds to yield between runs 60
ENDPOINT_DEFINITIONS_FILE Full path or S3 URL of endpoint definitions file /opt/app/config/endpoints.json
ALERT_DEFINITIONS_FILE Full path or S3 URL of alert definitions file /opt/app/config/alerts.json
METRICS_DEFINITIONS_FILE Full path or S3 URL of metrics definitions file /opt/app/config/metrics.json
CONNECTION_TIMEOUT_SECONDS Number of seconds before HTTP(S) and TCP connections will timeout 10
DB_TYPE Type of database to use. Possible values: sqlite or postgresql sqlite
SUMMARY_ENABLED Whether to emit a summary / digest message to a subset of alert channels True
SUMMARY_SLEEP_SECONDS Number of seconds between emitting summary digests 86400
REMOVE_UNKNOWN_ACTIVES Whether to delete active alerts that are no longer present in alert defs False

Note:

It is important to have CONNECTION_TIMEOUT_SECONDS set to a value less than 30, as when a process in a containerised environment such as ECS is redeployed or stopped then it will be given a SIGTERM signal and a 30 second timeout before a SIGKILL signal is sent that will kill the process immediately. Cupcake tests whether it has been requested to stop after each endpoint measurement and intercepts SIGTERM and SIGINT in order to try and quit as soon as possible after receiving them.

sqlite

To use sqlite as the backing database, set the following:

DB_TYPE='sqlite'
Name Description Default
DB_NAME This is the full path of the .db file to use. cupcake.db

PostgreSQL

To use PostgreSQL as the backing database, set the following:

DB_TYPE='postgresql'
Name Description Default
DB_NAME This is the database name
DB_HOST This is the database host and port in host:port format
DB_USER This is the username to connect to database with
DB_PASSWORD This is the password to connect to database with

Endpoint definition file

Endpoints are organised in the following hierarchy:

environment_group (e.g. "customer 1")
|
+- environment (e.g. "production")
   |
   +- endpoint_group (e.g. "services")
      |
      +- endpoint (e.g. "api")

This gives a great deal of flexibility and range for defining collections of endpoints.

Example

The following defines an environment group called "customer ABC" which has an environment called "production". Within that environment are two endpoint groups - "external" and "internal".

The "external" endpoint group contains an HTTPS URL for a website and a regular expression that defines the HTTP status code that it expects to receive (any status code in range 2xx). An optional GUID will be added to the URL query string with the key cupcake_trace_id (which is the default). An optional attempt number will be added to the URL query string with the key cupcake_attempt (which is the default). The URL including the TraceID will be emitted in any alert incident that occurs allowing this to be located in server access logs. retry signifies to retry this endpoint 2 times if failure encountered. By default all endpoints that fail due to a timeout are retried 3 times - the retry value overrides that. It also sets a customer timeout of 30s, which will override the default set by CONNECTION_TIMEOUT_SECONDS

The "internal" endpoint group contains a TCP URL for a Redis server. It is assumed for this example that Cupcake is situated on a server that is inside the private network and therefore is able to lookup a host named "redis.internal" using some kind of internal DNS scheme (e.g. Route53).

The website endpoint also defines a threshold for the response timing where anything greater than 200 milliseconds will cause an incident to be raised.

{
  "@type": "endpoint-definitions",
  "groups": [
    {
      "@type": "environment-group",
      "id": "customer ABC",
      "logo": "",
      "environments": [
        {
          "@type": "environment",
          "id": "production",
          "endpoint-groups": [
            {
              "@type": "endpoint-group",
              "id": "external",
              "enabled": "true",
              "logo": "",
              "endpoints": [
                {
                  "@type": "endpoint",
                  "id": "website",
                  "url": "https://www.example.com/index.html",
                  "expected": "^[2]\\d\\d$",
                  "threshold": {
                    "min": 0,
                    "max": 200
                  },
                  "retry": 3,
                  "timeout": 30,
                  "appendTraceID": true,
                  "traceArgumentKey": "cupcake_trace_id",
                  "appendAttempt": true,
                  "attemptArgumentKey": "cupcake_attempt"
                }
              ]
            },
            {
              "@type": "endpoint-group",
              "id": "internal",
              "logo": "",
              "enabled": "true",
              "endpoints": [
                {
                  "@type": "endpoint",
                  "id": "redis",
                  "url": "tcp://redis.internal:6379"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

Alert definition file

Alerts are defined in a separate file. Each alert has a type, an ID and whatever properties it needs to operate. Alerts are grouped together by defining an alert-group which references the id of the alerts in an array. See the next section for an example.

There are two standard groups: default and summary.

The default group contains the IDs of alerts that should receive incident notifications in the absence of an overriding instruction in the endpoints hierarchy. The summary group contains the IDs of alerts that should receive the summary digest that is emitted at startup and periodically thereafter.

Example

{
  "@type": "alert-definitions",
  "groups": [
    {
      "@type": "alert-group",
      "id": "default",
      "alerts": [
        "my-slack-channel",
        "my-aws-list"
      ]
    },
    {
      "@type": "alert-group",
      "id": "summary",
      "alerts": [
        "my-slack-channel"
      ]
    },
    {
      "@type": "alert-group",
      "id": "mysite",
      "alerts": [
        "my-slack-channel"
      ]
    }
  ],
  "alerts": [
    {
      "@type": "alert-slack",
      "id": "my-slack-channel",
      "url": "https://hooks.slack.com/services/xxx/yyy/zzz"
    },
    {
      "@type": "alert-sns",
      "id": "my-aws-list",
      "arn": "xxx",
      "region": "yyy"
    }
  ]
}

Metrics definitions file

Metrics output is also defined in a separate file. Like alerts, different metrics output streams are organised into groups, with default being the default collection of metrics streams that response times will be sent to.

Example

{
  "@type": "metrics-definitions",
  "groups": [
    {
      "@type": "metrics-group",
      "id": "default",
      "metrics": [
        "cloudwatch"
      ]
    }
  ],
  "metrics": [
    {
      "@type": "metrics",
      "id": "cloudwatch",
      "provider": {
        "@type": "cloudwatch",
        "region": "eu-west-1",
        "namespace": "CUPCAKE"
      }
    }
  ]
}