\[._.]/ - Hapi and Healthy API

Version 6.x.x only supports hapi v17 and above!

This Hapi.js plugin provides a configurable route for /service-status (/health) API reporting which returns a varied output depending on the consumer headers, request type and query flags.

The primary consumer is a Local Traffic Manager (LTM), which load balances and adds/removes nodes from rotation based on the API return status. You can add an arbitrary number of tests to the test.node array (in config), which will run in parallel and report basic health status for your node. Keep in mind that an LTM will hit this API every 1-10 seconds, so the test functions should return quickly. Caching policy and what those tests actually are is entirely up to the application :)

NOTE: failing dependency services should never cause your node to be marked bad (lest you cascade failures down the chain and remove your entire app stack from the node pool). Your tests should only validate that your node is configured and running correctly (otherwise, an LTM would remove a good node from the pool only because another service went down).

A secondary consumer is a DevOps maintainer who wants to see the status of the service and get more detailed information on what's going on with it.

Examples:

what version/checksum/hash/git rev-parse HEAD is actually running (does it match the deployment manifest)?
what environment is the node configured to run (DEV, QA, STAGE, PROD, etc)?
what remote service environments is it set up to use (DEV, QA, STAGE, PROD, etc)?
what memcached servers are loaded in the pool?
why is it marked as good, bad or in a warning state (useful messages)?
how does the CPU/memory look on this node?
etc...

Query flags are available for verbose output (?v) to machines and humans. This API will report CPU and memory load for the system and for the hapi server process itself. The human friendly flag (?v&h) converts values from bytes to KB/MB/GB and usage to a percentage of the system rather than flat values.
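To make the human friendly conversion concrete, here is a rough sketch of that kind of formatting. The function names `formatBytes` and `formatPercent` are hypothetical, and the plugin's own formatting code may differ; decimal (1000-based) units are assumed because the sample output in the Spec section shows 17179869184 bytes as 17.18 GB.

```javascript
const os = require('os')

// Illustrative only: convert a byte count to a decimal (1000-based) unit string.
function formatBytes(bytes) {
  const units = ['B', 'KB', 'MB', 'GB', 'TB']
  let value = bytes
  let i = 0
  while (value >= 1000 && i < units.length - 1) {
    value /= 1000
    i++
  }
  return value.toFixed(2) + ' ' + units[i]
}

// Illustrative only: express a 0..1 fraction as a percentage string.
function formatPercent(fraction) {
  return (fraction * 100).toFixed(2) + '%'
}

// Values for the machine this script runs on
console.log(formatBytes(os.totalmem()))
console.log(formatPercent(os.freemem() / os.totalmem()))
```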

Installation:

npm i -S hapi-and-healthy

Demo:

Run the demo to see it in action and view the demo.js file for the code

git clone git@github.com:atomantic/hapi-and-healthy.git
cd hapi-and-healthy;
npm i;
npm test;

Tests:

You can run tests with npm test

Configuration Options

auth - (string|false) The name of the auth strategy (default is false)
custom - (object) Additional custom data to return (e.g. custom:{memcached:memcached.servers})
defaultContentType - (string) Default content type for requests (defaults to text/plain)
env - (string) The running environment of your app (e.g. DEV, QA, STAGE, PROD). This will be returned in verbose output for consumers wishing to know what environment your service thinks it's running in.
id - (string) An ID of the state of this system, like a checksum (default: '[no id provided]')
lang - (string) A language override for the human output health data (default: 'en'). This endpoint uses the Humanize Duration package, so any language supported by that library is valid here (fr, de, ko, etc)
name - (string) The name of your service (reported in verbose mode), probably supplied by your package.json
path - (string) An override path for the default '/service-status' endpoint
paths - (array) A list of available versioned paths on this service (e.g. ["v1", "v2"]). This can be used for automated discovery of versioned endpoints deployed on this service (e.g. for detecting the location of a /v2/feature-status API endpoint)
schema - (string) Schema version number (defaults to 1.1.0 -- the schema version of this library)
tags - (array) Hapi Route tags for your status API (defaults to ['api', 'health', 'status'])
test.node - (array) A set of functions, each returning a Promise, to run for testing your node health
- each Promise should reject with an error or resolve on success
- message is an optional mixed value (json or string) that will give more info about that status
test.features - (array) A set of functions, each returning a Promise, to run for testing optional features and dependencies. Ideally this would be a query on a file dump of a smoke test that gets run periodically to test each of the API endpoints or features of your service. It could also be a check to memcached for logs of known errors in the system (counter of unhandledException, cached by API path or flow, etc).
- message is an optional mixed value (json or string) that will give more info about that status
usage - (boolean) - show usage/health information (cpu, memory, etc). Default: true
version - (string) - the version of your service (probably from your package.json)
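The "file dump of a smoke test" idea for test.features could look roughly like the sketch below. Everything specific here is an assumption for illustration: the helper name `smokeReportTest`, the report path passed in, and the `{ failures: [...] }` shape of the JSON report.

```javascript
const fs = require('fs')

// Hypothetical feature test factory: reads a JSON report that a periodic
// smoke-test job is assumed to write to `reportPath`.
function smokeReportTest(reportPath) {
  return () => fs.promises.readFile(reportPath, 'utf8')
    .then((raw) => {
      const report = JSON.parse(raw)
      if (Array.isArray(report.failures) && report.failures.length > 0) {
        // rejecting a feature test should flag WARN, not remove the node
        return Promise.reject('smoke test failures: ' + report.failures.join(', '))
      }
      return 'smoke tests passing'
    })
}
```

The returned function reads only a local file, so it stays fast enough to be called on every status request; a missing or unparseable report also rejects, which would surface as a warning.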

Example

const Hapi = require('hapi')
// Hapi Server (hapi v17+ API)
const server = Hapi.server({
    host: 'localhost',
    port: 3192
})

// capture app ENV
const env = process.env.NODE_ENV||'DEV'

// node os (for test example)
const os = require('os')

// local memcached (for test example)
const Memcached = require('memcached')
const memcached = new Memcached('localhost:11211')

// example app-specific module for feature testing
// we have a content engine for keeping the site content up-to-date (see feature tests below)
const content = require('./lib/content')

const pjson = require('./package')

// Register the plugin with custom config
server.register([{
  plugin: require("hapi-and-healthy"),
  options: {
    custom: {
        // let's just say we want to keep an eye on the memcached pool
        memcached: memcached.servers
    },
    env: env,
    name: pjson.name,
    test:{
      // a series of tests that run in parallel
      // if any one of them rejects, the node is considered failed
      // and the API replies with the failure.
      node:[
        // TEST 1: validate the version of this codebase matches release for this ENV
        () => new Promise((resolve, reject) => {
          // check the release version against current codebase.
          // At deploy time, we update memcache with the release version for this env
          // using a deploy script (stored under 'app_version_'+env)
          memcached.get('app_version_'+env,function(err,data){
            // a memcached error fails this test
            if(err) return reject(err)

            if(data!==pjson.version){
              // this codebase does not match our release manifest
              // don't allow it in rotation
              return reject('version mismatch. Expected version is '+data+' but running '+pjson.version)
            }
            // ok, all good on this check
            resolve('matches expected version ('+pjson.version+')')
          })
        })
      ],
      features:[
        () => new Promise((resolve, reject) => {
          // let's say we have a content directory that we use a tool like chef to
          // dump onto the running node from a github repo
          // this is a separate dependency from the node
          // whenever the app loads a new hash of the content
          // (via fs.watch on the .git repo for the content directory)
          // it updates memcached with the new hash for content.
          // Our status page will check memcached against our running app's idea of the current
          // content hash. If it's not a match then this node is out of date
          // and we want to flag it in a WARN state (but not pull the node out of rotation).
          memcached.get('content_hash',function(err, data){
            // console.log('memcached found', err, data);
            if(err){
                return reject('memcached error: '+err)
            }
            if(data!==content.hash){
              // latest memcached version is different from this node's
              // idea of what the content version is
              // which means this node is behind other nodes
              return reject('content has fallen behind other nodes: '+content.hash+' (app) vs '+data+' (memcached)')
            }
            resolve('content matches other nodes')
          })
        })
      ]
    },
    version: pjson.version
  }
}])
.then(() => server.start())

API

The API endpoint is configurable but defaults to /service-status
Additionally, the following query params are allowed:
- v - verbose mode
- h - human friendly mode
GET requests sent with the header If-None-Match: {etag} will return 304 Not Modified and an empty body if the etag (a base64 encoding of the status output minus the published date) matches
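The etag comparison can be pictured with the following sketch. It is not the plugin's actual code: `statusEtag` and `statusCodeFor` are hypothetical names, and the exact serialization the plugin uses before base64 encoding is an assumption.

```javascript
// Illustrative sketch: derive an etag from a status payload, leaving out the
// `published` timestamp so an unchanged status yields the same tag.
function statusEtag(service) {
  const status = Object.assign({}, service.status)
  delete status.published
  const stable = Object.assign({}, service, { status })
  return Buffer.from(JSON.stringify(stable)).toString('base64')
}

// Returns 304 when the client's If-None-Match value matches, else 200.
function statusCodeFor(service, ifNoneMatch) {
  return ifNoneMatch === statusEtag(service) ? 304 : 200
}
```

A monitoring client can store the etag from one response and send it back in If-None-Match on the next poll, so an unchanged status costs only an empty 304 reply.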

Spec

`/service-status`

returns simple health check for LTM (Local Traffic Manager) monitoring.

This route will enforce auth:false since the LTM needs to hit this so frequently and it does not expose sensitive data

If the node fails any of the test functions supplied in options.test.node:

⇒  curl -i -H "Accept: text/plain" http://127.0.0.1:3192/service-status
HTTP/1.1 500 Internal Server Error
content-type: text/plain; charset=utf-8
content-length: 3
cache-control: no-cache
Date: Wed, 03 Sep 2014 23:16:54 GMT
Connection: keep-alive

BAD%

OR, if the node passes all the LTM tests supplied in options.test.node

⇒  curl -i -H "Accept: text/plain" http://127.0.0.1:3192/service-status
HTTP/1.1 200 OK
content-type: text/plain; charset=utf-8
content-length: 4
cache-control: no-cache
accept-ranges: bytes
Date: Wed, 03 Sep 2014 23:16:33 GMT
Connection: keep-alive

GOOD%

OR, if the node passes all the LTM tests supplied in options.test.node but does not pass all of the feature tests supplied in options.test.features (still returns 200 to keep the node in rotation, but flags it in a WARN state)

⇒  curl -i -H "Accept: text/plain" http://127.0.0.1:3192/service-status
HTTP/1.1 200 OK
content-type: text/plain; charset=utf-8
content-length: 4
cache-control: no-cache
accept-ranges: bytes
Date: Wed, 03 Sep 2014 23:16:33 GMT
Connection: keep-alive

WARN%
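The three outcomes above (BAD with 500, GOOD with 200, WARN with 200) can be summarized in a short sketch. This is an illustration of the decision logic as described in this section, not the plugin's implementation; `deriveState` is a hypothetical name, and both arguments are assumed to be arrays of functions returning Promises.

```javascript
// Illustrative sketch of the GOOD / WARN / BAD decision.
async function deriveState(nodeTests, featureTests) {
  try {
    // node tests run in parallel; any rejection marks the node BAD
    await Promise.all(nodeTests.map((test) => test()))
  } catch (err) {
    return { state: 'BAD', code: 500 }
  }
  try {
    // feature tests only downgrade the node to WARN, keeping it in rotation
    await Promise.all(featureTests.map((test) => test()))
  } catch (err) {
    return { state: 'WARN', code: 200 }
  }
  return { state: 'GOOD', code: 200 }
}
```

Keeping the HTTP code at 200 for WARN is what lets the LTM leave the node in rotation while a maintainer still sees the degraded state in the body.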

`/service-status?v`

runs full, verbose suite of health checks and returns machine friendly output

{
  "service": {
    "env": "DEV",
    "id": "98CF189C-36E0-416B-A2ED-90CE36F8D330",
    "name": "my_service",
    "version": "1.0.0",
    "custom": {
      "health": {
        "cpu_load": [
          1.619140625,
          1.732421875,
          1.88818359375
        ],
        "mem_free": 354811904,
        "mem_free_percent": 0.02065277099609375,
        "mem_total": 17179869184,
        "os_uptime": 606723
      }
    },
    "schema": "1.0.0",
    "status": {
      "state": "GOOD",
      "message": [
        "checksum matches manifest",
        "content matches other nodes"
      ],
      "published": "2014-09-24T03:27:59.575Z"
    }
  }
}

`/service-status?v&h`

runs full, verbose suite of health checks and returns human friendly output

{
  "service": {
    "env": "DEV",
    "id": "98CF189C-36E0-416B-A2ED-90CE36F8D330",
    "name": "my_service",
    "version": "1.0.0",
    "custom": {
      "health": {
        "cpu_load": [
          2.263671875,
          2.107421875,
          2.05810546875
        ],
        "mem_free": "464.19 MB",
        "mem_free_percent": "0.03%",
        "mem_total": "17.18 GB",
        "os_uptime": "10 minutes, 7.686 seconds"
      }
    },
    "schema": "1.0.0",
    "status": {
      "state": "GOOD",
      "message": [
        "checksum matches manifest",
        "content matches other nodes"
      ],
      "published": "2014-09-24T03:27:59.575Z"
    }
  }
}

Support

Hey dude! Help me out for a couple of 🍻!