A library that supports pull-based health checks of Onyx peers

onyx-peer-http-query

Onyx Peer HTTP Query provides an inbuilt HTTP server to service replica and cluster queries that can be directed at Onyx nodes. One use case is to provide a health check for your Onyx nodes, as it becomes easy to determine what a node's view of the cluster is.

HTTP Server

This library exposes an HTTP server to service replica and cluster queries across languages.

To use it, add onyx-peer-http-query to your dependencies:

[org.onyxplatform/onyx-peer-http-query "0.14.5.0"]

Require onyx.http-query in your peer bootup namespace:

(:require [onyx.http-query])

And add the following lines to Onyx's peer-config:

 :onyx.query/server? true
 :onyx.query.server/port 8080

In addition, you can optionally set the IP address to listen on:

 :onyx.query.server/ip "127.0.0.1"

JMX selectors can, and should, be whitelisted via the peer-config, e.g.:

 :onyx.query.server/metrics-selectors ["org.onyxplatform:*" "com.amazonaws.management:*"]

The default behaviour is

 :onyx.query.server/metrics-selectors ["*:*"]

Individual metrics tags can be blacklisted via the peer-config:

 :onyx.query.server/metrics-blacklist [#"blacklisted_tag1" #"blacklistregex.*"]
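
Putting the keys above together, a peer-config with the query server enabled might look like the following sketch. `base-peer-config` is a hypothetical placeholder for whatever settings your peers already use; only the `:onyx.query...` keys come from this README.

```clojure
;; Hypothetical placeholder: your existing peer-config entries go here.
(def base-peer-config
  {})

;; The query server settings described above, merged into the existing config.
(def peer-config
  (merge base-peer-config
         {:onyx.query/server? true
          :onyx.query.server/ip "127.0.0.1"
          :onyx.query.server/port 8080
          ;; Only expose Onyx's own JMX beans.
          :onyx.query.server/metrics-selectors ["org.onyxplatform:*"]
          ;; Drop any metrics tag matching one of these regexes.
          :onyx.query.server/metrics-blacklist [#"blacklisted_tag1"]}))
```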

Accessing the HTTP server

Then query it to get a view of that node's understanding of the cluster:

$ http --json http://localhost:8080/replica/peers
HTTP/1.1 200 OK
Content-Length: 197
Content-Type: application/json
Date: Tue, 23 Feb 2016 03:35:08 GMT
Server: Jetty(9.2.10.v20150310)

{
    "as-of-entry": 12,
    "as-of-timestamp": 1456108757818,
    "result": [
        "e52df81d-38c9-44e6-9e3d-177d3e83292b",
        "fd4725f9-3429-49eb-840d-6c3e29cecc41",
        "fc933dda-7260-4547-93fc-241a02ca599a"
    ],
    "status": "success"
}

Note as-of-entry and as-of-timestamp. By comparing as-of-entry between nodes, you can discover whether a node is lagging behind the cluster.
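
As a rough illustration of that comparison, the sketch below takes already-parsed JSON responses from several nodes and reports how many log entries each node is behind the most advanced one. The function name `entry-lag` and the node labels are made up for this example; fetching and parsing the responses is left to your HTTP client of choice.

```clojure
(defn entry-lag
  "Given a map of node label -> parsed response (a map with string keys,
   as in the JSON above), returns a map of node label -> number of log
   entries that node is behind the most advanced node."
  [responses]
  (when (seq responses)
    (let [entries (into {}
                        (map (fn [[node resp]] [node (get resp "as-of-entry")]))
                        responses)
          newest  (apply max (vals entries))]
      (into {}
            (map (fn [[node entry]] [node (- newest entry)]))
            entries))))

(entry-lag {"node-a" {"as-of-entry" 12}
            "node-b" {"as-of-entry" 9}})
;; => {"node-a" 0, "node-b" 3}
```

A node that reports a consistently large lag is a candidate for closer inspection; what counts as "large" depends on how busy your cluster's log is.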

Further API endpoints are described in the Endpoints section below.

Endpoints

The Replica Query Server has a number of endpoints for accessing the information about a running Onyx cluster. Below we display the HTTP method, the URI, the docstring for the route, and any associated parameters that it takes in its query string.

Summary

  • /health
  • /peergroup/heartbeat
  • /peergroup/stuckpeers
  • /peergroup/health
  • /network/media-driver
  • /network/media-driver/active
  • /metrics
  • /state
  • /job/catalog
  • /job/flow-conditions
  • /job/lifecycles
  • /job/task
  • /job/triggers
  • /job/windows
  • /job/workflow
  • /job/exception
  • /replica
  • /replica/completed-jobs
  • /replica/job-allocations
  • /replica/job-scheduler
  • /replica/jobs
  • /replica/killed-jobs
  • /replica/peer-site
  • /replica/peer-state
  • /replica/peers
  • /replica/task-allocations
  • /replica/allocation-version
  • /replica/task-scheduler
  • /replica/tasks


Route

[:get] /health

Query Params Schema

{"threshold" java.lang.Long}

Docstring

A single health check call to check whether the following statuses are healthy: /network/media-driver/active, /peergroup/heartbeat, and /peergroup/stuckpeers. Considers the peer group dead if timeout is greater than ?threshold=VALUE. Returns status 200 if healthy, 500 if unhealthy. Use this route for failure monitoring, automatic rebooting, etc.
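
A monitoring script could call this route along the lines of the sketch below. It uses only the JDK's `HttpURLConnection`; the function names are invented for this example, and the assumption that the threshold is given in milliseconds is inferred from the heartbeat routes rather than stated by the docstring.

```clojure
(import '(java.net HttpURLConnection URL))

(defn health-url
  "Builds the /health URL. threshold-ms is assumed to be in milliseconds."
  [host port threshold-ms]
  (str "http://" host ":" port "/health?threshold=" threshold-ms))

(defn healthy-status?
  "The route returns 200 when healthy and 500 when unhealthy."
  [status]
  (= 200 status))

(defn check-health
  "Returns true if the node reports healthy, false if it reports
   unhealthy or cannot be reached within the timeouts."
  [host port threshold-ms]
  (try
    (let [^HttpURLConnection conn
          (.openConnection (URL. (health-url host port threshold-ms)))]
      (.setRequestMethod conn "GET")
      (.setConnectTimeout conn 2000)
      (.setReadTimeout conn 2000)
      (healthy-status? (.getResponseCode conn)))
    (catch java.io.IOException _
      false)))
```

Treating an unreachable server as unhealthy is a design choice of this sketch; a supervisor that restarts peers may prefer to distinguish the two cases.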


Route

[:get] /peergroup/heartbeat

Query Params Schema

{}

Docstring

Returns the number of milliseconds since the peer group last heartbeated.


Route

[:get] /peergroup/stuckpeers

Query Params Schema

{}

Docstring

Returns the number of milliseconds that a peer has been stuck while being shut down, indicating a stuck thread.


Route

[:get] /peergroup/health

Query Params Schema

{}

Docstring

A health check call to check whether the peer group has heartbeated more recently than a threshold. Considers the peer group dead if timeout is greater than ?threshold=VALUE. Returns status 200 if healthy, 500 if unhealthy. Use this route for failure monitoring, automatic rebooting, etc.

Route

[:get] /network/media-driver

Query Params Schema

{}

Docstring

Returns a map describing the media driver status. e.g.

{:active true, 
 :driver-timeout-ms 10000, 
 :log "INFO: Aeron directory /var/folders/c5/2t4q99_53mz_c1h9hk12gn7h0000gn/T/aeron-lucas exists
       INFO: Aeron CnC file /var/folders/c5/2t4q99_53mz_c1h9hk12gn7h0000gn/T/aeron-lucas/cnc.dat exists
       INFO: Aeron toDriver consumer heartbeat is 687 ms old"}

Route

[:get] /network/media-driver/active

Query Params Schema

{}

Docstring

Returns a boolean for whether the media driver is active and has heartbeated within driver-timeout-ms milliseconds.


Route

[:get] /metrics

Query Params Schema

{}

Docstring

Returns any numeric JMX metrics contained in this VM, converted to Prometheus tags.


Route

[:get] /state

Query Params Schema

{"job-id" java.lang.String
 "task-id" java.lang.String
 "slot-id" java.lang.Long
 "window-id" java.lang.String
 "allocation-version" java.lang.Long
 ;; optional
 "start-time" java.lang.Long
 ;; optional
 "end-time" java.lang.Long
 ;; optional
 "groups" [Any]}

Docstring

Retrieves a task's window state for a particular job. You must supply the allocation-version for the job. The allocation version can be looked up via the /replica/allocation-version endpoint, or by subscribing to the log and looking up [:allocation-version job-id].

If groups is supplied, only the state for the groups supplied will be retrieved.
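
Because this route takes several parameters, it can help to build the query string programmatically. The following is a minimal sketch: `query-string` and `state-url` are hypothetical helpers, the parameter values are made up, and the `groups` parameter is left out because its wire encoding is not described here.

```clojure
(require '[clojure.string :as str])
(import '(java.net URLEncoder))

(defn query-string
  "Turns a sequence of [name value] pairs into a URL query string,
   skipping pairs whose value is nil."
  [pairs]
  (->> pairs
       (remove (fn [[_ v]] (nil? v)))
       (map (fn [[k v]]
              (str (URLEncoder/encode (str k) "UTF-8")
                   "="
                   (URLEncoder/encode (str v) "UTF-8"))))
       (str/join "&")))

(defn state-url
  "Builds a /state URL. start-time and end-time are optional."
  [base {:keys [job-id task-id slot-id window-id allocation-version
                start-time end-time]}]
  (str base "/state?"
       (query-string [["job-id" job-id]
                      ["task-id" task-id]
                      ["slot-id" slot-id]
                      ["window-id" window-id]
                      ["allocation-version" allocation-version]
                      ["start-time" start-time]
                      ["end-time" end-time]])))

(state-url "http://localhost:8080"
           {:job-id "j1" :task-id "t1" :slot-id 0
            :window-id "w1" :allocation-version 5})
;; => "http://localhost:8080/state?job-id=j1&task-id=t1&slot-id=0&window-id=w1&allocation-version=5"
```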


Route

[:get] /job/catalog

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns catalog for this job.


Route

[:get] /job/flow-conditions

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns flow conditions for this job.


Route

[:get] /job/lifecycles

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns lifecycles for this job.


Route

[:get] /job/task

Query Params Schema

{"job-id" java.lang.String, "task-id" java.lang.String}

Docstring

Given a job id and task id, returns catalog entry for this task.


Route

[:get] /job/triggers

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns triggers for this job.


Route

[:get] /job/windows

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns windows for this job.


Route

[:get] /job/workflow

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns workflow for this job.


Route

[:get] /job/exception

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns the exception that killed this job, if one exists.


Route

[:get] /replica

Query Params Schema

{}

Docstring

Dereferences the replica as an immutable value.


Route

[:get] /replica/completed-jobs

Query Params Schema

{}

Docstring

Lists all the job ids that have been completed.


Route

[:get] /replica/job-allocations

Query Params Schema

{}

Docstring

Returns a map of job id -> task id -> peer ids, denoting which peers are assigned to which tasks.
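
To illustrate the nested shape of that result, the sketch below counts how many peers are assigned to each task. `peers-per-task` is a hypothetical helper, and the ids in the usage example are invented; it assumes the result has already been parsed into nested Clojure maps.

```clojure
(defn peers-per-task
  "Given a map of job id -> task id -> peer ids, returns a map of
   job id -> task id -> number of assigned peers."
  [job-allocations]
  (into {}
        (for [[job-id tasks] job-allocations]
          [job-id (into {}
                        (for [[task-id peer-ids] tasks]
                          [task-id (count peer-ids)]))])))

(peers-per-task {"job-1" {"in"  ["peer-a"]
                          "inc" ["peer-b" "peer-c"]}})
;; => {"job-1" {"in" 1, "inc" 2}}
```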


Route

[:get] /replica/job-scheduler

Query Params Schema

{}

Docstring

Returns the job scheduler for this tenancy of the cluster.


Route

[:get] /replica/jobs

Query Params Schema

{}

Docstring

Lists all non-killed, non-completed job ids.


Route

[:get] /replica/killed-jobs

Query Params Schema

{}

Docstring

Lists all the job ids that have been killed.


Route

[:get] /replica/peer-site

Query Params Schema

{"peer-id" java.lang.String}

Docstring

Given a peer id, returns the Aeron hostname and port that this peer advertises to the rest of the cluster.


Route

[:get] /replica/peer-state

Query Params Schema

{"peer-id" java.lang.String}

Docstring

Given a peer id, returns its current execution state (e.g. :idle, :active, etc).


Route

[:get] /replica/peers

Query Params Schema

{}

Docstring

Lists all the peer ids.


Route

[:get] /replica/task-allocations

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns a map of task id -> peer ids, denoting which peers are assigned to which tasks for this job only.


Route

[:get] /replica/allocation-version

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns the replica-version at which the job last rescheduled. This is important because the replica-version forms part of the vector clock that is used to determine ordering/validity of messages in the cluster, along with the barrier epoch.


Route

[:get] /replica/task-scheduler

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns the task scheduler for this job.


Route

[:get] /replica/tasks

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns all the task ids for this job.

License

Copyright © 2016 Distributed Masonry Inc.

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.