
DC/OS Resilience Automated Xenodiagnosis tool


DRAX

This is DRAX, the DC/OS Resilience Automated Xenodiagnosis tool. It helps to test DC/OS deployments by applying a Chaos Monkey-inspired, proactive and invasive testing approach.

Well, actually DRAX is a reverse acronym inspired by the Guardians of the Galaxy character Drax the Destroyer.

You might have heard of Netflix's Chaos Monkey or its containerized variant. Maybe you've seen a gaming version of it or stumbled upon a lower-level species. In any case, I assume you're somewhat familiar with chaos-based resilience testing.

DRAX is a DC/OS-specific resilience testing tool that works mainly at the task level. Future work may extend this to the node level and up to the cluster level.

Installation and usage

Note that DRAX assumes a running DC/OS 1.9 cluster.

Production

Launch DRAX using the DC/OS CLI via the Marathon app spec provided:

$ dcos marathon app add marathon-drax.json

Now you can do the following (replace the hostname with the public node of your own cluster):

$ http http://ec2-52-38-188-110.us-west-2.compute.amazonaws.com:7777/stats
HTTP/1.1 200 OK
Content-Length: 10
Content-Type: application/javascript
Date: Mon, 13 Jun 2016 14:39:11 GMT

{"gone":0}

If you launched DRAX via Marathon, you can also trigger a POST to /rampage on a schedule by deploying a DC/OS job. The example job triggers a destruction every business hour, Monday through Friday:

$ dcos job add metronome-drax.json

Testing and development

Get DRAX and build from source:

$ go get github.com/dcos-labs/drax
$ go build
$ MARATHON_URL=http://localhost:8080 ./drax
INFO[0000] This is DRAX in version 0.4.0                 main=init
INFO[0000] Listening on port 7777                        main=init
INFO[0000] On destruction level 0                        main=init
INFO[0000] Using Marathon at  http://localhost:8080      main=init
INFO[0000] I will destroy 2 tasks on a rampage           main=init

And in a different terminal session:

$ http http://localhost:7777/stats
HTTP/1.1 200 OK
Content-Length: 10
Content-Type: application/javascript
Date: Mon, 13 Jun 2016 14:39:11 GMT

{"gone":0}

For Go development, be aware that the dependencies are not explicitly vendored at the moment.

Configuration

Note that the following environment variables are pre-set in the Marathon app spec; you can override them there.

Number of target tasks

To specify how many tasks DRAX is supposed to destroy in one rampage, use NUM_TARGETS. For example, NUM_TARGETS=5 drax means that up to 5 tasks will be destroyed; if fewer than 5 tasks are running overall, fewer are destroyed.
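
The "up to" rule can be sketched in Go as follows. This is an illustration only, not DRAX's actual code: the function names are hypothetical, and the fallback of 2 is an assumption taken from the startup log line "I will destroy 2 tasks on a rampage" shown earlier. How DRAX itself treats an empty or invalid NUM_TARGETS is not stated here.

```go
package main

import (
	"os"
	"strconv"
)

// defaultTargets is an assumed fallback (hypothetical); the startup log in
// this README reports 2 tasks per rampage when NUM_TARGETS is not set.
const defaultTargets = 2

// parseTargets converts a raw NUM_TARGETS value into a positive count.
// Empty, non-numeric, or non-positive input yields defaultTargets.
func parseTargets(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return defaultTargets
	}
	return n
}

// targetsFromEnv reads NUM_TARGETS from the process environment.
func targetsFromEnv() int {
	return parseTargets(os.Getenv("NUM_TARGETS"))
}

// rampageSize caps the requested count at the number of running tasks,
// so a rampage never asks for more tasks than exist.
func rampageSize(requested, running int) int {
	if running < requested {
		return running
	}
	return requested
}
```

With NUM_TARGETS=5 and only 3 running tasks, rampageSize(targetsFromEnv(), 3) yields 3.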

Log level

To influence the log level, use the LOG_LEVEL environment variable. For example, LOG_LEVEL=DEBUG drax gives you fine-grained log messages. The default is INFO.
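
As a rough sketch of how an environment-driven log level with an INFO default can be wired up in Go, the following uses the standard library's log/slog package (Go 1.21 or later). This is a stand-in for illustration: it is not the logging library DRAX uses, and parseLevel and newLogger are hypothetical names.

```go
package main

import (
	"log/slog"
	"os"
)

// parseLevel maps a LOG_LEVEL value such as "DEBUG" to a slog level.
// Empty or unrecognized values fall back to INFO, the documented default.
func parseLevel(raw string) slog.Level {
	var lvl slog.Level
	if err := lvl.UnmarshalText([]byte(raw)); err != nil {
		return slog.LevelInfo
	}
	return lvl
}

// newLogger builds a text logger on stderr that drops messages below the
// level named by the LOG_LEVEL environment variable.
func newLogger() *slog.Logger {
	opts := &slog.HandlerOptions{Level: parseLevel(os.Getenv("LOG_LEVEL"))}
	return slog.New(slog.NewTextHandler(os.Stderr, opts))
}
```

A logger built this way with LOG_LEVEL=DEBUG emits Debug calls; without the variable, only INFO and above are written.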

API

/health [GET]

Will return an HTTP 200 code and the body "I am Groot" if DRAX is healthy.
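
A minimal Go sketch of a client-side health probe, assuming DRAX is reachable at a base URL such as http://localhost:7777. The function names are hypothetical, and matching on the "I am Groot" text with a substring check is a choice made for this example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// isHealthy applies the rule described above: status 200 and a body
// containing "I am Groot".
func isHealthy(status int, body string) bool {
	return status == http.StatusOK && strings.Contains(body, "I am Groot")
}

// checkHealth performs GET <baseURL>/health and returns nil when the
// response passes isHealthy, or an error describing what came back.
func checkHealth(baseURL string) error {
	resp, err := http.Get(baseURL + "/health")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	if !isHealthy(resp.StatusCode, string(body)) {
		return fmt.Errorf("drax unhealthy: status %d, body %q", resp.StatusCode, body)
	}
	return nil
}
```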

/stats [GET]

Will return runtime statistics, such as the number of killed containers or apps, counted since DRAX was started.

$ http http://localhost:7777/stats
HTTP/1.1 200 OK
Content-Length: 10
Content-Type: application/javascript
Date: Mon, 13 Jun 2016 14:39:11 GMT

{"gone":2}
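
For programmatic access, the payload can be decoded in Go along these lines. This is a sketch: the Stats type and function names are hypothetical, and only the "gone" field shown in the sample response is modeled.

```go
package main

import (
	"encoding/json"
	"io"
	"net/http"
)

// Stats mirrors the /stats payload shown above, for example {"gone":2}.
type Stats struct {
	Gone int `json:"gone"`
}

// decodeStats parses a raw /stats response body.
func decodeStats(payload []byte) (Stats, error) {
	var s Stats
	err := json.Unmarshal(payload, &s)
	return s, err
}

// fetchStats performs GET <baseURL>/stats and decodes the result.
func fetchStats(baseURL string) (Stats, error) {
	resp, err := http.Get(baseURL + "/stats")
	if err != nil {
		return Stats{}, err
	}
	defer resp.Body.Close()

	payload, err := io.ReadAll(resp.Body)
	if err != nil {
		return Stats{}, err
	}
	return decodeStats(payload)
}
```

Calling fetchStats("http://localhost:7777") against a local DRAX returns the current counter in the Gone field.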

/rampage [POST]

Will trigger a destruction. Invoke with:

$ http POST localhost:7777/rampage
HTTP/1.1 200 OK
Content-Length: 121
Content-Type: application/javascript
Date: Mon, 13 Jun 2016 12:15:19 GMT

{"success":true,"goners":["webserver.0fde0035-315f-11e6-aad0-1e9bbbc1653f","dummy.11a7c3bb-315f-11e6-aad0-1e9bbbc1653f"]}
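
The same call can be made from Go. The sketch below is illustrative: the type and function names are hypothetical, the struct models only the two fields visible in the sample response, and sending an empty body with an application/json content type is an assumption about what the endpoint accepts.

```go
package main

import (
	"encoding/json"
	"io"
	"net/http"
)

// RampageResult mirrors the /rampage response shown above.
type RampageResult struct {
	Success bool     `json:"success"`
	Goners  []string `json:"goners"` // IDs of the destroyed tasks
}

// decodeRampage parses a raw /rampage response body.
func decodeRampage(payload []byte) (RampageResult, error) {
	var r RampageResult
	err := json.Unmarshal(payload, &r)
	return r, err
}

// triggerRampage performs POST <baseURL>/rampage with an empty body and
// decodes the result. The content type is an assumption.
func triggerRampage(baseURL string) (RampageResult, error) {
	resp, err := http.Post(baseURL+"/rampage", "application/json", nil)
	if err != nil {
		return RampageResult{}, err
	}
	defer resp.Body.Close()

	payload, err := io.ReadAll(resp.Body)
	if err != nil {
		return RampageResult{}, err
	}
	return decodeRampage(payload)
}
```

Be aware that triggerRampage really destroys tasks, so point it only at a cluster you intend to test.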