turtletoss: A Python repository from thekad

Overview

This is a framework to perform rolling restarts of distributed databases.

Most distributed databases that require a rolling restart follow some basic principles:

You grab a list of nodes to be restarted from some source of truth
These nodes are sorted in some way, usually
You may execute some pre-cluster-roll action
For each of the nodes to be restarted:
- You may execute some pre-stop action
- You stop the service
- You may execute some post-stop action
- You may execute some custom script
- You may execute some pre-start action
- You start the service
- You may execute some post-start action
You may execute some post-cluster-roll action

Requirements

SSH access to all the target servers where the service is running
sudo access on said servers. This is kind of optional, depending how you're running your services, the default call uses sudo service
Python <http://python.org>
Python Fabric <http://fabfile.org>

Install

Create a virtual environment (I recommend virtualenvwrapper but whatever fits your bill)
pip install -r requirements.txt

General Usage

Each database has its own quirks and peculiarities however the general usage of this tool is as follows:

turtletoss [options] [stop:"<call>"] [start:"<call>"] [commit] [paranoid] <service> [pre-tasks] restart [post-tasks]

Order is important since Fabric evaluates these tasks in the order they are called, explanation follows:

options are Fabric valid options since turtletoss is just a thin wrapper around Fabric
stop:"<call>" will override the default sudo service <foo> stop call to be executed in each of the nodes
start"<call>" will override the default sudo service <foo> start call to be executed in each of the nodes
commit is an optional task that turns on the commit bit for the restart to actually do stuff. If you don't set this then it will basically do a dry run. The default is to run in dry-run mode because that's the safest approach
paranoid is an optional task that pauses the execution after each service has been restarted and waits for confirmation from the operator. This is turned off by default
service is one of the supported databases in the section below. This task populates the hosts/roles lists (and sorts them if necessary) for Fabric to proceed during the restart
pre-tasks are custom tasks per service that you may define to be executed prior to a cluster-wide rolling restart
roll is the actual rolling-restart task, remember that if you didn't set the commit before before to this call it won't actually do anything. Also remember that if you didn't call the service population task the list of hosts will be empty
post-tasks are custom tasks per service that you may define to be executed after a cluster-wide rolling restart

Supported databases

Elasticsearch

The way this service populates its data is by polling a given elasticsearch cluster API and pulling the list of nodes currently in the cluster. There are a few restrictions though:

All nodes in the cluster need to be named. Meaning that the node.name attribute has a unique ID. If you are not naming your nodes this is probably because your environment is dynamic (maybe you're using ephemeral containers or whatnot) so a rolling restart is probably not how you'd do a cluster restart anyway
The way you can SSH and stop/start services in the target hosts is homogeneous

Elasticsearch populates some roles (as understood by Fabric) to make it easy to interact with it:

clients are all the client nodes (node.data: false and node.master: false)
masters are all the stand-by masters (node.master: true and node.data: false and they are not the active master)
data are all data nodes (node.data: true and they are not the active master)
active is the active master node
almost are all nodes but the active master
all are all nodes

By default this interacts with all nodes, if you don't want that you can very easily use -R <list of roles> in the options

The elasticsearch rolling-restart strategy is as follows: