These Fabric tasks are not meant to be stateful, like Ansible or Salt. It is assumed that you have spawned a brand new machine when you run a task, or a set of tasks together. No state checking is done.
- Install vagrant, and virtualbox too if it's not pulled in by vagrant.
- Install and configure `ssh-agent` (this isn't necessary if you have an unsecured, vulnerable, password-less ssh key); see the sketch after this list.
- Decrypt the ssh key you use for Github: `ssh-add` or `ssh-add ~/.ssh/mykey.pem`
- Add the below section to your `~/.ssh/config`:

```
Host localhost:2222
    ForwardAgent yes
Host 127.0.0.1:2222
    ForwardAgent yes
```
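If `ssh-agent` isn't already running in your shell, a minimal way to start it and load your key (the key path is just an example) is:

```
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/mykey.pem
```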
Run a task on a host. We don't maintain a host list because they are all expendable, and Cloud at Cost takes forever to spawn servers. Think immutable-architecture. You should have spawned a brand-spanking new machine to replace another, instead of updating it in-place.
Use this to quickly view a list of all tasks.
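For reference, listing and running tasks with Fabric looks roughly like this; the IP below is a placeholder:

```
fab --list                              # show all available tasks
fab mariadb.start -H 203.0.113.10 -I    # run one task against a single host, prompting for the password
```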
For recovering stopped nodes, read both [Resetting the Quorum] and [Galera Replication - How to recover a PXC Cluster].
Read the docs on wsrep_cluster_address to understand how it works before you screw up the cluster.
fab mariadb.build_cluster:<HOST_A>,<HOST_B>,<HOST_C> -I
fab mariadb.add_admin:lgunsch -H <IP_A> -I
or
fab mariadb.install:'<HOST_A>\,<HOST_B>' -IH <HOST_C>
fab mariadb.start -IH <HOST_C> # restart after install is required, but install does not make assumptions
There must always be an odd number of nodes. If there must be an even number of nodes, at least have one machine with an arbitrator to make the voting odd. Another option is to specify some nodes with a higher weight.
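The arbitrator is Galera's garbd daemon, which votes in quorum decisions without storing any data. Weighting, as a sketch, goes through the provider options; a hypothetical my.cnf fragment giving one node two votes (the default pc.weight is 1) would be:

```
[mysqld]
# Assumption: this node should count for two votes in quorum calculations.
wsrep_provider_options="pc.weight=2"
```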
Galera keeps state files around that apt will not remove, so even if you `apt-get purge mariadb-server galera-3` and re-install it, it will try and re-join its cluster to recover. If weird stuff is happening, build a fresh set of machines for a new cluster.
Important: If the whole cluster is shut down, you must start it back up by starting the node with the most advanced node state ID. Individual nodes will refuse to start until that node has first been started. If you cannot, because it's gone, read the below "Cluster is shutdown" section, as well as the "manual bootstrap" section of [Resetting the Quorum].
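To figure out which node is the most advanced, one approach (assuming the default `/var/lib/mysql` data directory) is to compare the recorded positions on each node:

```
cat /var/lib/mysql/grastate.dat   # the node with the highest seqno is the most advanced
mysqld_safe --wsrep-recover       # if seqno is -1, this recovers the position into the error log
```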
For the definition of primary, which is mentioned in the logs a lot, see:
This is a summary of [Restarting the Cluster] and the "Manual Bootstrap" section of [Resetting the Quorum].
For example, suppose a whole data-center fails (squinting at you, Cloud@Cost) and it was the primary component (which is bad practice). Once you spin up new servers to replace the ones lost, it will take a few steps to get the cluster going again.
- Stop all MariaDB instances across the cluster, new ones and old ones. MariaDB seems to have trouble with `/etc/init.d/mysql stop`, so you may have to also kill processes manually. Note: Ubuntu starts daemons without asking, so you will have to kill MariaDB right after a fresh install.
- On one node that is known to have the complete data set (likely the remaining non-failed server), edit `/var/lib/mysql/grastate.dat`, changing `safe_to_bootstrap` to `1` (see the example after this list). `safe_to_bootstrap` is a protection which we may have to disable, since there are no remaining usable servers from the primary component.
- Start that node with `service mysql start --wsrep-new-cluster` to create a new cluster with its current saved state.
- Now start the remaining MariaDB instances to join the new cluster.
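For illustration, grastate.dat is a small text file; after the edit it might look like this (the uuid and seqno are examples and will be whatever your node recorded):

```
# GALERA saved state
version: 2.1
uuid:    f3b9a6c0-0000-0000-0000-000000000000
seqno:   1234
safe_to_bootstrap: 1
```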
Don't forget to restart HAProxy since it doesn't re-resolve DNS.
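On the Ubuntu hosts described here, that would presumably be:

```
service haproxy restart
```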
Currently set up as a simple replicated volume between the 3 nodes, without any distributed volumes.
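As a sketch, such a volume could be created like this; the volume name and brick paths are placeholders (node-1.helix-cloud.ca is the only hostname that appears elsewhere in these docs):

```
# Replica-3 volume: every file is stored on all three bricks (no distribution).
gluster volume create gv0 replica 3 \
    node-1.helix-cloud.ca:/srv/gluster/brick \
    node-2.helix-cloud.ca:/srv/gluster/brick \
    node-3.helix-cloud.ca:/srv/gluster/brick
gluster volume start gv0
```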
Warning: Gluster authenticates its peer nodes' hostnames via reverse DNS. You must have reverse DNS set up before adding peer nodes. If you forgot to set the reverse DNS, try and follow the "When you replace a node" steps.
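To verify the reverse DNS before probing peers, a quick check (the IP is a placeholder) is:

```
dig -x 198.51.100.10 +short   # should print the peer's hostname
```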
If you are replacing a failed node, but reusing a hostname, do this:
- Stop GlusterFS: `service glusterfs-server stop`.
- Grab the UUID of the failed node from one of the good peers: `gluster peer status`.
- Remove everything except for `glusterd.info` in `/var/lib/glusterd/`: `rm -Rf bitd/ geo-replication/ glustershd/ groups/ hooks/ nfs/ options peers/ quotad/ scrub/ snaps/ ss_brick/ vols/`.
- Update `/var/lib/glusterd/glusterd.info` to have the correct UUID, as taken from a peer earlier (see the example below this list).
- Start GlusterFS: `service glusterfs-server start`
- Probe a good gluster peer: `gluster peer probe node-1.helix-cloud.ca`
- Possibly re-start the `glusterfs-server` service until it becomes "Peer in Cluster"
- Make sure all peers have a "proper looking" `gluster pool list`.
- From the good node: `gluster volume sync HOSTNAME`.
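For illustration, `/var/lib/glusterd/glusterd.info` is a small key-value file; after the edit it might look like this, with the UUID taken from `gluster peer status` on a good node (the values shown are placeholders):

```
UUID=5be8603b-18d4-4b76-9f93-2f1a62a0d3a1
operating-version=30712
```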
Add release entries to the `debian/changelog` file using the `dch` tool.
Use the `dch` debian tool to add the entries in the correct format in the spun-up ubuntu build box. Commit the entries into git, bump and tag your version too if you wish, and finally don't forget to `git push`.
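A typical `dch` session might look like this; the version number and message are placeholders:

```
# Open a new changelog entry for the given version, then finalize it for release.
dch -v 1.2.3-1 "Describe what changed in this release"
dch --release ""
```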
Creates a cleanly built `.deb` package ready for installation.
- If git clone asks for your password while configuring the build box, then you don't have `ssh-agent` running with your key added using `ssh-add`.
- If you have had the changelog box running a long time, it might not be destroyed upon logging out. You can kill it manually with `vagrant destroy -f`.