swarm-ansible

Automated Docker Swarm bootstrap using Ansible

Build a production-ready Docker Swarm cluster using Ansible. The goal is to rapidly bootstrap a Docker Swarm cluster, with some essential services, on machines running Debian or Ubuntu.

🚨 Disclaimer:

This repo aims to be a simple bootstrap of a Docker Swarm cluster using default configs, especially for the services in the available compose files. If you would like more configuration options (security in particular), please read each service's documentation and configure it manually.


✅ System Requirements

  • Control node: the machine from which you will run the Ansible commands. This guide was written with Ansible 2.15.1.
  • All swarm nodes (managers and workers) must allow passwordless SSH access. On Oracle Cloud, this can be set up by passing your SSH public key in the Compute Instance configuration; you can also check this DigitalOcean guide to set up SSH key-based authentication on Linux machines.
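You can verify that key-based access works by logging into each node without being prompted for a password (the ubuntu user and the address below are placeholders for your own):

ssh ubuntu@1.2.3.4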

🚀 Getting Started

💻 Hardware we will use

For this example, we will be using the Oracle Cloud Always Free tier. We can instantiate one machine per Fault Domain, ensuring our VMs are not all on the same physical hardware.

For this demo, we are using 3 machines (all of them free):

  • Ampere aarch64 Altra 4-core CPU w/ 24GB RAM
  • AMD x86_64 EPYC 1-core CPU w/ 1GB RAM
  • AMD x86_64 EPYC 1-core CPU w/ 1GB RAM

This gives us a total of 8 threads (the EPYC cores are SMT-enabled, so each counts for 2) and 26 GB of RAM, all for free!

Since Docker Swarm is a distributed container orchestration tool, it distinguishes between manager and worker nodes, which is what gives us fault tolerance. To keep high availability, we will run all 3 nodes as managers. For this to work, add the public IPs of all of them to your DNS, since Traefik (the reverse proxy we will be using) must run on managers to retrieve information about the swarm.

You should maintain an odd number of managers in the swarm to support manager node failures. Having an odd number of managers ensures that during a network partition, there is a higher chance that the quorum remains available to process requests if the network is partitioned into two sets. Keeping the quorum is not guaranteed if you encounter more than two network partitions. With our 3 managers, the quorum is 2, so the cluster keeps working if any single manager fails. See https://docs.docker.com/engine/swarm/admin_guide/#add-manager-nodes-for-fault-tolerance

Our architecture will look something like this (if you decide to bootstrap the included services): [traefik_arch diagram] The DNS will point to our manager nodes (in this case, all of them), and Traefik will be exposed on each of them on ports 80 (HTTP) and 443 (HTTPS).

๐ŸดPreparation

The first thing we need to do after getting our hardware ready is to create a hosts.ini file. You must have at least one manager node.

# The node you will use to initialize the swarm (must be a manager)
[node0]
1.2.3.4

# All nodes' public IPs
[all]
1.2.3.4
1.2.3.5
1.2.3.6

# Manager nodes' public IPs
[managers]
1.2.3.4

# Worker nodes' public IPs
[workers]
1.2.3.5
1.2.3.6

In this demo, we assume the default user is ubuntu, but that can be changed in the header of every file inside the playbooks directory.

---
- name: Setup Oracle VMs
  hosts: all
  become: true
  remote_user: ubuntu # <- This is the user
  ...
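Before creating the cluster, it's worth checking that Ansible can actually reach every node using the inventory and user configured above; each host should answer pong:

ansible all -i hosts/hosts.ini -m ping -u ubuntu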

๐Ÿณ Creating the Cluster

After creating your hosts file and configuring the default remote user, we can start running the playbooks.

To set up the cluster and install dependencies:

NOTE: this command DISABLES the iptables firewall. Do NOT host services on bare metal after this.

ansible-playbook -i hosts/hosts.ini playbooks/setup.yaml

Then, to initialize the cluster:

ansible-playbook -i hosts/hosts.ini playbooks/bootstrap_swarm.yaml

The playbook already takes care of distributing the different swarm join tokens (managers and workers get different ones), so there is no need for extra configuration.
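For the curious, here is a minimal sketch of how a playbook can propagate join tokens. This is an illustration of the mechanism bootstrap_swarm.yaml automates, not the repo's exact implementation:

---
# Illustration only: bootstrap_swarm.yaml already automates the equivalent of this.
- name: Read the worker join token on the first manager
  hosts: node0
  become: true
  tasks:
    - name: Get worker join token
      ansible.builtin.command: docker swarm join-token -q worker
      register: worker_token

- name: Join the worker nodes to the swarm
  hosts: workers
  become: true
  tasks:
    - name: Join the swarm as a worker (2377 is the swarm management port)
      ansible.builtin.command: >
        docker swarm join
        --token {{ hostvars[groups['node0'][0]].worker_token.stdout }}
        {{ groups['node0'][0] }}:2377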

If anything goes wrong or you just want to dismantle the swarm, simply run:

ansible-playbook -i hosts/hosts.ini playbooks/dismantle_swarm.yaml

🚚 Base Services

If you want to bootstrap some base services right away as well, you can use this section to do so. The services that will be installed are:

  • Traefik - reverse proxy to access the cluster services
  • Portainer - container orchestration web UI
  • Registry - private container registry for your Docker images. Note that we are going to be using simple HTTP basic auth; check this for other options
  • Swarmpit - simple hardware monitoring solution for the cluster (it also does simpler container orchestration and is mobile friendly!)

To bootstrap these services, we'll need to do a tiny bit more configuring. To use Traefik, we'll need a domain name, and since in this example we use it to create SSL certificates, we also need a maintainer email. To configure them, go to bootstrap_essential_services.yaml and check the vars section:

---
- name: Bootstrap Essentials
  hosts: node0
  remote_user: ubuntu
  vars:
    domain_name: "cloud.example.com" # <- your domain
    maintainer_email: "my.email@email.com" # <- your email
    basic_auth_password: "adminPass" # <- registry and traefik http password

After configuring it, simply run:

ansible-playbook -i hosts/hosts.ini playbooks/bootstrap_essential_services.yaml

Traefik will take a few moments to generate the TLS certificates, but after that you can access those services via their subdomains (this assumes your DNS, for example a wildcard record for *.cloud.example.com, points at the manager IPs). For example:

Portainer: portainer.cloud.example.com

Remember that in the case of Portainer, you have a limited amount of time to access it and create the admin user; if you don't, you'll need to restart the service (on a manager node: docker service update --force portainer_portainer).
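Once the playbook finishes, a quick way to check that all services came up is to list them from any manager node:

docker service ls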

๐Ÿ—’๏ธ NOTEs:

  • Traefik's HTTP basic-auth credentials default to admin:adminPass; to change them, take a look at traefik/create_pass.sh and traefik/docker-compose.yaml (see the example after these notes).
  • The Portainer version we are running is the Community Edition (CE). You can run the Enterprise Edition (EE) for free for up to 3 nodes; it adds some pretty cool functionality, e.g. updating services automatically from GitHub Actions (via a simple POST request), access to the private registry, and more.
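For reference, Traefik's basic-auth middleware expects credentials in htpasswd format, so create_pass.sh presumably generates something along these lines (htpasswd ships with the apache2-utils package; the username and password here are placeholders). If you paste the resulting hash into a compose file, remember that every $ must be escaped as $$:

htpasswd -nb admin MyNewPassword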

👟 Running your own services

Since we have a small cluster with very limited resources, it is pretty important to set resource limits in the compose files. To create the routes in Traefik, you must add these labels to the deploy section of the compose file (a complete sketch combining the labels with resource limits follows the variable list below):

deploy:
  labels:
    - "traefik.enable=true"
    - "traefik.http.routers.${MY_SERVICE_NAME}.rule=Host(`${SUBDOMAIN_TO_REDIRECT}.${DOMAIN_NAME}`)"
    - "traefik.http.services.${MY_SERVICE_NAME}.loadbalancer.server.port=${TARGET_PORT}"
    - "traefik.http.routers.${MY_SERVICE_NAME}.entrypoints=websecure"
    - "traefik.http.routers.${MY_SERVICE_NAME}.tls=true"
    - "traefik.http.routers.${MY_SERVICE_NAME}.tls.certresolver=leresolver"

In this YAML snippet, we have 4 variables:

  • MY_SERVICE_NAME: The name of the service in the compose file (e.g. "app")
  • SUBDOMAIN_TO_REDIRECT: The subdomain used to redirect to that service
  • DOMAIN_NAME: Your domain name (can be used in conjunction with the subdomain to redirect from a whole different domain)
  • TARGET_PORT: The port where the service is listening inside its container
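Putting it all together, here is a minimal hypothetical compose file for a service named app that sets both the Traefik labels and resource limits. The whoami image, the limit values, the subdomain, and the traefik-public network name are illustrative assumptions; adjust them to match your Traefik setup:

version: "3.8"

services:
  app:
    image: traefik/whoami  # tiny demo HTTP server listening on port 80
    networks:
      - traefik-public
    deploy:
      resources:
        limits:
          cpus: "0.25"  # cap the service at a quarter of a CPU core
          memory: 256M  # and at 256 MB of RAM
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.app.rule=Host(`app.cloud.example.com`)"
        - "traefik.http.services.app.loadbalancer.server.port=80"
        - "traefik.http.routers.app.entrypoints=websecure"
        - "traefik.http.routers.app.tls=true"
        - "traefik.http.routers.app.tls.certresolver=leresolver"

networks:
  traefik-public:
    external: true  # assumed to be the overlay network Traefik is attached to

Deploy it from a manager node with docker stack deploy -c docker-compose.yaml app, and Traefik should pick up the new router automatically.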

๐Ÿช Thanks

This repo was somewhat inspired by techno-tim's k3s-ansible. Check him out!