k3s-io/k3s

Restore multiple (master) servers from etcd snapshot

StarpTech opened this issue · 6 comments

Is your feature request related to a problem? Please describe.
Yes, the current documentation only describes how to restore from a single master server setup.

Describe the solution you'd like
It should be possible to restore a snapshot and distribute it to all other servers as described in (rke) https://rancher.com/docs/rke/latest/en/etcd-snapshots/#how-restoring-from-a-snapshot-works

Describe alternatives you've considered
Documentation and automation of how to do it safely with the current implementation. My instructions were as follows:

  1. Stop the master server.
sudo systemctl stop k3s
  2. Restore the master server from a snapshot.
./k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=<PATH-TO-SNAPSHOT>
  3. Connect to each of the other servers and run:
sudo systemctl stop k3s
rm -rf /var/lib/rancher/k3s/data
sudo systemctl start k3s
  4. Cluster is healthy

Additional information

Cluster was installed with https://github.com/StarpTech/k-andy

We don't have a central coordination tool like RKE, and no plans to create one. After restoring the snapshot to the first server, you should remove the database files on the other servers and rejoin them to the cluster.

Hi @brandond, so the workaround is correct? What's the long-term strategy for handling restore scenarios in large clusters?

Long term, automation of this sort will likely be handled by Rancher cluster operator orchestration.

Could we document the restore procedure of the current implementation with multiple master nodes? I'm not sure if this is the exact right approach.

Follow the restore instructions from the docs. When the restore is complete you will see a message on the console:

logrus.Infof("Etcd is running, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes")

Follow those instructions - stop k3s on the other servers (if it is still running), back up and delete the referenced ${datadir}/server/db directory, then start k3s again to rejoin the cluster.
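
For anyone scripting this on the peer servers, here is a minimal sketch of the steps in that log message. It makes some assumptions that are not from this thread: the systemd unit is named k3s, the data dir defaults to /var/lib/rancher/k3s, and rejoin_peer and the SYSTEMCTL override are hypothetical names added here so the function can be dry-run. It is an illustration, not an official procedure.

```shell
#!/bin/sh
# Sketch only: reset the embedded etcd database on a peer server so it can
# rejoin the cluster after the first server was restored from a snapshot.
# Assumptions: the unit name "k3s" and the default data dir below; the
# SYSTEMCTL variable is a hypothetical hook for dry runs and tests.

rejoin_peer() {
    datadir="${1:-/var/lib/rancher/k3s}"
    systemctl_cmd="${SYSTEMCTL:-systemctl}"
    db="$datadir/server/db"
    backup="$db.bak.$(date +%Y%m%d%H%M%S)"

    # Stop k3s if it is still running.
    $systemctl_cmd stop k3s || return 1

    # Back up and delete the old database in one step by moving it aside.
    if [ -d "$db" ]; then
        mv "$db" "$backup" || return 1
    fi

    # Start k3s again so this server rejoins the restored cluster.
    $systemctl_cmd start k3s
}
```

Run it as root on each peer, one server at a time, for example by calling rejoin_peer with no arguments to use the default data dir. The old database is moved to a timestamped server/db.bak.* directory rather than deleted outright, so it can be removed once the cluster is confirmed healthy.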

Thanks, I hadn't noticed the last line.