canonical/microcloud

Quick recovery after cluster member failure, how?


I'm testing MicroCloud further here and wanted to simulate a catastrophic cluster member failure. That's easy enough to do in my sandbox setup, where I have MicroCloud set up on 3 VMs on the same physical host.

I simply ran `lxc stop --force` on the VM that had a running LXD container on it, in order to simulate a crash.
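
For reference, the forced stop was roughly the following, run on the physical host (a minimal sketch; "lxdvm2" is an assumption based on the container's LOCATION in the transcript below):

lxc stop --force lxdvm2    # hard power-off of the member VM, simulating a crash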

I then (maybe naively) assumed that the amazing thing about clustered LXD and Ceph would be that I could just spin the container up immediately on another cluster member. Right? However, I had a hard time finding information about the recommended steps online, so I just tried the following:

# lxc exec lxdvm1 bash
root@lxdvm1:~# lxc ls
+----------+-------+------+------+-----------+-----------+----------+
|   NAME   | STATE | IPV4 | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+----------+-------+------+------+-----------+-----------+----------+
| lxdtest1 | ERROR |      |      | CONTAINER | 0         | lxdvm2   |
+----------+-------+------+------+-----------+-----------+----------+
root@lxdvm1:~# lxc start lxdtest1
Error: Get "https://10.1.255.88:8443/1.0/instances/lxdtest1": Unable to connect to: 10.1.255.88:8443 ([dial tcp 10.1.255.88:8443: connect: no route to host])
root@lxdvm1:~# lxc ls
+----------+-------+------+------+-----------+-----------+----------+
|   NAME   | STATE | IPV4 | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+----------+-------+------+------+-----------+-----------+----------+
| lxdtest1 | ERROR |      |      | CONTAINER | 0         | lxdvm2   |
+----------+-------+------+------+-----------+-----------+----------+
root@lxdvm1:~# lxc move lxdtest1 --target lxdvm1
root@lxdvm1:~# lxc ls
+----------+---------+------+------+-----------+-----------+----------+
|   NAME   |  STATE  | IPV4 | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+----------+---------+------+------+-----------+-----------+----------+
| lxdtest1 | STOPPED |      |      | CONTAINER | 1         | lxdvm1   |
+----------+---------+------+------+-----------+-----------+----------+
root@lxdvm1:~# lxc start lxdtest1


Error: User signaled us three times, exiting. The remote operation will keep running
Try `lxc info --show-log lxdtest1` for more info

As you can see, the `lxc start` just hangs. I tried a few times, but it just sits there, and `lxc info --show-log` reveals nothing useful.

Is this not how it's supposed to work? Surely being able to recover quickly from a node going down is one of the core points of all this clustering/Ceph goodness, or am I just thinking about this wrong? :)

Thank you for any insights you can provide here.

After some more testing today, this just worked on a second test run. I also found an old forum post from Stephane Graber confirming that this is indeed the way to do it. I am happy it works, but I am at a loss as to why it didn't work the first time.
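
For anyone landing here later, the steps that worked for me boil down to the following, run from any surviving member (the `lxc cluster list` check is just my own sanity check, not something from the forum post):

lxc cluster list                     # confirm which member is shown as offline
lxc move lxdtest1 --target lxdvm1    # relocate the instance to a healthy member; the Ceph-backed volume is reachable from any member
lxc start lxdtest1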

Yesterday I tried running `lxc monitor` to see what was happening, and it seemed the start operation was processed but left in a "pending" state. What LXD was waiting for, and why, I can't tell. I suspect it was my environment and related to networking, as I've been having some problems with that on the lxdvm1 instance, so my messy test environment is probably to blame here.
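
In case it helps anyone debugging the same thing, this is roughly how I was watching it (a sketch using standard `lxc` options; run these in a separate shell while retrying `lxc start`):

lxc monitor --pretty --type=operation    # stream operation events as the start is attempted
lxc operation list                       # list queued operations and their status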

Anyway, this works for now. I'll update here if I hit this particular issue again, as I'll be doing a lot of testing across various scenarios in the coming week or so. If I see nothing further, I'll make sure to close this.