Quick recovery after cluster member failure, how?
webdock-io opened this issue · 2 comments
Testing MicroCloud further here, I wanted to simulate a catastrophic failure of a cluster member. That's easy enough to do in my sandbox setup, where I have MicroCloud set up on 3 VMs on the same physical host.
I simply ran `lxc stop --force` on a VM which had a running LXD container on it, in order to simulate a crash.
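Concretely, the crash simulation from the physical host was roughly this (a sketch; the VM names are from my test setup):

```sh
# On the physical host: hard-stop one of the three MicroCloud member VMs,
# so neither LXD nor Ceph inside it gets a clean shutdown.
lxc stop --force lxdvm2
```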
I then - maybe naively - assumed that the amazing thing about clustered LXD and Ceph would be that I could just spin the container up immediately on another cluster member. Right? However, I had a hard time finding information about the recommended steps online, so I just tried the following:
```
# lxc exec lxdvm1 bash
root@lxdvm1:~# lxc ls
+----------+-------+------+------+-----------+-----------+----------+
|   NAME   | STATE | IPV4 | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+----------+-------+------+------+-----------+-----------+----------+
| lxdtest1 | ERROR |      |      | CONTAINER | 0         | lxdvm2   |
+----------+-------+------+------+-----------+-----------+----------+
root@lxdvm1:~# lxc start lxdtest1
Error: Get "https://10.1.255.88:8443/1.0/instances/lxdtest1": Unable to connect to: 10.1.255.88:8443 ([dial tcp 10.1.255.88:8443: connect: no route to host])
root@lxdvm1:~# lxc ls
+----------+-------+------+------+-----------+-----------+----------+
|   NAME   | STATE | IPV4 | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+----------+-------+------+------+-----------+-----------+----------+
| lxdtest1 | ERROR |      |      | CONTAINER | 0         | lxdvm2   |
+----------+-------+------+------+-----------+-----------+----------+
root@lxdvm1:~# lxc move lxdtest1 --target lxdvm1
root@lxdvm1:~# lxc ls
+----------+---------+------+------+-----------+-----------+----------+
|   NAME   |  STATE  | IPV4 | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+----------+---------+------+------+-----------+-----------+----------+
| lxdtest1 | STOPPED |      |      | CONTAINER | 1         | lxdvm1   |
+----------+---------+------+------+-----------+-----------+----------+
root@lxdvm1:~# lxc start lxdtest1
Error: User signaled us three times, exiting. The remote operation will keep running
Try `lxc info --show-log lxdtest1` for more info
```
As you can see, `lxc start` just hangs. I tried a few times, but it just sits there, and `lxc info --show-log` reveals nothing useful.
Is this not how it's supposed to work? Surely being able to recover quickly from a node going down is one of the core points of all this clustering/Ceph goodness, or am I just thinking about this wrong? :)
Thank you for any insights you can provide here.
After some more testing today, on a second test run this just worked. I found an old forum post from Stephane Graber confirming that this is indeed the way to do it. I am happy it works, but I am at a loss as to why it didn't work the first time.
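For anyone else landing here, the sequence that worked on the second run was roughly the following (from my shell history, give or take; it relies on the container's root disk living on the shared Ceph pool, so no data needs to be copied):

```sh
# Run on any surviving cluster member. The instance shows STATE = ERROR
# while the member it was running on is unreachable.
lxc ls

# Re-target the instance to a healthy member. With remote (Ceph) storage
# this essentially just updates where the cluster expects to run it.
lxc move lxdtest1 --target lxdvm1

# Start it on its new member.
lxc start lxdtest1
```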
Yesterday I tried running `lxc monitor` to see what was happening, and the start operation was accepted but seemed to be left in a "pending" state. What LXD was waiting for, and why, I can't tell. I suspect it was my environment and related to networking, as I've been having some problems with that on the lxdvm1 instance - so this is probably just my messy test environment to blame here.
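In case it helps anyone debugging something similar, this is roughly how I was watching it (the `lxc operation` commands are an extra suggestion, not something I ran at the time):

```sh
# Stream daemon events; filtering to operations cuts down the noise.
lxc monitor --type=operation --pretty

# Inspect current operations and their status; a stuck start should show up here.
lxc operation list
lxc operation show <operation-uuid>   # placeholder UUID taken from the list above
```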
Anyway, this works - for now. I'll update here if I hit this particular issue again, as I'll be doing a lot of testing of various scenarios over the coming week or so. If I see nothing further, I'll make sure to close this.
@webdock-io Hi, do you still see this issue?