Removing a Microcloud cluster member does not remove the underlying LXD cluster member
Given a simple three-node cluster configured like so:
root@v3:~# microcloud cluster list
+------+-------------------+-------+------------------------------------------------------------------+--------+
| NAME | ADDRESS           | ROLE  | FINGERPRINT                                                      | STATUS |
+------+-------------------+-------+------------------------------------------------------------------+--------+
| v1   | 10.10.10.67:9443  | voter | 3d4140ec40d677b2a9a4870511b144f795578f0007d32cdef962a177cf152286 | ONLINE |
+------+-------------------+-------+------------------------------------------------------------------+--------+
| v2   | 10.10.10.217:9443 | voter | 621fe0a5e252b80764fc0528e269046ff583d4e52ac17f980fdbf71a177890e6 | ONLINE |
+------+-------------------+-------+------------------------------------------------------------------+--------+
| v3   | 10.10.10.86:9443  | voter | 0967c4417e555d1bf79f345ffaa6c6c1eb1b0e8ddd73b682980860f689f998e4 | ONLINE |
+------+-------------------+-------+------------------------------------------------------------------+--------+
When I remove a MicroCloud node with microcloud cluster remove v3, for example, this works as expected (here I go to v2 and list the MicroCloud members):
root@v2:~# microcloud cluster list
+------+-------------------+-------+------------------------------------------------------------------+--------+
| NAME | ADDRESS           | ROLE  | FINGERPRINT                                                      | STATUS |
+------+-------------------+-------+------------------------------------------------------------------+--------+
| v1   | 10.10.10.67:9443  | voter | 3d4140ec40d677b2a9a4870511b144f795578f0007d32cdef962a177cf152286 | ONLINE |
+------+-------------------+-------+------------------------------------------------------------------+--------+
| v2   | 10.10.10.217:9443 | spare | 621fe0a5e252b80764fc0528e269046ff583d4e52ac17f980fdbf71a177890e6 | ONLINE |
+------+-------------------+-------+------------------------------------------------------------------+--------+
But on every node, if I run lxc cluster list, I still see all three members:
root@v3:~# lxc cluster list
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| NAME | URL                       | ROLES           | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  | MESSAGE           |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| v1   | https://10.10.10.67:8443  | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| v2   | https://10.10.10.217:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| v3   | https://10.10.10.86:8443  | database-leader | x86_64       | default        |             | ONLINE | Fully operational |
|      |                           | database        |              |                |             |        |                   |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
This behaviour is not symmetric with microcloud init, which creates the underlying LXD cluster members. I would expect microcloud cluster remove <node_name> to remove the underlying LXD cluster member (the one listed by lxc cluster list) as well.
I'm also curious how this behaves with MicroCeph/MicroOVN: does microcloud cluster remove <node_name> also trigger an automatic microceph cluster remove <node_name> / microovn cluster remove <node_name>? I don't know what the expected behaviour is here, but I'd say that if we remove a MicroCloud node, we would also want to remove its associated node from the MicroCeph / MicroOVN clusters, since they are meant to work together.
@masnax @markylaing do you know what the expected behaviour here is? Thanks
It looks like the CLI only removes the microcluster member and does not make any calls to LXD, Ceph, or OVN:
microcloud/microcloud/cmd/microcloud/cluster_members.go (lines 118 to 140 in da6d8a4)
I agree with @gabrielmougard: this should remove the node from all of them. We will need to figure out what to do with running instances, especially those on local storage.
@markylaing there is also #33, which previously mentioned the problem we're trying to solve.
I think it would be fair to error out when trying to remove a node that has local instances. The user should decide what to do with those instances before removing the node. Maybe a force flag could nuke the node and its instances if it's unresponsive. Instances on Ceph can be moved, though that raises the question of whether their placement should follow LXD's cluster scheduling or be user-defined.
I think it would make sense, for the time being, to look into adding a Remove function for each service that calls the respective cluster remove API hook:
- MicroOVN supposedly supports this fully already, so that one is straightforward.
- LXD can check for local instances and fail if --force is not given.
- MicroCeph won't work for now, though, so we will need to return an error if it's installed.
We could have an IsRemovable function that performs these validations on all services before proceeding to the Remove step.
Sounds good!
> I think it would make sense for the time being to look into adding a Remove function for each service that calls the respective cluster remove API hook. Supposedly MicroOVN fully supports this already, so that one is straightforward.
MicroOVN uses a microcluster hook to define how a member is removed:
- https://github.com/canonical/microovn/blob/da2e39f55c95a9a02b1d329d43b2e991511c047a/microovn/ovn/leave.go
- https://github.com/canonical/microovn/blob/da2e39f55c95a9a02b1d329d43b2e991511c047a/microovn/cmd/microovnd/main.go#L72
Since MicroCeph also uses microcluster, it can do the same. We'll just need to implement the logic for LXD.
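To illustrate the difference, here is a toy model. It assumes a hook shaped only loosely like microcluster's: the Hooks struct, its PreRemove field and clusterDaemon are hypothetical names, not microcluster's real API. A microcluster-based daemon runs its registered hook when a member is removed, whereas LXD is not built on microcluster, so MicroCloud has to call it explicitly. The lxdMembers interface assumes that LXD's Go client exposes a DeleteClusterMember(name, force) method; fakeLXD stands in for a real cluster.

```go
package main

import "fmt"

// Hooks loosely models the microcluster hook idea: a service registers what
// must happen when a member leaves. The field name is hypothetical.
type Hooks struct {
	PreRemove func(member string, force bool) error
}

// clusterDaemon is a toy stand-in for a microcluster-based daemon such as
// MicroOVN or MicroCeph: removing a member runs the service's hook first.
type clusterDaemon struct {
	members map[string]bool
	hooks   Hooks
}

func (d *clusterDaemon) RemoveMember(name string, force bool) error {
	if !d.members[name] {
		return fmt.Errorf("member %q not found", name)
	}
	if d.hooks.PreRemove != nil {
		if err := d.hooks.PreRemove(name, force); err != nil {
			return fmt.Errorf("pre-remove hook for %q: %w", name, err)
		}
	}
	delete(d.members, name)
	return nil
}

// lxdMembers is an assumed minimal interface modelled on LXD's Go client.
// LXD has no microcluster hook, so MicroCloud must make this call itself.
type lxdMembers interface {
	DeleteClusterMember(name string, force bool) error
}

// fakeLXD records deletions instead of talking to a real LXD cluster.
type fakeLXD struct{ members map[string]bool }

func (f *fakeLXD) DeleteClusterMember(name string, force bool) error {
	if !f.members[name] {
		return fmt.Errorf("cluster member %q not found", name)
	}
	delete(f.members, name)
	return nil
}

func main() {
	var cleaned []string
	ovn := &clusterDaemon{
		members: map[string]bool{"v1": true, "v2": true, "v3": true},
		hooks: Hooks{PreRemove: func(member string, force bool) error {
			cleaned = append(cleaned, member) // e.g. stop services on the departing member
			return nil
		}},
	}
	var lxd lxdMembers = &fakeLXD{members: map[string]bool{"v1": true, "v2": true, "v3": true}}

	// The microcluster-based service gets its cleanup through the hook...
	if err := ovn.RemoveMember("v3", false); err != nil {
		fmt.Println("microovn:", err)
	}
	// ...while LXD needs an explicit call from MicroCloud.
	if err := lxd.DeleteClusterMember("v3", false); err != nil {
		fmt.Println("lxd:", err)
	}
	fmt.Println(cleaned, len(ovn.members)) // [v3] 2
}
```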