canonical/microcloud

Removing a Microcloud cluster member does not remove the underlying LXD cluster member

Opened this issue · 6 comments

Given a simple 3-node cluster configuration like so:

root@v3:~# microcloud cluster list
+------+-------------------+-------+------------------------------------------------------------------+--------+
| NAME |      ADDRESS      |  ROLE |                           FINGERPRINT                            | STATUS |
+------+-------------------+-------+------------------------------------------------------------------+--------+
| v1   | 10.10.10.67:9443  | voter | 3d4140ec40d677b2a9a4870511b144f795578f0007d32cdef962a177cf152286 | ONLINE |
+------+-------------------+-------+------------------------------------------------------------------+--------+
| v2   | 10.10.10.217:9443 | voter | 621fe0a5e252b80764fc0528e269046ff583d4e52ac17f980fdbf71a177890e6 | ONLINE |
+------+-------------------+-------+------------------------------------------------------------------+--------+
| v3   | 10.10.10.86:9443  | voter | 0967c4417e555d1bf79f345ffaa6c6c1eb1b0e8ddd73b682980860f689f998e4 | ONLINE |
+------+-------------------+-------+------------------------------------------------------------------+--------+

When I remove a microcloud node, for example with microcloud cluster remove v3, this works as expected (here I go on v2 and list the microcloud members):

root@v2:~# microcloud cluster list
+------+-------------------+-------+------------------------------------------------------------------+--------+
| NAME |      ADDRESS      | ROLE  |                           FINGERPRINT                            | STATUS |
+------+-------------------+-------+------------------------------------------------------------------+--------+
| v1   | 10.10.10.67:9443  | voter | 3d4140ec40d677b2a9a4870511b144f795578f0007d32cdef962a177cf152286 | ONLINE |
+------+-------------------+-------+------------------------------------------------------------------+--------+
| v2   | 10.10.10.217:9443 | spare | 621fe0a5e252b80764fc0528e269046ff583d4e52ac17f980fdbf71a177890e6 | ONLINE |
+------+-------------------+-------+------------------------------------------------------------------+--------+

But on every node, if I run lxc cluster list, I still see all the members:

root@v3:~# lxc cluster list
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| NAME |            URL            |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| v1   | https://10.10.10.67:8443  | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| v2   | https://10.10.10.217:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| v3   | https://10.10.10.86:8443  | database-leader | x86_64       | default        |             | ONLINE | Fully operational |
|      |                           | database        |              |                |             |        |                   |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+

This behaviour is not very 'symmetric' with microcloud init that creates underlying LXD cluster members. I would expect microcloud cluster remove <node_name> to remove the underlying LXD cluster member (the one listed with lxc cluster list) as well.

I'm also curious to know how it behaves with microceph/microovn: does a microcloud cluster remove <node_name> trigger an automatic microceph cluster remove <node_name> / microovn cluster remove <node_name> as well? I don't know what the expected behaviour is here, but I'd say that if we remove a microcloud node, we would also like to remove its associated node in the microceph / microovn cluster, as they are meant to work together.

@masnax @markylaing do you know what the expected behaviour here is? Thanks

It looks like the CLI only removes the microcluster member and does not make any calls to LXD, Ceph, or OVN:

func (c *cmdClusterMemberRemove) Run(cmd *cobra.Command, args []string) error {
	if len(args) != 1 {
		return cmd.Help()
	}

	options := microcluster.Args{StateDir: c.common.FlagMicroCloudDir, Verbose: c.common.FlagLogVerbose, Debug: c.common.FlagLogDebug}
	m, err := microcluster.App(context.Background(), options)
	if err != nil {
		return err
	}

	client, err := m.LocalClient()
	if err != nil {
		return err
	}

	err = client.DeleteClusterMember(context.Background(), args[0], c.flagForce)
	if err != nil {
		return err
	}

	return nil
}

I agree with @gabrielmougard this should remove the node from all of them. We will need to figure out what to do with running instances, especially those on local storage.

@markylaing there is also #33, which previously mentioned the problem we're trying to solve.

masnax commented

I think it would be fair to error out if trying to remove a node with local instances. The user should sort out what they want to do with those instances first before removing the node. Maybe a force flag can nuke the node and its instances if it's unresponsive. Ceph instances can be moved, though that raises the question of whether they should be placed according to LXD's cluster scheduling or a user-defined target.

I think it would make sense for the time being to look into adding a Remove function for each service that calls the respective cluster remove API hook.

Supposedly MicroOVN fully supports this already, so that one is straightforward.

LXD can check for local instances and fail if --force is not given.

MicroCeph won't work for now though, so we will need to error if that's installed.

We could have an IsRemovable function that performs these validations on all services before progressing to the Remove step.
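To make the proposal concrete, here is a rough sketch of the validate-then-remove flow. Everything in it is an assumption made for illustration: the Service interface, the fakeService type and the RemoveMember helper are hypothetical names, not MicroCloud's actual code, and the in-memory maps stand in for real API calls to LXD, MicroCeph and MicroOVN.

```go
package main

import "fmt"

// Service is a hypothetical abstraction over one clustered service
// (LXD, MicroCeph, MicroOVN). It is NOT MicroCloud's real interface.
type Service interface {
	Name() string
	// IsRemovable returns nil if the member can be removed, or the reason it cannot.
	IsRemovable(member string, force bool) error
	// Remove takes the member out of this service's cluster.
	Remove(member string, force bool) error
}

// fakeService is an in-memory stand-in so the sketch can run without a cluster.
type fakeService struct {
	name           string
	members        map[string]bool
	localInstances map[string]int // member name -> instances on local storage
	supported      bool           // false models a service that cannot remove members yet
}

func (s *fakeService) Name() string { return s.name }

func (s *fakeService) IsRemovable(member string, force bool) error {
	if !s.supported {
		return fmt.Errorf("%s: member removal is not supported yet", s.name)
	}
	if !s.members[member] {
		return fmt.Errorf("%s: no such member %q", s.name, member)
	}
	if n := s.localInstances[member]; n > 0 && !force {
		return fmt.Errorf("%s: %q still has %d local instance(s), pass force to remove anyway", s.name, member, n)
	}
	return nil
}

func (s *fakeService) Remove(member string, force bool) error {
	delete(s.members, member)
	delete(s.localInstances, member)
	return nil
}

// RemoveMember checks every service first and only starts removing once all
// of them agree, so a failed check leaves every cluster untouched.
func RemoveMember(services []Service, member string, force bool) error {
	for _, s := range services {
		if err := s.IsRemovable(member, force); err != nil {
			return err
		}
	}
	for _, s := range services {
		if err := s.Remove(member, force); err != nil {
			return fmt.Errorf("removing %q from %s: %w", member, s.Name(), err)
		}
	}
	return nil
}

func main() {
	lxd := &fakeService{name: "lxd", supported: true,
		members:        map[string]bool{"v1": true, "v2": true, "v3": true},
		localInstances: map[string]int{"v3": 2}}
	ovn := &fakeService{name: "microovn", supported: true,
		members: map[string]bool{"v1": true, "v2": true, "v3": true}}
	services := []Service{lxd, ovn}

	fmt.Println("without force:", RemoveMember(services, "v3", false))
	fmt.Println("with force:", RemoveMember(services, "v3", true))
}
```

Running all checks before the first removal is the point of the separate IsRemovable step: if LXD refuses because of local instances, MicroOVN has not been modified yet, so there is nothing to roll back.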

Sounds good!

> I think it would make sense for the time being to look into adding a Remove function for each service that calls the respective cluster remove API hook.
>
> Supposedly MicroOVN fully supports this already, so that one is straightforward.

MicroOVN uses a microcluster hook to define how a member is removed:

Since microceph also uses microcluster it can do the same. We'll just need to implement the logic for LXD.