canonical/microcloud

Removing a machine that failed to add properly leaves internal tokens behind


I tried to add a new machine to my microcloud cluster, and it failed because of the networking configuration. However, microcloud still saw the new machine as a cluster member, so I could not try to add it back again.

I ran sudo microcloud cluster remove <name> --force and tried to add the machine again, and got the error:

Error: Failed to issue MicroCloud token for peer "<name>": Failed to create "internal_token_records" entry: UNIQUE constraint failed: internal_token_records.name

I poked around a bit and tried to manually remove the token:

sudo microcloud sql "DELETE FROM internal_token_records WHERE name='<name>'"

Running a SELECT afterwards shows that the token is gone. However, trying to add the machine again fails with the same error, and a new token is present in the table.

Running the add command with debug and verbose does not give more information. I can also see the machine's IP in /var/snap/microcloud/common/state/database/cluster.yaml.

The state that is left behind prevents me from trying to add that machine again.

Looking specifically at microovn and microceph, the machine was also present in their cluster lists, so I removed it from there as well, but the issue remains.

Removing nodes is not fully supported at the moment. The cluster remove command only applies to the microcloud daemon, not to microceph, microovn, or lxd. Since you ran the add command again, each of those daemons may have registered a token, so you will have to clear out the join token on all 4 daemons as well.

For each app, you can use sql like you did for microcloud, except for LXD, which has the lxc cluster revoke-token command instead. Then remove the cluster member with cluster remove --force, and finally make sure that none of the systems running any of the 3 micro apps still has a <name>.yaml file at /var/snap/<app>/common/state/truststore for the node you just removed.
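Roughly, the whole cleanup could look like the sketch below. The microceph and microovn sql invocations are my best guess at the equivalent subcommands and the flags may differ between snaps, so check each snap's --help first and substitute the real member name for <name>.

# Clear the stale join token on each daemon
sudo microcloud sql "DELETE FROM internal_token_records WHERE name='<name>'"
sudo microceph cluster sql "DELETE FROM internal_token_records WHERE name='<name>'"   # assumed equivalent subcommand
sudo microovn cluster sql "DELETE FROM internal_token_records WHERE name='<name>'"    # assumed equivalent subcommand
lxc cluster revoke-token <name>   # LXD manages its tokens through its own CLI

# Remove the failed member from each cluster
sudo microcloud cluster remove <name> --force
sudo microceph cluster remove <name> --force
sudo microovn cluster remove <name> --force
lxc cluster remove <name> --force

# On every remaining machine, delete any stale truststore entry
sudo rm -f /var/snap/microcloud/common/state/truststore/<name>.yaml
sudo rm -f /var/snap/microceph/common/state/truststore/<name>.yaml
sudo rm -f /var/snap/microovn/common/state/truststore/<name>.yaml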

Thank you, this seems to have done the trick. However, I am still unable to add the node successfully, and with each attempt I have to go through the procedure you provided, remove and purge the snaps on the new node, reboot, and try again. The error I am getting now is context deadline exceeded, which is not very helpful.

It would be great if microcloud cleaned up properly when the add fails. It would make it a lot easier to test configurations.

At what point are you receiving that error? What's the prior output?

Did the node fail to join any of the clusters? Does it appear in the cluster list for any of microovn, microceph, lxd, or microcloud?
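For example, something along these lines should show whether the node is still registered anywhere (the exact subcommand may vary per snap):

microcloud cluster list
microceph cluster list
microovn cluster list
lxc cluster list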

I do not have the output anymore, but it is currently showing in the cluster list for everything except lxd.

OK then it looks like LXD is not cleaning up properly after a failed join. I bet it's connected to canonical/lxd#12624. It's probably salvageable if you manually add some dummy entries into the LXD database corresponding to the failed node with the sql command and then call cluster remove --force from LXD.

There are a couple of tables that are important here, notably certificates, nodes, nodes_roles, and maybe nodes_cluster_groups. For LXD you'll have to inspect both the global database and the local one.
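As a very rough sketch, assuming the lxd sql command and using placeholder column values that you would need to adjust after inspecting the real schema and the existing rows:

# Inspect the schema and existing rows first so any dummy row matches them
lxd sql global "SELECT * FROM nodes"
lxd sql global "SELECT * FROM certificates"

# Illustrative dummy entry only; column names and values must match your LXD version
lxd sql global "INSERT INTO nodes (name, address, schema, api_extensions, arch) VALUES ('<name>', '<ip>:8443', 0, 0, 0)"

# With the node visible to LXD again, remove it the normal way
lxc cluster remove <name> --force

# Also check the local database on each remaining member
lxd sql local "SELECT * FROM raft_nodes"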