canonical/microcloud

Failed to start instances on new cluster members

roosterfish opened this issue · 13 comments

Version

Same versions of the snaps on all cluster members.

root@m1:~# snap list
Name        Version                 Rev    Tracking       Publisher   Notes
core20      20230801                2015   latest/stable  canonical✓  base
core22      20231123                1033   latest/stable  canonical✓  base
lxd         5.19-8635f82            26200  latest/stable  canonical✓  -
microceph   0+git.7b5672b           707    quincy/stable  canonical✓  -
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  -
microovn    22.03.3+snap1d18f95c73  349    22.03/stable   canonical✓  -
snapd       2.60.4                  20290  latest/stable  canonical✓  snapd

Description

After adding a new member to the MicroCloud cluster using `microcloud add`, existing instances can be moved to the new cluster member but fail to start:

root@m3:~# lxc mv v1 --target m4
root@m3:~# lxc start v1
Error: Failed pre-start check for device "eth0": Network "default" unavailable on this server
Try `lxc info --show-log v1` for more info

The network's status on the new member is also marked as Unavailable:

root@m1:~# lxc network show default --target m4
config:
  bridge.mtu: "1442"
  ipv4.address: 10.85.238.1/24
  ipv4.nat: "true"
  ipv6.address: fd42:a345:26de:b041::1/64
  ipv6.nat: "true"
  network: UPLINK
  volatile.network.ipv4.address: 10.247.231.100
description: ""
name: default
type: ovn
used_by:
- /1.0/instances/v1
- /1.0/profiles/default
managed: true
status: Unavailable
locations:
- m4
- m1
- m2
- m3

In the logs of m4 you can see the following message every minute:

Dec 12 14:44:31 m4 lxd.daemon[5657]: time="2023-12-12T14:44:31Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"default\" setup: Failed to run: ovn-nbctl --timeout=10 --db unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock --wait=sb --format=csv --no-headings --data=bare --colum=_uuid,name,acl find port_group name=lxd_net2: exit status 1 (ovn-nbctl: unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory))" network=default project=default

@roosterfish whenever you're reporting a (potential) cross-snap issue (or really any time you're reporting a MicroCloud issue), it would be useful to see the output of `snap list` on each server so we can see precisely which snap revisions of microcloud, lxd, microceph and microovn are installed. Thanks

For now a workaround is to reload the LXD daemon on the affected cluster member using `systemctl reload snap.lxd.daemon`. Afterwards the network reports the status Created and can be used as expected.
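
For reference, the full workaround looks roughly like this (assuming m4 is the affected member and v1 the moved instance, as above):

root@m4:~# systemctl reload snap.lxd.daemon
root@m1:~# lxc network show default --target m4 | grep status
status: Created
root@m1:~# lxc start v1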

@roosterfish what do the LXD logs show for the error/reason for the network not being startable?

I have updated the description.

@roosterfish @masnax the unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock reference suggests LXD was started before microovn was installed, as it doesn't seem to be using the microovn socket location. Is that right @masnax?
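
A quick way to check which northbound database LXD is actually pointed at on the affected member (a sketch; network.ovn.northbound_connection is the LXD server key that MicroCloud normally sets for this, and the MicroOVN sockets live under /var/snap/microovn/common/run/ovn/):

root@m4:~# lxc config get network.ovn.northbound_connection
root@m4:~# ls -l /var/snap/microovn/common/run/ovn/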

I managed to reproduce this too and I'm having a look at it. One question though: in a 4-node configuration we only have 3 ovn-central services anyway to guarantee OVN HA (on m1, m2 and m3, each with a /var/snap/microovn/common/run/ovn/ovnnb_db.sock file), so the fourth node is not supposed to have an ovnnb_db.sock at all, right (m4 only runs ovn-chassis and ovn-switch)? Is that right @tomponline?
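
For context, `microovn status` lists the services running on each cluster member, so it's a quick way to confirm which members actually host ovn-central (and therefore a local ovnnb_db.sock):

root@m1:~# microovn status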

Looks like this is an issue with LXD cluster joins. It seems that joining a cluster after the fact using MemberConfig sets up OVN differently than the initial creation of the cluster does.

I'm able to reproduce this only when adding nodes to an existing cluster, whereas using the same nodes and initializing the whole cluster at that size results in the network working fine.

I'm still trying to figure out what LXD's doing exactly, but what I've gathered from the request payloads so far is that when creating the network on init, the payloads look like

{NetworkPut:{Config:map[parent:enp6s0] Description:} Name:UPLINK Type:physical}
{NetworkPut:{Config:map[] Description:Uplink for OVN networks} Name:UPLINK Type:physical}
{NetworkPut:{Config:map[parent:enp6s0 volatile.last_state.created:false] Description:} Name:UPLINK Type:physical}
{NetworkPut:{Config:map[network:UPLINK] Description:Default OVN network} Name:default Type:ovn}
{NetworkPut:{Config:map[bridge.mtu:1442 ipv4.address:10.18.8.1/24 ipv4.nat:true ipv6.address:fd42:cbc4:cc49:8d30::1/64 ipv6.nat:true network:UPLINK parent:enp6s0] Description:} Name:default Type:ovn}

and when adding a node, it's

{NetworkPut:{Config:map[parent:enp6s0 volatile.last_state.created:false] Description:} Name:UPLINK Type:physical}
{NetworkPut:{Config:map[bridge.mtu:1442 ipv4.address:10.18.8.1/24 ipv4.nat:true ipv6.address:fd42:cbc4:cc49:8d30::1/64 ipv6.nat:true network:UPLINK] Description:} Name:default Type:ovn}

The main difference is that the parent config field is set for the default network when initializing the cluster, even though that's not a valid key for an ovn network anyway.

Hm, indeed that actually was it. If that final payload has parent=enp6s0 set, then the network forms properly.
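
For comparison, a member-specific key like parent is normally carried in the member_config section of the join preseed when a node is added to an existing LXD cluster; a rough sketch (all values here are placeholders):

cluster:
  enabled: true
  server_address: 10.0.0.4:8443
  cluster_token: <token>
  member_config:
  - entity: network
    name: UPLINK
    key: parent
    value: enp6s0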

From my initial testing, this appears to be fixed in LXD now. The default network is still created on the new cluster members without parent set (which is valid, since parent is a member-specific config and an ovn type network has no member-specific configuration), but this no longer seems to affect the functionality of the network on that cluster member.
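
A quick way to verify that per-member state is to check what each member reports for the network; they should all show Created (a sketch, assuming members m1 through m4):

# Print the status of the default network as reported by each cluster member.
for m in m1 m2 m3 m4; do
  echo "$m: $(lxc network show default --target "$m" | awk '/^status:/ {print $2}')"
done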

@roosterfish If you remember the initial setup you used to replicate this, could you please give it a shot to ensure I'm not missing an edge case?

It looks like 5.21/stable is still affected by this, as I see the same error when starting the instance on the new member.

I suspect LXD 5.21/stable is the one we will recommend installing when we release the MicroCloud LTS?

I have deployed the following set of snaps:

lxd         5.21.2-2f4ba6b          30131  5.21/stable    canonical✓  in-cohort
microceph   0+git.4a608fc           793    quincy/stable  canonical✓  in-cohort
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  in-cohort
microovn    22.03.3+snap0e23a0e4f5  395    22.03/stable   canonical✓  in-cohort

The same error occurs with MicroCloud latest/edge (I have used the latest Ceph to avoid errors with edge MicroCloud):

lxd         5.21.2-2f4ba6b             30131  5.21/stable    canonical✓  in-cohort
microceph   19.2.0~git+snap36f71d7700  1148   latest/edge    canonical✓  in-cohort
microcloud  git-ebaa9ba                955    latest/edge    canonical✓  in-cohort
microovn    22.03.3+snap0e23a0e4f5     395    22.03/stable   canonical✓  in-cohort

And when using LXD latest/stable the same error still appears:

lxd         6.1-78a3d8f                30130  latest/stable  canonical✓  in-cohort
microceph   19.2.0~git+snap36f71d7700  1148   latest/edge    canonical✓  in-cohort
microcloud  git-ebaa9ba                955    latest/edge    canonical✓  in-cohort
microovn    22.03.3+snap0e23a0e4f5     395    22.03/stable   canonical✓  in-cohort

The reproducer steps (see the command sketch after the list):

  1. Bootstrap cluster with members m1, m2, m3
  2. Start a new instance v1
  3. Stop the instance
  4. Add member m4 to the cluster
  5. Move the stopped instance v1 to m4
  6. Start the instance v1
  7. Error on m4 (snap logs lxd): 2024-09-04T08:39:53Z lxd.daemon[2881]: time="2024-09-04T08:39:53Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"default\" setup: Failed to run: ovn-nbctl --timeout=10 --db unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock --wait=sb --format=csv --no-headings --data=bare --colum=_uuid,name,acl find port_group name=lxd_net2: exit status 1 (ovn-nbctl: unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory))" network=default project=default
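
As shell commands, the steps above look roughly like this (the image alias is an assumption; m4 only needs the snaps installed before step 4):

root@m1:~# lxc launch ubuntu:24.04 v1 --vm    # step 2, cluster m1/m2/m3 already bootstrapped
root@m1:~# lxc stop v1                        # step 3
root@m1:~# microcloud add                     # step 4, interactively selects m4
root@m1:~# lxc mv v1 --target m4              # step 5
root@m1:~# lxc start v1                       # step 6, fails with the pre-start check error
root@m4:~# snap logs lxd                      # step 7, shows "Failed initializing network"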

@roosterfish is 5.21/edge affected?

Mh, 5.21/edge does not seem to be affected. No error when starting the instance on m4.

Using the stable snaps for everything else:

lxd         git-75a87af             30149  5.21/edge      canonical✓  in-cohort
microceph   0+git.4a608fc           793    quincy/stable  canonical✓  in-cohort
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  in-cohort
microovn    22.03.3+snap0e23a0e4f5  395    22.03/stable   canonical✓  in-cohort

Great, so it's been fixed by a backport and will be in 5.21.3.