Failed to start instances on new cluster members
roosterfish opened this issue · 13 comments
Version
Same versions of the snaps on all cluster members.
root@m1:~# snap list
Name Version Rev Tracking Publisher Notes
core20 20230801 2015 latest/stable canonical✓ base
core22 20231123 1033 latest/stable canonical✓ base
lxd 5.19-8635f82 26200 latest/stable canonical✓ -
microceph 0+git.7b5672b 707 quincy/stable canonical✓ -
microcloud 1.1-04a1c49 734 latest/stable canonical✓ -
microovn 22.03.3+snap1d18f95c73 349 22.03/stable canonical✓ -
snapd 2.60.4 20290 latest/stable canonical✓ snapd
Description
After adding a new member to the MicroCloud cluster using microcloud add, existing instances can be moved to the new cluster member but fail to start:
root@m3:~# lxc mv v1 --target m4
root@m3:~# lxc start v1
Error: Failed pre-start check for device "eth0": Network "default" unavailable on this server
Try `lxc info --show-log v1` for more info
The network's status on the new member is also reported as Unavailable:
root@m1:~# lxc network show default --target m4
config:
bridge.mtu: "1442"
ipv4.address: 10.85.238.1/24
ipv4.nat: "true"
ipv6.address: fd42:a345:26de:b041::1/64
ipv6.nat: "true"
network: UPLINK
volatile.network.ipv4.address: 10.247.231.100
description: ""
name: default
type: ovn
used_by:
- /1.0/instances/v1
- /1.0/profiles/default
managed: true
status: Unavailable
locations:
- m4
- m1
- m2
- m3
In the logs of m4 you can see the following message every minute:
Dec 12 14:44:31 m4 lxd.daemon[5657]: time="2023-12-12T14:44:31Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"default\" setup: Failed to run: ovn-nbctl --timeout=10 --db unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock --wait=sb --format=csv --no-headings --data=bare --colum=_uuid,name,acl find port_group name=lxd_net2: exit status 1 (ovn-nbctl: unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory))" network=default project=default
@roosterfish whenever you're reporting a (potential) cross-snap issue (or really any time you're reporting a MicroCloud issue), it would be useful to see the output of snap list on each server so we can see precisely which snap revisions of microcloud, lxd, microceph and microovn are installed. Thanks.
For now, a workaround is to reload the LXD daemon on the affected cluster member using systemctl reload snap.lxd.daemon. Afterwards the network reports the status Created and can be used normally.
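A minimal sketch for checking whether the workaround took effect. The helper name network_status is hypothetical; it only extracts the status field from lxc network show output in the form shown in the description. The network name default and the member name m4 come from this report, and the reload command assumes LXD is installed as a snap.

```shell
#!/bin/sh
# Hypothetical helper: read `lxc network show` YAML on stdin and print
# the value of the top-level "status:" field (e.g. Unavailable or Created).
network_status() {
    sed -n 's/^status:[[:space:]]*//p'
}

# Usage on the affected member (as root):
#   systemctl reload snap.lxd.daemon
#   lxc network show default --target m4 | network_status
```

If the helper still prints Unavailable after the reload, the LXD log on that member is the next place to look.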
@roosterfish what do the LXD logs show for the error/reason for the network not being startable?
I have updated the description.
@roosterfish @masnax the unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock reference suggests LXD was started before microovn was installed, as it doesn't seem to be using the microovn location. Is that right @masnax?
I managed to reproduce this too. I'm having a look at it. I have one question though: in a 4-node configuration, we have 3 ovn-central services to guarantee OVN HA (on m1, m2 and m3, each with a /var/snap/microovn/common/run/ovn/ovnnb_db.sock file), so the fourth node is not supposed to have an ovnnb_db.sock anyway, right (m4 only runs an ovn-chassis and an ovn-switch)? Is that right @tomponline?
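To see which northbound socket actually exists on a given member, something like the following sketch could be run on each node. The helper name check_nb_sockets is hypothetical. The usage line assumes that /var/lib/snapd/hostfs is the host's root filesystem as seen from inside the LXD snap, so the path from the error message corresponds to /run/ovn/ovnnb_db.sock on the host.

```shell
#!/bin/sh
# Hypothetical helper: for each path given, report whether it is a
# UNIX socket, exists as something else, or is missing.
check_nb_sockets() {
    for path in "$@"; do
        if [ -S "$path" ]; then
            echo "socket $path"
        elif [ -e "$path" ]; then
            echo "not-a-socket $path"
        else
            echo "missing $path"
        fi
    done
}

# Usage with the two locations mentioned in this thread:
#   check_nb_sockets /run/ovn/ovnnb_db.sock \
#       /var/snap/microovn/common/run/ovn/ovnnb_db.sock
```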
Looks like this is an issue with LXD cluster joins. It seems joining a cluster after the fact by using MemberConfig sets up OVN differently than the initial creation of the cluster does.
I'm able to reproduce this only when adding nodes to an existing cluster, whereas using the same nodes and initializing the whole cluster at that size results in the network working fine.
I'm still trying to figure out what LXD's doing exactly, but what I've gathered from the request payloads so far is that when creating the network on init, the payloads look like
{NetworkPut:{Config:map[parent:enp6s0] Description:} Name:UPLINK Type:physical}"
{NetworkPut:{Config:map[] Description:Uplink for OVN networks} Name:UPLINK Type:physical}"
{NetworkPut:{Config:map[parent:enp6s0 volatile.last_state.created:false] Description:} Name:UPLINK Type:physical}"
{NetworkPut:{Config:map[network:UPLINK] Description:Default OVN network} Name:default Type:ovn}"
{NetworkPut:{Config:map[bridge.mtu:1442 ipv4.address:10.18.8.1/24 ipv4.nat:true ipv6.address:fd42:cbc4:cc49:8d30::1/64 ipv6.nat:true network:UPLINK parent:enp6s0] Description:} Name:default Type:ovn}"
and when adding a node, it's
{NetworkPut:{Config:map[parent:enp6s0 volatile.last_state.created:false] Description:} Name:UPLINK Type:physical}"
{NetworkPut:{Config:map[bridge.mtu:1442 ipv4.address:10.18.8.1/24 ipv4.nat:true ipv6.address:fd42:cbc4:cc49:8d30::1/64 ipv6.nat:true network:UPLINK] Description:} Name:default Type:ovn}"
The main difference is that the parent config field is set for the default network when initializing the cluster, but that's not a valid key for the ovn network anyway.
Hm, indeed, that actually was it. If that final payload has parent=enp6s0 set, then the network forms properly.
From my initial testing, this appears to be fixed in LXD now. The default network is still created on the new cluster members without parent set (which is valid, since parent is a member-specific config and an ovn type network has no member-specific configuration), but this no longer seems to affect the functionality of the network on that cluster member.
@roosterfish If you remember the initial setup you used to replicate this, could you please give it a shot to ensure I'm not missing an edge case?
It looks like 5.21/stable is still affected by this, as I see the same error when starting the instance on the new member.
I suspect LXD 5.21/stable is the one we will recommend installing when we release the MicroCloud LTS?
I have deployed the following set of snaps:
lxd 5.21.2-2f4ba6b 30131 5.21/stable canonical✓ in-cohort
microceph 0+git.4a608fc 793 quincy/stable canonical✓ in-cohort
microcloud 1.1-04a1c49 734 latest/stable canonical✓ in-cohort
microovn 22.03.3+snap0e23a0e4f5 395 22.03/stable canonical✓ in-cohort
The same error occurs with MicroCloud latest/edge (I have used the latest Ceph to avoid errors with edge MicroCloud):
lxd 5.21.2-2f4ba6b 30131 5.21/stable canonical✓ in-cohort
microceph 19.2.0~git+snap36f71d7700 1148 latest/edge canonical✓ in-cohort
microcloud git-ebaa9ba 955 latest/edge canonical✓ in-cohort
microovn 22.03.3+snap0e23a0e4f5 395 22.03/stable canonical✓ in-cohort
And when using LXD latest/stable the same error still appears:
lxd 6.1-78a3d8f 30130 latest/stable canonical✓ in-cohort
microceph 19.2.0~git+snap36f71d7700 1148 latest/edge canonical✓ in-cohort
microcloud git-ebaa9ba 955 latest/edge canonical✓ in-cohort
microovn 22.03.3+snap0e23a0e4f5 395 22.03/stable canonical✓ in-cohort
The reproducer steps:
- Bootstrap cluster with members m1, m2, m3
- Start a new instance v1
- Stop the instance
- Add member m4 to the cluster
- Move the stopped instance v1 to m4
- Start the instance v1
- Error on m4 (snap logs lxd):
  2024-09-04T08:39:53Z lxd.daemon[2881]: time="2024-09-04T08:39:53Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"default\" setup: Failed to run: ovn-nbctl --timeout=10 --db unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock --wait=sb --format=csv --no-headings --data=bare --colum=_uuid,name,acl find port_group name=lxd_net2: exit status 1 (ovn-nbctl: unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory))" network=default project=default
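The reproducer steps can be sketched as a small script. This is an illustration under assumptions, not the exact commands used: the thread does not name the image, so ubuntu:22.04 is a placeholder, and the function names and the DRY_RUN switch are hypothetical. Bootstrapping the cluster and running microcloud add are left as manual steps between the two functions.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps before m4 joins: create and start v1, then stop it.
# Assumes the cluster is already bootstrapped on m1, m2 and m3.
prepare_instance() {
    image="${IMAGE:-ubuntu:22.04}"  # assumption: image is not named in the thread
    run lxc launch "$image" v1 || return 1
    run lxc stop v1
}

# Steps after m4 was added with `microcloud add`: move v1 and start it.
move_and_start() {
    run lxc mv v1 --target m4 || return 1
    run lxc start v1
}
```

With DRY_RUN=1 the functions only print the lxc commands, which is a cheap way to review them before running against a real cluster.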
@roosterfish is 5.21/edge affected?
Mh, 5.21/edge does not seem to be affected. No error when starting the instance on m4.
Using stable snaps for everything except LXD:
lxd git-75a87af 30149 5.21/edge canonical✓ in-cohort
microceph 0+git.4a608fc 793 quincy/stable canonical✓ in-cohort
microcloud 1.1-04a1c49 734 latest/stable canonical✓ in-cohort
microovn 22.03.3+snap0e23a0e4f5 395 22.03/stable canonical✓ in-cohort
Great, so it's been fixed in a backport and will be in 5.21.3.