Failed to start instances on new cluster members
Version
Same versions of the snaps on all cluster members.
```
root@m1:~# snap list
Name       Version                 Rev    Tracking       Publisher   Notes
core20     20230801                2015   latest/stable  canonical✓  base
core22     20231123                1033   latest/stable  canonical✓  base
lxd        5.19-8635f82            26200  latest/stable  canonical✓  -
microceph  0+git.7b5672b           707    quincy/stable  canonical✓  -
microcloud 1.1-04a1c49             734    latest/stable  canonical✓  -
microovn   22.03.3+snap1d18f95c73  349    22.03/stable   canonical✓  -
snapd      2.60.4                  20290  latest/stable  canonical✓  snapd
```
Description
After adding a new member to the MicroCloud cluster using `microcloud add`, existing instances can be moved to the new cluster member but fail to start:
```
root@m3:~# lxc mv v1 --target m4
root@m3:~# lxc start v1
Error: Failed pre-start check for device "eth0": Network "default" unavailable on this server
Try `lxc info --show-log v1` for more info
```
The network's status on the new member is also marked as `Unavailable`:
```
root@m1:~# lxc network show default --target m4
config:
  bridge.mtu: "1442"
  ipv4.address: 10.85.238.1/24
  ipv4.nat: "true"
  ipv6.address: fd42:a345:26de:b041::1/64
  ipv6.nat: "true"
  network: UPLINK
  volatile.network.ipv4.address: 10.247.231.100
description: ""
name: default
type: ovn
used_by:
- /1.0/instances/v1
- /1.0/profiles/default
managed: true
status: Unavailable
locations:
- m4
- m1
- m2
- m3
```
In the logs of `m4` you can see the following message every minute:

```
Dec 12 14:44:31 m4 lxd.daemon[5657]: time="2023-12-12T14:44:31Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"default\" setup: Failed to run: ovn-nbctl --timeout=10 --db unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock --wait=sb --format=csv --no-headings --data=bare --colum=_uuid,name,acl find port_group name=lxd_net2: exit status 1 (ovn-nbctl: unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory))" network=default project=default
```
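Since the error points at the hostfs socket path rather than the MicroOVN one discussed later in this thread, a quick way to see which northbound DB socket actually exists on a member is something like the sketch below (both paths are taken from the thread; the `check_sock` helper is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: report whether a given OVN NB DB socket exists.
check_sock() {
    if [ -S "$1" ]; then
        echo "present: $1"
    else
        echo "missing: $1"
    fi
}

# Path LXD is using according to the error message above:
check_sock /var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock
# Path the MicroOVN snap provides on ovn-central members:
check_sock /var/snap/microovn/common/run/ovn/ovnnb_db.sock
```

On an affected member like `m4` you would expect both to report `missing`, matching the "No such file or directory" in the log.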
@roosterfish whenever you're reporting a (potential) cross-snap issue (or really any time you're reporting a MicroCloud issue) it would be useful to see the output of `snap list` on each server, so we can get a view of precisely which snap revisions of microcloud, lxd, microceph and microovn are installed. Thanks
For now, a workaround is to reload the LXD daemon on the affected cluster member using `systemctl reload snap.lxd.daemon`. Afterwards the network reports the status `Created` and can be used accordingly.
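Spelled out, the workaround plus a check of the result looks roughly like this (member name `m4` and network name `default` taken from the report; the `network_status` helper is just an illustrative awk one-liner, not an existing command):

```shell
#!/bin/sh
# Reload the LXD daemon on the affected cluster member (run on m4):
sudo systemctl reload snap.lxd.daemon

# Illustrative helper: extract the "status:" field from `lxc network show`.
network_status() {
    lxc network show "$1" --target "$2" | awk '/^status:/ {print $2}'
}

# After the reload this should print "Created" instead of "Unavailable".
network_status default m4
```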
@roosterfish what do the LXD logs show for the error/reason for the network not being startable?
> @roosterfish what do the LXD logs show for the error/reason for the network not being startable?

I have updated the description.
@roosterfish @masnax the `unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock` reference suggests LXD was started before MicroOVN was installed, as it doesn't seem to be using the MicroOVN socket location. Is that right @masnax?
I managed to reproduce this too. I'm having a look at it. I have one question though: for a 4-node configuration, we have 3 `ovn-central` services anyway to guarantee OVN HA (on m1, m2, m3, each with a `/var/snap/microovn/common/run/ovn/ovnnb_db.sock` file), so the fourth node is not supposed to have an `ovnnb_db.sock` anyway, right (`m4` only runs an `ovn-chassis` and an `ovn-switch`)? Is that right @tomponline?
Looks like this is an issue with LXD cluster joins. It seems joining a cluster after the fact using `MemberConfig` sets up OVN differently than the initial creation of the cluster does. I'm able to reproduce this only when adding nodes to an existing cluster, whereas initializing the whole cluster at that size with the same nodes results in the network working fine.
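For reference, the reproduction described in this thread boils down to roughly the following steps (instance name `v1` and member names from the report; which member each command runs on is an assumption):

```shell
#!/bin/sh
# On an existing member of a 3-node cluster: add m4 as a new member.
microcloud add

# Move an existing instance to the freshly joined member and start it:
lxc mv v1 --target m4
lxc start v1   # fails: Network "default" unavailable on this server
```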
I'm still trying to figure out exactly what LXD is doing, but what I've gathered from the request payloads so far is that when creating the networks on init, the payloads look like:

```
{NetworkPut:{Config:map[parent:enp6s0] Description:} Name:UPLINK Type:physical}
{NetworkPut:{Config:map[] Description:Uplink for OVN networks} Name:UPLINK Type:physical}
{NetworkPut:{Config:map[parent:enp6s0 volatile.last_state.created:false] Description:} Name:UPLINK Type:physical}
{NetworkPut:{Config:map[network:UPLINK] Description:Default OVN network} Name:default Type:ovn}
{NetworkPut:{Config:map[bridge.mtu:1442 ipv4.address:10.18.8.1/24 ipv4.nat:true ipv6.address:fd42:cbc4:cc49:8d30::1/64 ipv6.nat:true network:UPLINK parent:enp6s0] Description:} Name:default Type:ovn}
```
and when adding a node, they are:

```
{NetworkPut:{Config:map[parent:enp6s0 volatile.last_state.created:false] Description:} Name:UPLINK Type:physical}
{NetworkPut:{Config:map[bridge.mtu:1442 ipv4.address:10.18.8.1/24 ipv4.nat:true ipv6.address:fd42:cbc4:cc49:8d30::1/64 ipv6.nat:true network:UPLINK] Description:} Name:default Type:ovn}
```
The main difference is that the `parent` config field is set for the `default` network when initializing the cluster, but that's not a valid key for the `ovn` network type anyway.
Hm, indeed that actually was it. If that final payload has `parent=enp6s0` set, then the network forms properly.