canonical/microcloud

Failed to start instances on new cluster members

Opened this issue · 8 comments

Version

Same versions of the snaps on all cluster members.

root@m1:~# snap list
Name        Version                 Rev    Tracking       Publisher   Notes
core20      20230801                2015   latest/stable  canonical✓  base
core22      20231123                1033   latest/stable  canonical✓  base
lxd         5.19-8635f82            26200  latest/stable  canonical✓  -
microceph   0+git.7b5672b           707    quincy/stable  canonical✓  -
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  -
microovn    22.03.3+snap1d18f95c73  349    22.03/stable   canonical✓  -
snapd       2.60.4                  20290  latest/stable  canonical✓  snapd

Description

After adding a new member to the MicroCloud cluster using microcloud add, existing instances can be moved to the new cluster member but fail to start there:

root@m3:~# lxc mv v1 --target m4
root@m3:~# lxc start v1
Error: Failed pre-start check for device "eth0": Network "default" unavailable on this server
Try `lxc info --show-log v1` for more info

The network's status on the new member is also reported as Unavailable:

root@m1:~# lxc network show default --target m4
config:
  bridge.mtu: "1442"
  ipv4.address: 10.85.238.1/24
  ipv4.nat: "true"
  ipv6.address: fd42:a345:26de:b041::1/64
  ipv6.nat: "true"
  network: UPLINK
  volatile.network.ipv4.address: 10.247.231.100
description: ""
name: default
type: ovn
used_by:
- /1.0/instances/v1
- /1.0/profiles/default
managed: true
status: Unavailable
locations:
- m4
- m1
- m2
- m3

In the logs of m4 you can see the following message every minute:

Dec 12 14:44:31 m4 lxd.daemon[5657]: time="2023-12-12T14:44:31Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"default\" setup: Failed to run: ovn-nbctl --timeout=10 --db unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock --wait=sb --format=csv --no-headings --data=bare --colum=_uuid,name,acl find port_group name=lxd_net2: exit status 1 (ovn-nbctl: unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory))" network=default project=default

@roosterfish whenever you're reporting a (potential) cross-snap issue (or really anytime you're reporting a microcloud issue) it would be useful to see the output of snap list on each server so we can get a view of precisely which snap revisions of microcloud, lxd, microceph and microovn are installed. Thanks

For now, a workaround is to reload the LXD daemon on the affected cluster member using systemctl reload snap.lxd.daemon. Afterwards the network reports the status Created and can be used normally.
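To confirm whether the network has left the Unavailable state on a given member, the status field of the lxc network show output can be checked. The following is a minimal Python sketch, assuming the output has been captured as text; the helper name network_status is hypothetical and not part of LXD.

```python
from typing import Optional


def network_status(show_output: str) -> Optional[str]:
    """Return the top-level `status:` value from `lxc network show` output."""
    for line in show_output.splitlines():
        # Only unindented keys are considered, so nested config entries are ignored.
        if line.startswith("status:"):
            return line.split(":", 1)[1].strip()
    return None


# Excerpt of the output shown in the issue description.
sample = """\
name: default
type: ovn
managed: true
status: Unavailable
"""

print(network_status(sample))  # prints "Unavailable"
```

The text could be captured with something like lxc network show default --target m4, as in the description above.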

@roosterfish what do the LXD logs show for the error/reason for the network not being startable?

I have updated the description.

@roosterfish @masnax the unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock reference suggests LXD was started before microovn was installed, as it doesn't seem to be using the microovn location. Is that right @masnax?
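To check which northbound database socket LXD is trying to use, the --db argument can be pulled out of the logged ovn-nbctl command. The sketch below is a rough illustration in Python; the function names are hypothetical, and treating any path containing /microovn/ as "the microovn location" is an assumption made here for illustration only.

```python
import re
from typing import Optional


def nb_db_target(log_line: str) -> Optional[str]:
    """Extract the value passed to `--db` in a logged ovn-nbctl command."""
    match = re.search(r"--db\s+(\S+)", log_line)
    return match.group(1) if match else None


def looks_like_microovn(target: str) -> bool:
    # Assumption: a MicroOVN-managed socket lives under a path containing "/microovn/".
    return "/microovn/" in target


# Excerpt of the error logged on m4.
line = (
    "Failed to run: ovn-nbctl --timeout=10 "
    "--db unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock --wait=sb --format=csv"
)

target = nb_db_target(line)
print(target)                       # unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock
print(looks_like_microovn(target))  # False
```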

I managed to reproduce this too. I'm having a look at it. I have one question though: for a 4-node configuration, we have 3 ovn-central services anyway to guarantee OVN HA (on m1, m2, m3, each with a /var/snap/microovn/common/run/ovn/ovnnb_db.sock file), so the fourth node is not supposed to have an ovnnb_db.sock anyway, right (m4 only runs an ovn-chassis and an ovn-switch)? Is that right @tomponline?

Looks like this is an issue with LXD cluster joins. It seems joining a cluster after the fact by using MemberConfig sets up ovn differently than the initial creation of the cluster does.

I'm able to reproduce this only when adding nodes to an existing cluster, whereas using the same nodes and initializing the whole cluster at that size results in the network working fine.

I'm still trying to figure out what LXD's doing exactly, but what I've gathered from the request payloads so far is that when creating the network on init, the payloads look like

{NetworkPut:{Config:map[parent:enp6s0] Description:} Name:UPLINK Type:physical}"
{NetworkPut:{Config:map[] Description:Uplink for OVN networks} Name:UPLINK Type:physical}"
{NetworkPut:{Config:map[parent:enp6s0 volatile.last_state.created:false] Description:} Name:UPLINK Type:physical}"
{NetworkPut:{Config:map[network:UPLINK] Description:Default OVN network} Name:default Type:ovn}"
{NetworkPut:{Config:map[bridge.mtu:1442 ipv4.address:10.18.8.1/24 ipv4.nat:true ipv6.address:fd42:cbc4:cc49:8d30::1/64 ipv6.nat:true network:UPLINK parent:enp6s0] Description:} Name:default Type:ovn}"

and when adding a node, it's

{NetworkPut:{Config:map[parent:enp6s0 volatile.last_state.created:false] Description:} Name:UPLINK Type:physical}"
{NetworkPut:{Config:map[bridge.mtu:1442 ipv4.address:10.18.8.1/24 ipv4.nat:true ipv6.address:fd42:cbc4:cc49:8d30::1/64 ipv6.nat:true network:UPLINK] Description:} Name:default Type:ovn}"

The main difference is that the parent config field is set for the default network when initializing the cluster, but that's not a valid key for the ovn network anyway.
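The difference between the two final payloads for the default network can be made explicit by diffing their config maps. This is a small Python sketch with the values copied from the payloads above; the helper config_diff is hypothetical and only illustrates the comparison, not how LXD or MicroCloud handle these requests.

```python
from typing import Dict, Optional, Tuple

# Config of the `default` OVN network when the cluster is initialized.
init_config = {
    "bridge.mtu": "1442",
    "ipv4.address": "10.18.8.1/24",
    "ipv4.nat": "true",
    "ipv6.address": "fd42:cbc4:cc49:8d30::1/64",
    "ipv6.nat": "true",
    "network": "UPLINK",
    "parent": "enp6s0",
}

# Config of the same network when a node is added afterwards.
join_config = {
    "bridge.mtu": "1442",
    "ipv4.address": "10.18.8.1/24",
    "ipv4.nat": "true",
    "ipv6.address": "fd42:cbc4:cc49:8d30::1/64",
    "ipv6.nat": "true",
    "network": "UPLINK",
}


def config_diff(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each key whose value differs to its (a, b) pair; None means absent."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


print(config_diff(init_config, join_config))  # {'parent': ('enp6s0', None)}
```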

Hm, indeed that actually was it. If that final payload has parent=enp6s0 set, then the network forms properly.