canonical/microcloud

Launching instances fails with `ovn-nbctl: database connection failed`

Closed this issue · 9 comments

Intermittently, when launching VMs and containers on LXD with an OVN network managed through MicroOVN, instance creation fails with the error:

Error: Failed instance creation: Failed creating instance record: Failed initialising instance: 
Failed to add device "eth0": Failed adding DNS record: Failed to run: ovn-nbctl --timeout=10 
--db ssl:10.30.38.55:6641,ssl:10.30.38.17:6641,ssl:10.30.38.149:6641 
-c /etc/ovn/cert_host -p /etc/ovn/key_host -C /etc/ovn/ovn-central.crt --wait=sb create dns 
external_ids:lxd_switch=lxd-net2-ls-int 
external_ids:lxd_switch_port=lxd-net2-instance-21c96e7b-ab51-47d7-8710-ea5671773bf7-eth0: 
exit status 1 (ovn-nbctl: ssl:10.30.38.55:6641,ssl:10.30.38.17:6641,ssl:10.30.38.149:6641: database connection failed ())

I can confirm that LXD is indeed pointing to the correct OVN database. In fact, this can happen randomly with sequential `lxc launch` calls without changing anything else.

Strikes me as some kind of race in LXD or MicroOVN.
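For reference, nothing more elaborate than a sequential launch loop is needed; a rough sketch (the storage pool and network names are just the ones from the `lxc launch` call further down, adjust for your deployment):

for i in $(seq 1 10); do
    # Each launch creates an OVN switch port and DNS record for the instance;
    # creation fails intermittently with the "database connection failed" error above.
    lxc launch ubuntu:22.04 -s local -n default "c${i}"
done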

Potentially related to this, here is another error that showed up when launching instances:

lxc launch ubuntu:22.04 -s local -n default c2
Creating c2
Starting c2
Error: Failed to start device "eth0": Failed setting up OVN port: Failed setting DNS for "c2.lxd": FAILED ON micro04: Failed to run: ovn-nbctl --timeout=60 --db ssl:10.190.3.180:6641,ssl:10.190.3.181:6641,ssl:10.190.3.15:6641 -c /etc/ovn/cert_host -p /etc/ovn/key_host -C /etc/ovn/ovn-central.crt --wait=sb set dns a611cf89-86d8-40ea-9e24-679175faeae7
ae61623b-541d-4cf0-a7cf-9e29247fc5c3 external_ids:lxd_switch=lxd-net2-ls-int external_ids:lxd_switch_port=lxd-net2-instance-38714727-b226-49c3-b7ee-4e21093eca7e-eth0 records={"c2.lxd"="10.178.249.3 fd42:aa24:1c36:570d:216:3eff:fe8c:56d0"}: exit status 1 (ovn-nbctl: no row "a611cf89-86d8-40ea-9e24-679175faeae7
ae61623b-541d-4cf0-a7cf-9e29247fc5c3" in table DNS)

I slightly modified the error message to report `FAILED ON {name of peer}` for the system that actually runs the `ovn-nbctl` command. In this case it appears to be micro04, a node that is not among the three peers passed to the `--db` flag here.

@masnax did this only show up when repeatedly creating, deleting, and re-creating an instance with the same name?

The reason I ask is that this error comes from here:

https://github.com/canonical/lxd/blob/f0d733c3dbcaf5c454617e6a8bbc7dd50fee27ab/lxd/network/openvswitch/ovn.go#L1279-L1317

LXD first queries the `dns` table using the `find` command, which returns the record `a611cf89-86d8-40ea-9e24-679175faeae7`, and then tries to modify it with the `set` command, which fails with `no row "a611cf89-86d8-40ea-9e24-679175faeae7"`.

And yet it found the row just moments before. Weird.
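For anyone following along, the two steps correspond roughly to the ovn-nbctl calls below (the --db list, certificate paths, switch name and UUID are the ones from the error output above; the exact flags LXD passes may differ slightly):

# Step 1: look up the DNS record(s) for the logical switch (the "find").
ovn-nbctl --db ssl:10.190.3.180:6641,ssl:10.190.3.181:6641,ssl:10.190.3.15:6641 \
    -c /etc/ovn/cert_host -p /etc/ovn/key_host -C /etc/ovn/ovn-central.crt \
    --columns=_uuid find dns external_ids:lxd_switch=lxd-net2-ls-int

# Step 2: update the row returned by step 1 (this is the call that failed with
# 'no row ... in table DNS').
ovn-nbctl --db ssl:10.190.3.180:6641,ssl:10.190.3.181:6641,ssl:10.190.3.15:6641 \
    -c /etc/ovn/cert_host -p /etc/ovn/key_host -C /etc/ovn/ovn-central.crt \
    --wait=sb set dns a611cf89-86d8-40ea-9e24-679175faeae7 \
    'records={"c2.lxd"="10.178.249.3 fd42:aa24:1c36:570d:216:3eff:fe8c:56d0"}'

The set in step 2 should only fail if the row found in step 1 disappeared in between, which is what makes this look like a race.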

As OVN is using a clustered database, LXD uses the `--wait=sb` flag, which the manual describes as follows:

With --wait=sb, before ovn-nbctl exits, it waits for ovn-northd to bring the southbound database up-to-date with the northbound database updates.

That is as consistent as we can be without potentially blocking indefinitely if one of the cluster members is down.

Also, the ovn-nbctl man page states:

--leader-only
--no-leader-only
    By default, or with --leader-only, when the database server is a clustered database, ovn-nbctl will avoid servers other than the cluster leader. This ensures that any data that ovn-nbctl reads and reports is up-to-date. With --no-leader-only, ovn-nbctl will use any server in the cluster, which means that for read-only transactions it can report and act on stale data (transactions that modify the database are always serialized even with --no-leader-only). Refer to Understanding Cluster Consistency in ovsdb(7) for more information.

So `--leader-only` is the default that LXD will be using, which again means OVN should be giving us consistent results.
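If it helps with debugging, which member is currently the NB leader can be checked on any of the nodes with ovn-appctl; the control socket path below is the stock OVN location and may sit elsewhere under the MicroOVN snap:

# Show the Raft role ("leader"/"follower") and cluster membership for the northbound DB.
ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound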

So it's very odd that we query for a row, find it, and then try to update it, only for the update to fail saying the row is not there.
Unless there is a concurrent delete of the record happening somehow?
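One way to check that hypothesis would be to watch the NB DNS table for inserts and deletes while launching instances, for example with ovsdb-client (assuming it is available on the node; the endpoint and certificate paths mirror the ones ovn-nbctl uses above):

# Stream every insert/delete/modify on the DNS table while reproducing.
ovsdb-client --private-key=/etc/ovn/key_host \
    --certificate=/etc/ovn/cert_host \
    --ca-cert=/etc/ovn/ovn-central.crt \
    monitor ssl:10.190.3.180:6641 OVN_Northbound DNS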

@tomponline Nope, this was a sequential launch, 3 seconds after launching c1. The only quirk is that I ran this test on the PS5 system in a nested VM, which is a bit slow.

@tomponline would you rather see this issue over at the LXD repo?

@roosterfish is it still an issue?


I am not able to reproduce it. Let's add the question label for as long as we don't have concrete reproducer steps, and the issue can stay here.

We use "incomplete" on lxd so let's be consistent with that

Oh, we don't have that yet. I'll add it.

I'm not able to reproduce this anymore with the large test that spawns tons of containers, so I think it's safe to close this one.