vswitchd connection to local ovsdb hangs and drops after upgrade to 3.3.0
Closed this issue · 2 comments
Hi,
Recently upgraded from 2.17 to 3.3.0, It seems vswitchd connection to local ovsdb drops, and it is blocked from adding flows, for example when you live migrate a VM onto a chassis, vswitchd logs(VM has two ports):
2024-10-13T14:21:53.880Z|04678|bridge|INFO|bridge br-int: added interface tap8100c9fa-f1 on port 799
2024-10-13T14:22:01.181Z|04679|bridge|INFO|bridge br-int: added interface tap5a041216-53 on port 800
2024-10-13T14:22:04.132Z|04680|connmgr|INFO|br-int<->unix#33540: 559 flow_mods 10 s ago (507 adds, 52 deletes)
2024-10-13T14:23:19.113Z|04681|rconn|ERR|br-int<->unix#33541: no response to inactivity probe after 60 seconds, disconnecting
2024-10-13T14:23:54.228Z|04682|rconn|ERR|br-int<->unix#33540: no response to inactivity probe after 60 seconds, disconnecting
2024-10-13T14:24:22.426Z|04683|connmgr|INFO|br-int<->unix#33575: 57976 flow_mods 10 s ago (57580 adds, 396 deletes)
After 2 - 3 minutes not changing anything it overloads and does a 57K change!
Normal logs on vswitchd are like this(another chassis which does not show the symptoms):
2024-10-13T14:01:49.945Z|04458|bridge|INFO|bridge br-int: added interface tap3deb1c91-f9 on port 888
2024-10-13T14:01:56.805Z|04459|bridge|INFO|bridge br-int: added interface tapad3287d4-21 on port 889
2024-10-13T14:02:25.415Z|04460|connmgr|INFO|br-int<->unix#20: 560 flow_mods in the 4 s starting 35 s ago (508 adds, 52 deletes)
2024-10-13T14:03:44.898Z|04461|connmgr|INFO|br-int<->unix#20: 3684 flow_mods in the 2 s starting 10 s ago (3231 adds, 451 deletes, 2 modifications)
2024-10-13T14:05:18.585Z|04462|connmgr|INFO|br-int<->unix#20: 387 flow_mods 10 s ago (387 adds)
It has still a wide gap in seconds but I guess it is ok and I have a live migrate ping packet loss span from 1s to 4s which is acceptable.
Also on ovsdb on info vlog it does not show anything at all, enabling dbg on ovsdb shows some logs:
2024-10-13T14:21:53.855Z|00526|reconnect|DBG|unix#50485: connecting...
2024-10-13T14:21:53.855Z|00527|reconnect|DBG|unix#50485: entering CONNECTING
2024-10-13T14:21:53.855Z|00528|reconnect|DBG|unix#50485: connected
2024-10-13T14:21:53.855Z|00529|reconnect|DBG|unix#50485: entering ACTIVE
2024-10-13T14:21:53.855Z|00530|jsonrpc|DBG|unix#50485: received request, method="get_schema", params=["_Server"], id=1
...
2024-10-13T14:21:53.857Z|00536|jsonrpc|DBG|unix#50485: received request, method="monitor_cond_since", params=["Open_vSwitch",["monid","Open_vSwitch"],{"Bridge":[{"columns":["controller","fail_mode","name
","ports"]}],"Controller":[{"columns":[]}],"Interface":[{"columns":["error","external_ids","name","ofport"]}],"Open_vSwitch":[{"columns":["bridges","cur_cfg"]}],"Port":[{"columns":["fake_bridge","interfaces","name","tag"]}]},"00000000-0000-0000-0000-000000000000"], id=3
...
2024-10-13T14:21:53.863Z|00539|jsonrpc|DBG|unix#50485: received request, method="set_db_change_aware", params=[true], id=4
2024-10-13T14:21:53.863Z|00540|jsonrpc|DBG|unix#50485: send reply, result={}, id=4
...
logs are very long and I am attaching a more detailed file, no visible bug or error.
I was thinking maybe a bug fix was added to 3.3.2 but reading commit messages I did not see anything related:
openvswitch/ovs@v3.3.0...v3.3.2
Also output of Open_vswitch table:
# ovs-vsctl list open
_uuid : dfc468f1-57ac-4c8d-aa3a-eb933ee8592b
bridges : [80d9cd4b-7fec-4c11-8260-1938774e4399, ee4f565a-24a4-4be4-954f-f2ccaa40831e]
cur_cfg : 32796
datapath_types : [netdev, system]
datapaths : {system=722ede64-360d-4bbe-81f5-f26a2136e159}
db_version : []
dpdk_initialized : false
dpdk_version : none
external_ids : {ovn-bridge-mappings="physnet1:br-ex", ovn-chassis-mac-mappings="physnet1:52:54:00:1b:84:cc", ovn-encap-ip="****", ovn-encap-type=geneve, ovn-monitor-all="false", ovn-ofctrl-wait-before-clear="8000", ovn-openflow-probe-interval="60", ovn-remote="*****", ovn-remote-probe-interval="60000", system-id="6389c8c0-d030-5346-aa41-cd6e3ff3545f"}
iface_types : [afxdp, afxdp-nonpmd, bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, srv6, stt, system, tap, vxlan]
manager_options : []
next_cfg : 32796
other_config : {ovn-chassis-idx-6389c8c0-d030-5346-aa41-cd6e3ff3545f="", vlan-limit="0"}
ovs_version : []
ssl : []
statistics : {}
system_type : []
system_version : []
[logovsdb.txt](https://github.com/user-attachments/files/17356107/logovsdb.txt)
This is not repeating everywhere but on one production I have 50% of my live migrations hitting this issue, could not recreate it on stage!
Do you have any guess what is wrong and what is the cause maybe?
br-int<->unix
is an OpenFlow connection, not the database connection. Looks like your ovn-controller
doesn't respond for a long time to OpenFlow probes. You need to check what it is doing so long.
Yes it is in ovn controller:
2024-10-15T09:18:45.469Z|05297|inc_proc_eng|INFO|node: physical_flow_output, handler for input if_status_mgr took 94629ms
2024-10-15T09:18:45.711Z|05298|memory_trim|INFO|Detected inactivity (last active 96260 ms ago): trimming memory
2024-10-15T09:18:45.713Z|05299|timeval|WARN|Unreasonably long 94879ms poll interval (93459ms user, 1292ms system)
Thanks Ilya, I will close this issue