MicroOVN networking issues
benoitjpnet opened this issue · 5 comments
I have the following cluster:
root@mc10:~# lxc cluster ls
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| NAME |            URL            |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc10 | https://192.168.1.10:8443 | database-leader | x86_64       | default        |             | ONLINE | Fully operational |
|      |                           | database        |              |                |             |        |                   |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc11 | https://192.168.1.11:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc12 | https://192.168.1.12:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
root@mc10:~#
I start one container, c1:
lxc launch ubuntu:22.04 c1 --target mc10
I try an IPv6 ping:
root@mc10:~# lxc exec c1 -- ping gtw6.benoit.jp.net -c5
PING gtw6.benoit.jp.net(gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a)) 56 data bytes
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=1 ttl=51 time=7.25 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=2 ttl=51 time=6.79 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=3 ttl=51 time=7.17 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=4 ttl=51 time=6.74 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=5 ttl=51 time=6.88 ms
--- gtw6.benoit.jp.net ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 6.743/6.966/7.253/0.207 ms
root@mc10:~#
It works.
I move c1 to host mc11:
lxc stop c1 && lxc move c1 --target mc11 && lxc start c1
I try an IPv6 ping:
root@mc10:~# lxc exec c1 -- ping gtw6.benoit.jp.net -c5
PING gtw6.benoit.jp.net(gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a)) 56 data bytes
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=1 ttl=51 time=8.62 ms
--- gtw6.benoit.jp.net ping statistics ---
5 packets transmitted, 1 received, 80% packet loss, time 4067ms
rtt min/avg/max/mdev = 8.624/8.624/8.624/0.000 ms
Issue: only one ping gets an answer; all the others are lost.
I move c1 to host mc12:
lxc stop c1 && lxc move c1 --target mc12 && lxc start c1
I try an IPv6 ping:
root@mc10:~# lxc exec c1 -- ping gtw6.benoit.jp.net -c5
PING gtw6.benoit.jp.net(gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a)) 56 data bytes
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=1 ttl=51 time=15.1 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=2 ttl=51 time=7.57 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=3 ttl=51 time=6.75 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=4 ttl=51 time=6.31 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=5 ttl=51 time=6.95 ms
--- gtw6.benoit.jp.net ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 6.314/8.541/15.122/3.315 ms
root@mc10:~#
It works.
Conclusion: it seems the OVN setup on mc11 has some issue. Alas, I have no experience at all with OVN to debug it.
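For reference, a first thing to check might be which chassis currently hosts the router's gateway port, since that is where the SNAT happens. This is an untested sketch and assumes the microovn.ovn-sbctl wrapper exists alongside the microovn.ovn-nbctl one used further down:

microovn.ovn-sbctl show
microovn.ovn-nbctl ha-chassis-group-list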
EDIT: Same issue with IPv4.
It seems that the problem disappeared after rebooting the three nodes. Closing for now.
I am having issues again, this time on the local network.
From a container or a VM, if I ping my upstream gateway, I get only one ping answer.
root@mc10:~# lxc shell v1
root@v1:~# ping 192.168.1.1 -c2
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=1 ttl=63 time=2.92 ms
--- 192.168.1.1 ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 1001ms
rtt min/avg/max/mdev = 2.920/2.920/2.920/0.000 ms
On the active chassis, I can see that the sequence 1 packet first leaves the VM as 10.121.231.8, is then SNATed to the external IP 192.168.1.20, the answer comes back to 192.168.1.20 and is un-SNATed back to 10.121.231.8. So everything makes sense and works.
But for the sequence 2 packet, you can see it is just stuck at "tap6f19bcd4 P IP 10.121.231.8 > 192.168.1.1"; no SNAT is done by the OVN router.
10:54:31.144418 tap6f19bcd4 P IP 10.121.231.8 > 192.168.1.1: ICMP echo request, id 17, seq 1, length 64
10:54:31.145623 enp2s0 Out IP 192.168.1.20 > 192.168.1.1: ICMP echo request, id 17, seq 1, length 64
10:54:31.146062 enp2s0 P IP 192.168.1.1 > 192.168.1.20: ICMP echo reply, id 17, seq 1, length 64
10:54:31.147009 tap6f19bcd4 Out IP 192.168.1.1 > 10.121.231.8: ICMP echo reply, id 17, seq 1, length 64
10:54:32.145718 tap6f19bcd4 P IP 10.121.231.8 > 192.168.1.1: ICMP echo request, id 17, seq 2, length 64
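Since the first request is SNATed and the second one never leaves the router, the OVS conntrack table on the active chassis might be worth dumping, roughly like this (untested sketch; the microovn.ovs-appctl wrapper is assumed to ship alongside the OVN ones):

# run on the chassis currently holding the gateway port
microovn.ovs-appctl dpctl/dump-conntrack | grep 192.168.1.1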
microovn.ovn-nbctl show
switch 3a4d49f4-a505-4817-8cb5-83f382c4cee7 (lxd-net2-ls-int)
    port lxd-net2-instance-25e871a1-8c3e-495e-ac36-ddee696d0995-eth0
        addresses: ["00:16:3e:03:46:c8 dynamic"]
    port lxd-net2-instance-731050ea-b6b3-4a94-a452-7913d8f0e2d7-eth0
        addresses: ["00:16:3e:e8:db:86 dynamic"]
    port lxd-net2-instance-4c640657-a7f3-439b-a0bd-ad3aabbc638d-eth0
        addresses: ["00:16:3e:f6:32:52 dynamic"]
    port lxd-net2-instance-07d158de-64a2-49c0-b23d-758eb192278d-eth0
        addresses: ["00:16:3e:7a:73:d3 dynamic"]
    port lxd-net2-ls-int-lsp-router
        type: router
        router-port: lxd-net2-lr-lrp-int
    port lxd-net2-instance-937b4846-9c69-4294-ae8f-605b598be840-eth0
        addresses: ["00:16:3e:c9:89:a3 dynamic"]
    port lxd-net2-instance-f6747b61-e223-43ae-b2ca-132c370acd6b-eth0
        addresses: ["00:16:3e:95:65:f6 dynamic"]
    port lxd-net2-instance-3d3f92dd-8df8-442d-a064-5fa4c48b77da-eth0
        addresses: ["00:16:3e:d5:8f:90 dynamic"]
    port lxd-net2-instance-346906da-3f0e-488a-8ad8-2e888e5f1438-eth0
        addresses: ["00:16:3e:e5:f4:2c dynamic"]
    port lxd-net2-instance-7845d47a-9adb-4d83-a73a-971fffa770f2-eth0
        addresses: ["00:16:3e:5d:db:e3 dynamic"]
    port lxd-net2-instance-aef52f75-4505-44fa-ac00-7fe4e951bc09-eth0
        addresses: ["00:16:3e:c3:43:0b dynamic"]
switch 1c44bdf2-0ee6-4da8-bec4-f2547b417e3a (lxd-net2-ls-ext)
    port lxd-net2-ls-ext-lsp-router
        type: router
        router-port: lxd-net2-lr-lrp-ext
    port lxd-net2-ls-ext-lsp-provider
        type: localnet
        addresses: ["unknown"]
router 295793b3-03cf-4686-84c2-4b8e10c8d15d (lxd-net2-lr)
    port lxd-net2-lr-lrp-int
        mac: "00:16:3e:c3:25:e2"
        networks: ["10.121.231.1/24", "fd42:4a26:3578:a318::1/64"]
    port lxd-net2-lr-lrp-ext
        mac: "00:16:3e:c3:25:e2"
        networks: ["192.168.1.20/24", "2001:f71:2500:2d00:216:3eff:fec3:25e2/64"]
    nat 78514acb-b7da-4a5c-9006-8c6b6c9be7f3
        external ip: "192.168.1.20"
        logical ip: "10.121.231.0/24"
        type: "snat"
    nat e5afa809-c358-4e7e-aafb-db6f24d796f5
        external ip: "2001:f71:2500:2d00:216:3eff:fec3:25e2"
        logical ip: "fd42:4a26:3578:a318::/64"
        type: "snat"
I have no idea how to debug this really weird issue, and even after a reboot it is not working any more.
This makes my MicroCloud cluster unusable.
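For completeness, an ovn-trace of the echo request might show at which logical stage the second packet is dropped. Untested sketch with assumptions: the microovn.ovn-trace wrapper is assumed to exist like the other ones, and the inport / eth.src values are placeholders for the logical port and MAC of the instance that owns 10.121.231.8 (eth.dst is the internal router port MAC from the output above):

microovn.ovn-trace lxd-net2-ls-int \
  'inport=="lxd-net2-instance-<uuid>-eth0" && eth.src==<instance-mac> && eth.dst==00:16:3e:c3:25:e2 && ip4.src==10.121.231.8 && ip4.dst==192.168.1.1 && ip.ttl==64 && icmp4.type==8'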
Please provide the output of snap list on each member.
root@mc10:~# snap list
Name        Version                 Rev    Tracking       Publisher   Notes
core22      20240111                1122   latest/stable  canonical✓  base
lxd         5.20-f3dd836            27049  latest/stable  canonical✓  in-cohort,held
microceph   0+git.4a608fc           793    quincy/stable  canonical✓  in-cohort,held
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  in-cohort,held
microovn    22.03.3+snap0e23a0e4f5  395    22.03/stable   canonical✓  in-cohort,held
snapd       2.61.2                  21184  latest/stable  canonical✓  snapd
root@mc11:~# snap list
Name        Version                 Rev    Tracking       Publisher   Notes
core22      20240111                1122   latest/stable  canonical✓  base
lxd         5.20-f3dd836            27049  latest/stable  canonical✓  in-cohort,held
microceph   0+git.4a608fc           793    quincy/stable  canonical✓  in-cohort,held
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  in-cohort,held
microovn    22.03.3+snap0e23a0e4f5  395    22.03/stable   canonical✓  in-cohort,held
snapd       2.61.2                  21184  latest/stable  canonical✓  snapd
root@mc12:~# snap list
Name        Version                 Rev    Tracking       Publisher   Notes
core22      20240111                1122   latest/stable  canonical✓  base
lxd         5.20-f3dd836            27049  latest/stable  canonical✓  in-cohort,held
microceph   0+git.4a608fc           793    quincy/stable  canonical✓  in-cohort,held
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  in-cohort,held
microovn    22.03.3+snap0e23a0e4f5  395    22.03/stable   canonical✓  in-cohort,held
snapd       2.61.2                  21184  latest/stable  canonical✓  snapd
Alright, the problem went away by itself again... Weird, but closing for now.