canonical/microcloud

MicroOVN networking issues

benoitjpnet opened this issue · 5 comments

I have the following cluster:

root@mc10:~# lxc cluster ls
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| NAME |            URL            |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc10 | https://192.168.1.10:8443 | database-leader | x86_64       | default        |             | ONLINE | Fully operational |
|      |                           | database        |              |                |             |        |                   |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc11 | https://192.168.1.11:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc12 | https://192.168.1.12:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
root@mc10:~# 

I start one container, c1:

lxc launch ubuntu:22.04 c1 --target mc10
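
To confirm the container came up with an address before the ping test, something like this can be used:

lxc list c1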

I try an IPv6 ping:

root@mc10:~# lxc exec c1 -- ping gtw6.benoit.jp.net -c5
PING gtw6.benoit.jp.net(gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a)) 56 data bytes
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=1 ttl=51 time=7.25 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=2 ttl=51 time=6.79 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=3 ttl=51 time=7.17 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=4 ttl=51 time=6.74 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=5 ttl=51 time=6.88 ms

--- gtw6.benoit.jp.net ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 6.743/6.966/7.253/0.207 ms
root@mc10:~# 

It works.

I move it to host mc11:

lxc stop c1 && lxc move c1 --target mc11 && lxc start c1
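
To confirm the instance actually moved, its location can be checked, for example:

lxc info c1 | grep -i location

In a clustered LXD, lxc info includes a Location: field showing the member the instance runs on.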

I try an IPv6 ping:

root@mc10:~# lxc exec c1 -- ping gtw6.benoit.jp.net -c5
PING gtw6.benoit.jp.net(gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a)) 56 data bytes
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=1 ttl=51 time=8.62 ms

--- gtw6.benoit.jp.net ping statistics ---
5 packets transmitted, 1 received, 80% packet loss, time 4067ms
rtt min/avg/max/mdev = 8.624/8.624/8.624/0.000 ms

Issue: only the first ping gets an answer; all the others are lost.

I move it to host mc12:

lxc stop c1 && lxc move c1 --target mc12 && lxc start c1

I try an IPv6 ping:

root@mc10:~# lxc exec c1 -- ping gtw6.benoit.jp.net -c5
PING gtw6.benoit.jp.net(gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a)) 56 data bytes
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=1 ttl=51 time=15.1 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=2 ttl=51 time=7.57 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=3 ttl=51 time=6.75 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=4 ttl=51 time=6.31 ms
64 bytes from gtw.benoit.jp.net (2001:19f0:7001:3c3f:5400:4ff:fe2b:a59a): icmp_seq=5 ttl=51 time=6.95 ms

--- gtw6.benoit.jp.net ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 6.314/8.541/15.122/3.315 ms
root@mc10:~# 

It works.

Conclusion: it seems the OVN setup on mc11 has some issues. Alas, I have no experience at all with OVN to debug it.

EDIT: Same issue with IPv4.
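
For anyone wanting to dig further, a minimal first check (assuming the standard MicroOVN snap aliases are available) is whether each node has Geneve tunnels to the other two members:

microovn.ovs-vsctl show

Each chassis should show geneve ports with remote_ip set to the other members' addresses; a missing or broken tunnel on mc11 could explain loss that only affects instances running on that node.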

It seems that the problem disappeared after rebooting the three nodes. Closing for now.

I am having issues again, this time on the local network.

In a container or a VM, if I ping my upstream gateway, I get only one ping answer.

root@mc10:~# lxc shell v1
root@v1:~# ping 192.168.1.1 -c2
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=1 ttl=63 time=2.92 ms

--- 192.168.1.1 ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 1001ms
rtt min/avg/max/mdev = 2.920/2.920/2.920/0.000 ms

On the active chassis, I can see that the seq 1 packet first leaves the VM as 10.121.231.8, is then SNATed to the external IP 192.168.1.20, and the answer comes back to 192.168.1.20 and is de-NATed to 10.121.231.8. So everything makes sense and works.
But for the seq 2 packet, you can see it is stuck at tap6f19bcd4 P IP 10.121.231.8 > 192.168.1.1; no SNAT is done by the OVN router.

10:54:31.144418 tap6f19bcd4 P   IP 10.121.231.8 > 192.168.1.1: ICMP echo request, id 17, seq 1, length 64
10:54:31.145623 enp2s0 Out IP 192.168.1.20 > 192.168.1.1: ICMP echo request, id 17, seq 1, length 64
10:54:31.146062 enp2s0 P   IP 192.168.1.1 > 192.168.1.20: ICMP echo reply, id 17, seq 1, length 64
10:54:31.147009 tap6f19bcd4 Out IP 192.168.1.1 > 10.121.231.8: ICMP echo reply, id 17, seq 1, length 64
10:54:32.145718 tap6f19bcd4 P   IP 10.121.231.8 > 192.168.1.1: ICMP echo request, id 17, seq 2, length 64
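
For reference, a capture like the one above can be taken on the chassis with something along these lines (the exact invocation is an assumption; interface names will differ per host):

tcpdump -ni any icmp

With -i any, recent tcpdump prints the capturing interface and a direction marker (P/In/Out) in front of each packet, which is what makes the missing SNAT hop visible here.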
microovn.ovn-nbctl show
switch 3a4d49f4-a505-4817-8cb5-83f382c4cee7 (lxd-net2-ls-int)
    port lxd-net2-instance-25e871a1-8c3e-495e-ac36-ddee696d0995-eth0
        addresses: ["00:16:3e:03:46:c8 dynamic"]
    port lxd-net2-instance-731050ea-b6b3-4a94-a452-7913d8f0e2d7-eth0
        addresses: ["00:16:3e:e8:db:86 dynamic"]
    port lxd-net2-instance-4c640657-a7f3-439b-a0bd-ad3aabbc638d-eth0
        addresses: ["00:16:3e:f6:32:52 dynamic"]
    port lxd-net2-instance-07d158de-64a2-49c0-b23d-758eb192278d-eth0
        addresses: ["00:16:3e:7a:73:d3 dynamic"]
    port lxd-net2-ls-int-lsp-router
        type: router
        router-port: lxd-net2-lr-lrp-int
    port lxd-net2-instance-937b4846-9c69-4294-ae8f-605b598be840-eth0
        addresses: ["00:16:3e:c9:89:a3 dynamic"]
    port lxd-net2-instance-f6747b61-e223-43ae-b2ca-132c370acd6b-eth0
        addresses: ["00:16:3e:95:65:f6 dynamic"]
    port lxd-net2-instance-3d3f92dd-8df8-442d-a064-5fa4c48b77da-eth0
        addresses: ["00:16:3e:d5:8f:90 dynamic"]
    port lxd-net2-instance-346906da-3f0e-488a-8ad8-2e888e5f1438-eth0
        addresses: ["00:16:3e:e5:f4:2c dynamic"]
    port lxd-net2-instance-7845d47a-9adb-4d83-a73a-971fffa770f2-eth0
        addresses: ["00:16:3e:5d:db:e3 dynamic"]
    port lxd-net2-instance-aef52f75-4505-44fa-ac00-7fe4e951bc09-eth0
        addresses: ["00:16:3e:c3:43:0b dynamic"]
switch 1c44bdf2-0ee6-4da8-bec4-f2547b417e3a (lxd-net2-ls-ext)
    port lxd-net2-ls-ext-lsp-router
        type: router
        router-port: lxd-net2-lr-lrp-ext
    port lxd-net2-ls-ext-lsp-provider
        type: localnet
        addresses: ["unknown"]
router 295793b3-03cf-4686-84c2-4b8e10c8d15d (lxd-net2-lr)
    port lxd-net2-lr-lrp-int
        mac: "00:16:3e:c3:25:e2"
        networks: ["10.121.231.1/24", "fd42:4a26:3578:a318::1/64"]
    port lxd-net2-lr-lrp-ext
        mac: "00:16:3e:c3:25:e2"
        networks: ["192.168.1.20/24", "2001:f71:2500:2d00:216:3eff:fec3:25e2/64"]
    nat 78514acb-b7da-4a5c-9006-8c6b6c9be7f3
        external ip: "192.168.1.20"
        logical ip: "10.121.231.0/24"
        type: "snat"
    nat e5afa809-c358-4e7e-aafb-db6f24d796f5
        external ip: "2001:f71:2500:2d00:216:3eff:fec3:25e2"
        logical ip: "fd42:4a26:3578:a318::/64"
        type: "snat"

I have no idea how to debug this really weird issue, and even after a reboot it is not working anymore.
This makes my MicroCloud cluster unusable.

Please provide the output of snap list on each member.

root@mc10:~# snap list
Name        Version                 Rev    Tracking       Publisher   Notes
core22      20240111                1122   latest/stable  canonical✓  base
lxd         5.20-f3dd836            27049  latest/stable  canonical✓  in-cohort,held
microceph   0+git.4a608fc           793    quincy/stable  canonical✓  in-cohort,held
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  in-cohort,held
microovn    22.03.3+snap0e23a0e4f5  395    22.03/stable   canonical✓  in-cohort,held
snapd       2.61.2                  21184  latest/stable  canonical✓  snapd
root@mc11:~# snap list
Name        Version                 Rev    Tracking       Publisher   Notes
core22      20240111                1122   latest/stable  canonical✓  base
lxd         5.20-f3dd836            27049  latest/stable  canonical✓  in-cohort,held
microceph   0+git.4a608fc           793    quincy/stable  canonical✓  in-cohort,held
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  in-cohort,held
microovn    22.03.3+snap0e23a0e4f5  395    22.03/stable   canonical✓  in-cohort,held
snapd       2.61.2                  21184  latest/stable  canonical✓  snapd
root@mc12:~# snap list
Name        Version                 Rev    Tracking       Publisher   Notes
core22      20240111                1122   latest/stable  canonical✓  base
lxd         5.20-f3dd836            27049  latest/stable  canonical✓  in-cohort,held
microceph   0+git.4a608fc           793    quincy/stable  canonical✓  in-cohort,held
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  in-cohort,held
microovn    22.03.3+snap0e23a0e4f5  395    22.03/stable   canonical✓  in-cohort,held
snapd       2.61.2                  21184  latest/stable  canonical✓  snapd

Alright, the problem went away by itself again... Weird, but closing for now.