Nordix/xcluster

ECMP does not work with kernels later than Linux-5.4.35

Closed · 15 comments

With ECMP, packets in the same "flow" (same addresses (L3), or same addresses and ports (L4)) should be directed to the same target, but it seems that stray packets are directed to another target. This causes, for instance, RSTs to be sent on TCP connections.

Since nobody has reported this as a kernel bug, it may be an xcluster problem, but I can't really see how. I think I have reported it to the kernel mailing list and/or bugzilla, but I really don't remember.

Reproduce

Remove the hard-coded kver setting, or apply #40.

In ovl/load-balancer:

cdo load-balancer
__nrouters=1 ./load-balancer.sh test start_ecmp > $log
# On vm-201 (the one router vm)
cat /proc/sys/net/ipv4/fib_multipath_hash_policy   # (0 = L3)
tcpdump -eni eth1 -w /tmp/vm-201.pcap tcp port 5001   # (optionally)
# On vm-221 (the tester)
mconnect -address 10.0.0.0:5001 -nconn 10   # works
mconnect -address 10.0.0.0:5001 -nconn 10 -srccidr 50.0.0.0/16 # fails
# On vm-201 (the one router vm)
echo 1 > /proc/sys/net/ipv4/fib_multipath_hash_policy  # L4 hash
mconnect -address 10.0.0.0:5001 -nconn 10  # NOTE, this works on vm-201 with direct access, but ...
# On vm-221
mconnect -address 10.0.0.0:5001 -nconn 10  # ... it fails from vm-221 when packets are forwarded!

Kernel doc

fib_multipath_hash_policy - INTEGER
        Controls which hash policy to use for multipath routes. Only valid
        for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled.

        Default: 0 (Layer 3)

        Possible values:

        - 0 - Layer 3
        - 1 - Layer 4
        - 2 - Layer 3 or inner Layer 3 if present
        - 3 - Custom multipath hash. Fields used for multipath hash calculation
          are determined by fib_multipath_hash_fields sysctl

fib_multipath_hash_fields - UNSIGNED INTEGER
        When fib_multipath_hash_policy is set to 3 (custom multipath hash), the
        fields used for multipath hash calculation are determined by this
        sysctl.

        This value is a bitmask which enables various fields for multipath hash
        calculation.

        Possible fields are:

        ====== ============================
        0x0001 Source IP address
        0x0002 Destination IP address
        0x0004 IP protocol
        0x0008 Unused (Flow Label)
        0x0010 Source port
        0x0020 Destination port
        0x0040 Inner source IP address
        0x0080 Inner destination IP address
        0x0100 Inner IP protocol
        0x0200 Inner Flow Label
        0x0400 Inner source port
        0x0800 Inner destination port
        ====== ============================

        Default: 0x0007 (source IP, destination IP and IP protocol)
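As a small illustration of the bitmask described above, the following sketch ORs together the bits for a 5-tuple hash (addresses, protocol and ports). The variable names are made up for this example; only the bit values come from the table. The writes to /proc are left commented out, since they need root and a kernel that supports policy 3.

```shell
# Bit values copied from the fib_multipath_hash_fields table.
# The variable names are illustrative, not kernel identifiers.
SRC_IP=0x0001
DST_IP=0x0002
IP_PROTO=0x0004
SRC_PORT=0x0010
DST_PORT=0x0020

# OR the selected bits together and format the result as hex.
mask=$(printf '0x%04x' $(( $SRC_IP | $DST_IP | $IP_PROTO | $SRC_PORT | $DST_PORT )))
echo "$mask"

# To apply on the router (assumption: root, and a kernel with policy 3):
# echo 3 > /proc/sys/net/ipv4/fib_multipath_hash_policy
# echo "$mask" > /proc/sys/net/ipv4/fib_multipath_hash_fields
```

With only the first three bits set, the same calculation gives the documented default 0x0007.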

The trace vm-201.pcap.gz has 3 failing reads. For instance, access from 50.0.61.223:35913 is directed to vm-004, but in packet 18 an ACK on that flow is directed to vm-001, which responds with a RST.
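One possible way to find such flows in a capture is to look for a source/destination pair that shows up with more than one destination MAC. The helper below is a hypothetical sketch, not part of xcluster. It assumes the text output of `tcpdump -e -n`, with lines of the form `TIME SRCMAC > DSTMAC, ethertype IPv4 (0x0800), length N: SRC > DST: ...`; other link types or tcpdump options would need different field positions.

```shell
# split_flows (hypothetical helper): read `tcpdump -e -n` text on stdin and
# print every flow "SRC > DST" that was seen with more than one destination MAC.
split_flows() {
    awk '$3 == ">" && $11 == ">" {
        mac = $4; sub(/,$/, "", mac)                 # destination MAC, strip the comma
        flow = $10 " > " $12; sub(/:$/, "", flow)    # "ip.port > ip.port", strip the colon
        if (!((flow, mac) in seen)) { seen[flow, mac] = 1; n[flow]++ }
    }
    END { for (f in n) if (n[f] > 1) print f }'
}
```

Assuming the pcap from the reproduce steps, it could be used as `tcpdump -e -n -r /tmp/vm-201.pcap | split_flows`. An empty result would mean that no flow was split across targets in that capture.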

When testing manually, e.g. with "nc 10.0.0.0:5001" from vm-221, the problem doesn't seem to occur. Also, when making just a few connections, there seems to be some sort of "affinity" to one target.

# On vm-201
echo 1 > /proc/sys/net/ipv4/fib_multipath_hash_policy
# On vm-221
vm-221 ~ # mconnect -address 10.0.0.0:5001 -nconn 4
Failed connects; 0
Failed reads; 0
vm-003 4

IMO, the distribution should be more random. When connections actually are distributed over several targets, the chance of encountering the ECMP problem is high.

Seems an interesting problem, will take a look at this.

I will propose a solution to netdev for this, but I need a way to run bash for kselftest. Any suggestions on the best way to get bash on an xcluster vm?

In an ovl/tar file:

$XCLUSTER install_prog --dest=$tmp bash

I haven't tested, but I suggest an ovl/bash, so you can include it when needed.

I need /bin/sh to point to bash and start with bash for some tests in kselftest to work.

Here is the bash ovl I just pushed: 93ee791
I built from source for now. Let me know if you think install_prog is better; then I can remove the source build.

Excellent work 👍 I am surprised, though, that it has gone undetected for so long.

Thanks, writing the selftests was the hard part.

Perhaps most people use a home-written userspace load balancer rather than Linux ECMP.

Time to upgrade to linux-6.5.4 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.5.y
I will do that later today.