newsnowlabs/docker-ingress-routing-daemon

Routing daemon only works on specific machines.

Vaults opened this issue · 12 comments

This issue is quite puzzling. I've run the daemon successfully on a swarm with three manager nodes. However, when a container runs on one particular node, the container is unreachable and requests time out. For a web server, my browser tells me "This site can’t be reached - Y took too long to respond."

My architecture consists of the following machines:

  • Y - A drained manager node that does not accept containers
  • A - A regular manager node
  • U - Another regular manager node

I have not tested running services on Y. Services consistently do not work when A is the only node hosting those specific containers; U always seems to work. I had the same issue with v2.5. The daemon does not give any meaningful errors or messages that might relate to this issue.

I am running the daemon on all three machines with the command line ./docker-ingress-routing-daemon-v3.sh --install --services "dashboard" --tcp-ports "42069" --ingress-gateway-ips "10.255.0.49 10.255.0.179 10.255.28.151" --no-performance. I've triple-checked the gateways.

I've tried:

  • Manually checking TOS values and walking through all of the IP rules the script sets (in v2.5). I could not find any differences between A and U.
  • Restarting the Docker daemon, rebooting the machines, and recreating the services in multiple ways.
  • Checking the machine/ingress/container iptables for rules that might interfere. I could not find any differences there either.

Uninstalling the daemon on all machines restores the expected behaviour: everything works as normal.

Hi @Vaults. Apologies for the slow response; GitHub is not notifying me of new issues for some reason. Please @ me in issues/comments, thanks.

What I suspect could be problematic here is that your ingress IPs are not all on the same /24 subnet, as they would be by default. Have you configured a /16 subnet for your ingress network? I am unsure how the daemon will perform on a subnet larger than a /24, since the TOS used for each load-balancer node is currently derived from the final byte of the ingress network IP, which could (although doesn't seem to be in your case) be the same for two different IPs on any subnet larger than a /24.
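
To illustrate the idea (a rough sketch of the scheme, not the daemon's actual code), a load balancer with ingress IP 10.255.0.26 would end up with TOS 26, i.e. 0x1a:

INGRESS_IP=10.255.0.26          # example load-balancer ingress IP
NODE_TOS=${INGRESS_IP##*.}      # final byte -> 26
printf 'TOS for %s: %d (0x%x)\n' "$INGRESS_IP" "$NODE_TOS" "$NODE_TOS"
# on a /16 ingress network, 10.255.0.26 and 10.255.1.26 would both map to TOS 26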

For avoidance of doubt, please could you rerun ./docker-ingress-routing-daemon (no arguments) on each node, and paste the output of the lines beginning with !!!.

And could you also check all the permutations of (a) load balancer node; and (b) destination container node; and advise which work and which do not? i.e. If you have three nodes - A, B, C - each running one service container - then in practice you could make requests to either A, B, or C as load balancer, with IPVS on that node forwarding the request to a service container on A, B, or C.

It will be helpful to establish precisely which permutations work/fail.
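
In your case (nodes Y, A and U, port 42069), a rough way to sweep the load-balancer dimension from a client machine might be something like the sketch below; it assumes the node hostnames resolve from the client and that the service answers plain HTTP:

for lb in Y A U; do
  if curl -s -o /dev/null -m 5 "http://$lb:42069/"; then
    echo "$lb:42069 Pass"
  else
    echo "$lb:42069 Fail (timed out or refused)"
  fi
done
# to cover the destination-container dimension, constrain the service to one node at
# a time (e.g. docker service update --constraint-add node.hostname==A <service>)
# and rerun the loop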

FYI: In our production environment, we also do not currently run service containers on the nodes used as load balancers. So if we call our load balancer nodes X & Y, and our worker nodes A, B, C, D, E, F, then all of the following are valid permutations:

  • Incoming request for X => forwards to A, B, C, D, E or F
  • Incoming request for Y => forwards to A, B, C, D, E or F

P.S. In my experience, whether a node is a manager node or not has no bearing on its handling of incoming service requests, or on the forwarding of those requests. But if pre-existing firewall rules are not configured identically and properly, that can affect whether containers on certain nodes are reachable from certain load-balancer nodes.
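
One rough way to compare pre-existing rules between two nodes (a sketch, assuming root ssh access to hosts named A and U) could be:

diff \
  <(ssh A "iptables-save | sed 's/\[[0-9:]*\]/[0:0]/g' | grep -v '^#'") \
  <(ssh U "iptables-save | sed 's/\[[0-9:]*\]/[0:0]/g' | grep -v '^#'")
# the sed normalises packet/byte counters and the grep drops timestamp comments,
# so only genuine rule differences show up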

@struanb Thanks for helping me out. Will @ you from now on.

I haven't specifically configured a /16 subnet; Docker must have chosen this itself (I see that my ingress is indeed a /16). I figured it'd still work because all the last IP values were unique. It just so happens that the unreachable machine is the one on a different /24 subnet. I'm going to try reconfiguring that first and retesting. If that does not work, I'll continue with all the suggestions you have put forward. I'll let you know as soon as I get into it.

@struanb
I've reconfigured the swarm to work on a /24 subnet. I found that reconfiguring ingress forces you to basically recreate your entire swarm :p Here are the outputs of running the no-argument command on each machine:

A:

!!! Ingress subnet: 10.255.0.0/24
!!! This node's ingress network IP: 10.255.0.15

U:

!!! Ingress subnet: 10.255.0.0/24
!!! This node's ingress network IP: 10.255.0.16

Y (load balancer):

!!! Ingress subnet: 10.255.0.0/24
!!! This node's ingress network IP: 10.255.0.26

On every machine I ran this command (verified with copy-paste):
./docker-ingress-routing-daemon-v3.sh --install --services "dashboard" --tcp-ports "42069" --ingress-gateway-ips "10.255.0.15 10.255.0.16 10.255.0.26" --no-performance

Here is a table of all permutations. Note that Y is the load balancer. No errors were found in the logging. To be sure, I attempted the connections multiple times for every cell.

Attempted request | 1 container runs only on U | 1 container runs only on A | 3 containers on A, 3 containers on U
Y:42069           | Pass                       | Fail                       | Fails sometimes
A:42069           | Pass                       | Fail                       | Fails sometimes
U:42069           | Pass                       | Fail                       | Fails sometimes

Hi @Vaults. Thanks for taking the trouble to do this. It seems we've eliminated the ingress subnet as the cause of the issue.

And for the avoidance of doubt: although you say Y is the load balancer, it seems (from your attempted requests) that nodes A and U are also operating correctly as load balancers, receiving requests and routing traffic to U (as they all should, since you specified the ingress IPs of all three nodes, Y, A and U, on the command line).

Therefore, it seems there must be a configuration difference between A and U that is preventing packets either from reaching A, or from being returned correctly by A to the load balancer receiving the request. I'm not sure what this could be. Some things I would now try:

  1. Uninstall the docker-ingress-routing-daemon everywhere. Shut down and recreate the service, running one container on A. Confirm that you get a Pass when requesting Y:42069.
  2. Reinstall the daemon on Y and A (we can ignore U), using the simpler command shown below. Again shut down and recreate the service, running one container on A. Confirm that, again, you get a Fail when requesting Y:42069.
  3. Check the value of /proc/sys/net/ipv4/ip_forward, and uname -r, on Y, A and U. I'm not sure if either matters, but they could be revealing (a loop like the sketch after this list can collect them in one go).
  4. Where <containerId> is the ID of the service container running on A, check what iptables and routing rules have been installed in the container's namespace with:
     nsenter -n -t $(docker inspect -f '{{.State.Pid}}' <containerId>) iptables-save -t mangle
     nsenter -n -t $(docker inspect -f '{{.State.Pid}}' <containerId>) ip rule show
     nsenter -n -t $(docker inspect -f '{{.State.Pid}}' <containerId>) ip route show table 26
  5. Run tcpdump on A on the primary node interface. Make an incoming request to Y. Note whether you see only incoming packets, or outgoing packets as well.
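
For step 3, a minimal sketch that collects both values from each node in one go (assuming ssh access to hosts named Y, A and U) might be:

for node in Y A U; do
  echo "== $node =="
  ssh "$node" 'uname -r; cat /proc/sys/net/ipv4/ip_forward'
done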

P.S. To simplify the setup now, we can launch with only Y's IP specified on the command line, i.e. ./docker-ingress-routing-daemon-v3.sh --install --services "dashboard" --tcp-ports "42069" --ingress-gateway-ips "10.255.0.26" --no-performance. N.B. When this is run on A, A's load-balancer iptables rules are still configured (it could be useful to provide a way to disable even this), but the additional routing rules needed in service containers launched on A (for returning traffic to the load balancers) will only be added in respect of Y. Incoming requests to A:42069 and U:42069 should then no longer succeed, but the outcomes for requests to Y:42069 should be unchanged.

Hi @struanb. Thank you for the debug pointers. I've followed all the steps, but I have no intuition yet as to where the problem might be. Below are the results of all the steps taken.


1 - Uninstalled daemon on all machines. Recreated service, forced on A. Connecting to Y:42069 gives a pass.


2 - Reinstalled the daemon on Y and A with the given simpler command. Recreated the service, with a single container forced onto A. Consistent fail across more than 10 retries.


3 -

                                  | Y           | A                  | U
uname -r                          | 4.19.66-v7+ | 4.15.0-135-generic | 4.19.0-14-amd64
cat /proc/sys/net/ipv4/ip_forward | 1           | 1                  | 1

4 - Confirmed 414d52c46bba to be the container (running on A) belonging to the service (dashboard).

2021-02-22.10:18:30.176422|A|14918| Container SERVICE=dashboard, ID=414d52c46bba117176f23f45ffb8e15be500013bc7ae9a1b3994d227b338036c, NID=15137 launched: ingress network interface eth0 found, so applying policy routes.

Ran the following command on A: nsenter -n -t $(docker inspect -f '{{.State.Pid}}' 414d52c46bba) iptables-save -t mangle

# Generated by iptables-save v1.6.1 on Mon Feb 22 10:24:52 2021
*mangle
:PREROUTING ACCEPT [42:2520]
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A PREROUTING -m tos --tos 0x1a/0xff -j CONNMARK --set-xmark 0x1a/0xffffffff
-A OUTPUT -p tcp -j CONNMARK --restore-mark --nfmask 0xffffffff --ctmask 0xffffffff
COMMIT
# Completed on Mon Feb 22 10:24:52 2021

Ran this command as well: nsenter -n -t $(docker inspect -f '{{.State.Pid}}' 414d52c46bba) ip rule show

0:  from all lookup local 
32700:  from 10.255.0.0/24 fwmark 0x1a lookup 26 
32766:  from all lookup main 
32767:  from all lookup default 

Finally: nsenter -n -t $(docker inspect -f '{{.State.Pid}}' 414d52c46bba) ip route show table 26

default via 10.255.0.26 dev eth0

5 - I confirmed 10.255.0.120 to be the ingress IP of the dashboard service container. Note that the following lines kept repeating, so I'm only showing the first few of each capture. Ran the following command on A:
nsenter -n -t $(docker inspect -f '{{.State.Pid}}' 414d52c46bba) tcpdump port 42069

10:29:04.809105 IP <host ip A>.47204 > 10.255.0.120.42069: Flags [S], seq 1249003281, win 64240, options [mss 1460,sackOK,TS val 3443063060 ecr 0,nop,wscale 7], length 0
10:29:05.832600 IP <host ip A>.47204 > 10.255.0.120.42069: Flags [S], seq 1249003281, win 64240, options [mss 1460,sackOK,TS val 3443064085 ecr 0,nop,wscale 7], length 0
10:29:07.845435 IP <host ip A>.47204 > 10.255.0.120.42069: Flags [S], seq 1249003281, win 64240, options [mss 1460,sackOK,TS val 3443066101 ecr 0,nop,wscale 7], length 0
10:29:11.966917 IP <host ip A>.47204 > 10.255.0.120.42069: Flags [S], seq 1249003281, win 64240, options [mss 1460,sackOK,TS val 3443070229 ecr 0,nop,wscale 7], length 0

Ran this command as well on A: nsenter --net=/var/run/docker/netns/ingress_sbox tcpdump port 42069

10:32:14.265485 IP <host ip A>.48760 > 10.255.0.120.42069: Flags [S], seq 2397950671, win 64240, options [mss 1460,sackOK,TS val 3443252514 ecr 0,nop,wscale 7], length 0
10:32:15.275564 IP <host ip A>.48760 > 10.255.0.120.42069: Flags [S], seq 2397950671, win 64240, options [mss 1460,sackOK,TS val 3443253525 ecr 0,nop,wscale 7], length 0
10:32:17.293153 IP <host ip A>.48760 > 10.255.0.120.42069: Flags [S], seq 2397950671, win 64240, options [mss 1460,sackOK,TS val 3443255545 ecr 0,nop,wscale 7], length 0
10:32:21.411191 IP <host ip A>.48760 > 10.255.0.120.42069: Flags [S], seq 2397950671, win 64240, options [mss 1460,sackOK,TS val 3443259669 ecr 0,nop,wscale 7], length 0
10:32:29.594197 IP <host ip A>.48760 > 10.255.0.120.42069: Flags [S], seq 2397950671, win 64240, options [mss 1460,sackOK,TS val 3443267861 ecr 0,nop,wscale 7], length 0

Finally ran this command for extra info on Y: nsenter --net=/var/run/docker/netns/ingress_sbox tcpdump port 42069

11:35:15.187397 IP <IP of connecting machine>.50236 > 10.255.0.120.42069: Flags [S], seq 328693815, win 64240, options [mss 1460,sackOK,TS val 3443433394 ecr 0,nop,wscale 7], length 0
11:35:15.439114 IP <IP of connecting machine>.50238 > 10.255.0.120.42069: Flags [S], seq 2596679873, win 64240, options [mss 1460,sackOK,TS val 3443433646 ecr 0,nop,wscale 7], length 0
11:35:16.214555 IP <IP of connecting machine>.50236 > 10.255.0.120.42069: Flags [S], seq 328693815, win 64240, options [mss 1460,sackOK,TS val 3443434421 ecr 0,nop,wscale 7], length 0
11:35:16.470456 IP <IP of connecting machine>.50238 > 10.255.0.120.42069: Flags [S], seq 2596679873, win 64240, options [mss 1460,sackOK,TS val 3443434677 ecr 0,nop,wscale 7], length 0
11:35:18.230495 IP <IP of connecting machine>.50236 > 10.255.0.120.42069: Flags [S], seq 328693815, win 64240, options [mss 1460,sackOK,TS val 3443436437 ecr 0,nop,wscale 7], length 0

Thanks for doing this. As far as I can see, on a quick examination, the iptables rules, ip rule, and ip route setup within A's namespace look correct.

The tcpdump output, as you must have noticed, shows only incoming packets to the container's IP and no outgoing packets. This can't be correct.

Also, the first tcpdump reports A's host IP as the source IP. I'd expect to see your client IP (IP of connecting machine) there. At least that's what I see when I run this for a container in our production cluster.

In the second tcpdump, I'm not sure I would expect to see anything logged. Packets arising from incoming connections to Y should not (if I understand it correctly, and according to my own tests just now) be seen in the ingress_sbox namespace on A, but only on Y.

In the third tcpdump, the results look correct.

How to explain this? It could be that there is a rogue iptables SNAT rule on A, masquerading traffic arriving at A from Y. Can you run iptables-save -t nat to check for this in the default namespace (plain command), the ingress_sbox namespace, and the container's namespace?
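
For reference, the three invocations would be along the following lines (on A, with <containerId> standing for the service container's ID):

# default namespace
iptables-save -t nat
# ingress_sbox namespace
nsenter --net=/var/run/docker/netns/ingress_sbox iptables-save -t nat
# container namespace
nsenter -n -t $(docker inspect -f '{{.State.Pid}}' <containerId>) iptables-save -t nat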

@struanb It seems odd to me as well that there aren't any outgoing packets. I was looking with iptables -L back then and couldn't see anything that stood out, but maybe it will make more sense to you, as my knowledge here is probably lacking. I've run the following commands on A:

nsenter -n -t $(docker inspect -f '{{.State.Pid}}' 414d52c46bba) iptables-save -t nat

# Generated by iptables-save v1.6.1 on Mon Feb 22 13:09:53 2021
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [8:660]
:POSTROUTING ACCEPT [8:660]
:DOCKER_OUTPUT - [0:0]
:DOCKER_POSTROUTING - [0:0]
-A PREROUTING -d 10.255.0.120/32 -p tcp -m tcp --dport 42069 -j REDIRECT --to-ports 80
-A OUTPUT -d 127.0.0.11/32 -j DOCKER_OUTPUT
-A POSTROUTING -d 127.0.0.11/32 -j DOCKER_POSTROUTING
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:40919
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:37362
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p tcp -m tcp --sport 40919 -j SNAT --to-source :53
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p udp -m udp --sport 37362 -j SNAT --to-source :53
COMMIT

nsenter --net=/var/run/docker/netns/ingress_sbox iptables-save -t nat

# Generated by iptables-save v1.6.1 on Mon Feb 22 13:12:26 2021
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:DOCKER_OUTPUT - [0:0]
:DOCKER_POSTROUTING - [0:0]
-A OUTPUT -d 127.0.0.11/32 -j DOCKER_OUTPUT
-A POSTROUTING -d 10.255.0.0/24 -p tcp -m multiport --dports 42069 -m ipvs --ipvs -j ACCEPT
-A POSTROUTING -d 127.0.0.11/32 -j DOCKER_POSTROUTING
-A POSTROUTING -d 10.255.0.0/24 -m ipvs --ipvs -j SNAT --to-source 10.255.0.15
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:38541
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:44268
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p tcp -m tcp --sport 38541 -j SNAT --to-source :53
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p udp -m udp --sport 44268 -j SNAT --to-source :53
COMMIT
# Completed on Mon Feb 22 13:12:26 2021

iptables-save -t nat

# Generated by iptables-save v1.6.1 on Mon Feb 22 13:13:18 2021
*nat
:PREROUTING ACCEPT [7584:870086]
:INPUT ACCEPT [5958:762900]
:OUTPUT ACCEPT [7192:955465]
:POSTROUTING ACCEPT [7195:955645]
:DOCKER - [0:0]
:DOCKER-INGRESS - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER-INGRESS
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT -m addrtype --dst-type LOCAL -j DOCKER-INGRESS
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -o docker_gwbridge -m addrtype --src-type LOCAL -j MASQUERADE
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 172.18.0.0/16 ! -o docker_gwbridge -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
-A DOCKER -i docker_gwbridge -j RETURN
-A DOCKER-INGRESS -p tcp -m tcp --dport 42069 -j DNAT --to-destination 172.18.0.2:42069
-A DOCKER-INGRESS -p tcp -m tcp --dport <other-port> -j DNAT --to-destination 172.18.0.2:<other-port> (this line was repeated for each service)
-A DOCKER-INGRESS -j RETURN
COMMIT
# Completed on Mon Feb 22 13:13:18 2021

I'm a bit stumped. Going back to your first tcpdump: how could a connection arriving at Y, forwarded by IPVS within the ingress namespace on Y (from 10.255.0.26) to the container on A (10.255.0.120), be seen within that container with a source IP of <host ip A>? The only two possibilities should be: (a) <ingress IP for Y>, i.e. 10.255.0.26 (if the daemon is uninstalled); or (b) <client's IP> (if the daemon is installed).

My theory was that an iptables rule on A was rewriting the source IP to <host ip A>. Yet pretty much everything in your iptables rules checks out. There are minor differences between your rules and mine, which I suspect come down to differences in Docker version.

So I'm sorry to say I remain stumped, and unsure how to proceed without being able to hunt around at the keyboard myself. Remaining ideas:

  • I don't know whether you might have some nftables firewall rules in place (while you're running legacy iptables), or vice versa. That could produce odd effects. Can you reboot A to ensure you have freshly installed iptables rules, and retest? Ideally do this before any further investigations.
  • I don't know whether you're running a similar version of dockerd on all your servers. I'm running 19.03.* and 20.10.3. Not sure it could explain this, but could you check? (The sketch after this list shows quick checks for this and the previous point.)
  • Set up your cluster with Y as load balancer, and A & U running containers. Run the tcpdump commands above on U as well as A. Confirm that, with the daemon uninstalled, you see connections from <ingress IP for Y> to <ingress IP for container>. Confirm that, with the daemon installed, you see connections from <client IP> to <ingress IP for container>, at least on U. Compare U's and A's tcpdumps. If they remain different, that at least backs up the theory that something is wrong with A.
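
A rough sketch of quick checks for the first two points, to be run on each node:

iptables --version                        # newer builds report "(legacy)" or "(nf_tables)"
nft list ruleset 2>/dev/null | head -20   # any output here means nftables rules are loaded
docker version --format '{{.Server.Version}}'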

@Vaults As it seems unclear whether this issue concerns this software - after all, it does work with one of your nodes but not with another - I'm closing this issue for now. If there's a way I could inspect your setup more closely, I'd be happy to take a look at some point; feel free to email me.

@Vaults Please could you check out #4 and the patch? It seems possible this could be related to the issue you were having. Are your hosts all running the same, or different, kernel versions? I suspect the default value for rp_filter has changed between 4.9.0 (which we are running in production) and 4.19.
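
For comparison across nodes, the relevant rp_filter values can be read with sysctl (eth0 below is just a placeholder for each node's primary interface):

# 0 = off, 1 = strict, 2 = loose reverse-path filtering
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.default.rp_filter net.ipv4.conf.eth0.rp_filter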

@struanb
Hi, my apologies for the flaky communication. I have some ongoing personal issues unfortunately.

I tested the patch and it seems to be working :o I've tested it with two services and it all works like a charm. My kernel versions were described in #3 (comment); they did indeed differ, but not quite in the way you described.

Thank you again for the daemon and for the effort put into fixing the issues. :)

@Vaults That's great news. I'm sorry I didn't spot the kernel discrepancy and think about what this could mean sooner, but I'm glad we now have a fix in v3.1.0.

Hope everything is good with you soon.