digidotcom/DCRabbit_10

router_add with existing ARP table entry not marked as ROUTER

Closed this issue · 7 comments

I had some uncharacteristic communication errors, and tracked my specific problem to an issue with adding a static router. In short, I was successfully adding a static router using IFS_ROUTER_SET, but later on the ARP entry for that router was flushed. It turns out that particular ip address was already in the ARP table, but not marked with the ATE_ROUTER_ENT flag. This happened because of an earlier speculative load. The router_add code found the entry, added it to the router table, and was quite happy. But alas, later on the entry got flushed and I lost connectivity through that router.

I'm thinking some sort of check for this inside the router_add function in ARP.LIB might do the trick? Or am I barking up the wrong tree here?

_arp_nodebug ATHandle router_add(longword ipaddr, word iface,
	longword subnet, longword mask, word flags)
...
   if (ath > 0) {
	   /* Let's check if it's marked as a router...might do stuff later if not */
	   	idx = ATH2INDEX(ath);
	   	/* We already know it's good...so just get to the flags */
	   	ate = _arp_data + idx;
	   	if (!(ate->flags & ATE_ROUTER_ENT)) {
	   		setToRouter = 1;
	   	}
	   // If in the ATE already, locate its RTE.
		for (i = 0, rte = _arp_gate_data; i < ARP_ROUTER_TABLE_SIZE; i++, rte++)
			if (rte->ath && ATH2INDEX(rte->ath) == ATH2INDEX(ath)) {
				if (!flags && (subnet != rte->u.preconfig.subnet || mask != rte->u.preconfig.mask))
					continue;	// Skip if preconfigured, but not exactly matching parameters
#ifdef ARP_VERBOSE
				printf("ARP: router_add %08lX - entry already in\n", ipaddr);
#endif
				if (!(rte->flags & (RTE_TRANSIENT | RTE_DHCP | RTE_REDIRECT))) {
					// Don't override preconfigured - these values set back in below.
					flags &= ~(RTE_TRANSIENT | RTE_DHCP | RTE_REDIRECT);
					subnet = rte->u.preconfig.subnet;
					mask = rte->u.preconfig.mask;
				}
				if (flags & RTE_REDIRECT && rte->flags & (RTE_TRANSIENT | RTE_DHCP))
					flags &= ~RTE_REDIRECT;	// Don't mark transients as redirects.
				goto _arp_set_entry;
			}
   }
...
arp_set_entry:
    if (setToRouter) {
        /* Need to change the flags and probably do other stuff */
    }

I didn't do any type of pull request for this, because perhaps I'm interpreting things incorrectly?

Good find. I think the following fix will take care of setting the flag when necessary. I'm out of the office for the holidays, so I can't test beyond verifying it compiles. If you can do some testing and report back, I'd greatly appreciate it.

diff --git a/Lib/Rabbit4000/tcpip/ARP.LIB b/Lib/Rabbit4000/tcpip/ARP.LIB
index 21da61980..dd28e81de 100644
--- a/Lib/Rabbit4000/tcpip/ARP.LIB
+++ b/Lib/Rabbit4000/tcpip/ARP.LIB
@@ -1724,9 +1724,10 @@ _arp_nodebug ATHandle router_add(longword ipaddr, word iface,
 #endif
 
    if (ath > 0) {
+		int idx = ATH2INDEX(ath);
    	// If in the ATE already, locate its RTE.
 		for (i = 0, rte = _arp_gate_data; i < ARP_ROUTER_TABLE_SIZE; i++, rte++)
-			if (rte->ath && ATH2INDEX(rte->ath) == ATH2INDEX(ath)) {
+			if (rte->ath && ATH2INDEX(rte->ath) == idx) {
 				if (!flags && (subnet != rte->u.preconfig.subnet || mask != rte->u.preconfig.mask))
 					continue;	// Skip if preconfigured, but not exactly matching parameters
 #ifdef ARP_VERBOSE
@@ -1740,6 +1741,11 @@ _arp_nodebug ATHandle router_add(longword ipaddr, word iface,
 				}
 				if (flags & RTE_REDIRECT && rte->flags & (RTE_TRANSIENT | RTE_DHCP))
 					flags &= ~RTE_REDIRECT;	// Don't mark transients as redirects.
+
+				// if not P2P, flag existing ARP table entry as a router entry
+				if (idx < ARP_TABLE_SIZE)
+					_arp_data[idx].flags |= ATE_ROUTER_ENT;
+
 				goto _arp_set_entry;
 			}
    }

If the arp entry was in some state of processing where other structure members were set, would just
_arp_data[idx].flags |= ATE_ROUTER_ENT;
be enough to "change" the entry to a router? Later on, a new insert is:
ath = arpcache_load(ath, NULL, iface, ATE_ROUTER_ENT, i);
It's probably overkill to remove and then add as new, but just worried about missing some other init setting.

I'm pretty sure we're holding a global TCP lock here, so it should be safe to set that bit in the flags at any time. If you follow the execution path, the arpcache_load() call only happens if it needs to create a new entry. My addition covers the case where we're making changes to an existing ARP table entry.

Let me know if you're able to test, and especially if you can come up with a test case that reliably fails with the old code and passes with the new code. I'm thinking you'd need to be on a largish network. Ping the router so you get an ARP entry, then add it as the default route, then ping more devices than you have ARP table entries for. Then make sure you still have an entry in the table for the router.

I should be able to do some testing when I'm back in my office next week.

Ok, I just wanted to make sure the existing ARP table entry flags were all being [re]set correctly, since the old execution path only created a new entry.

We were in a jam, so my method of testing was, gulp, putting the new code out at the customer site and having them verify things worked properly now. Prior to the change, connectivity from remote ip addresses wasn't possible. But, the code I put out there was prior to your change, and I just had replicated the arpcache_load call even in the case of an existing table entry. Your code seems cleaner, but I can't update in the field at this point. Hopefully you've got an environment you can spin up for some testing.

@daveoman, I was able to do some testing this afternoon, and my proposed fix did not work. I dug into it further, and have a simpler fix that I've now pushed into the repository. I'm attaching the code I used to test, and hopefully you can run your own tests at some point in the future and let me know if you continue to see failures.

Here's the code I used for testing:

#undef TCPCONFIG

#define TCPCONFIG           1
#define BASE_NETWORK        "192.168.1."
#define _PRIMARY_STATIC_IP  BASE_NETWORK "5"
#define _PRIMARY_NETMASK    "255.255.255.0"
#define MY_NAMESERVER       "8.8.8.8"

// start with a bad gateway
#define MY_GATEWAY          BASE_NETWORK "23"

#define ACTUAL_GATEWAY      BASE_NETWORK "1"

#define ARP_VERBOSE
#define ARP_DEBUG

#use "dcrtcp.lib"

void full_dump(void)
{
	arpcache_printall();
	router_printall();
}

void ping_me(const char *ip)
{
	static longword seq = 0;
	longword ping_who, result, tmp_seq, timeout;
	
	ping_who = resolve(ip);
	if (ping_who == 0) {
		printf("ARP resolve failed\n");
		return;
	}
	printf("Pinging 0x%08LX...", ping_who);
	_ping(ping_who, ++seq);
	
	timeout = _SET_TIMEOUT(1000);
	while (!_CHK_TIMEOUT(timeout)) {
		result = _chk_ping(ping_who, &tmp_seq);
		if (result != 0xFFFFFFFF) {
			printf("response in %lums\n", result);
			return;
		}
	}
	
	printf("timed out\n");
}

int main(void)
{
	int i;
	char buffer[16];
	
	sock_init_or_exit(1);

	full_dump();
	
	// ping the router to get its MAC address
	ping_me(ACTUAL_GATEWAY);
	
	// change gateway to actual router
	ifconfig(IF_ETH0, IFS_ROUTER_SET, aton(ACTUAL_GATEWAY), IFS_END);

	full_dump();
	
	// ping a bunch of other IP addresses to try flushing out the table
	for (i = 10; i < 30; ++i) {
		sprintf(buffer, "%s%u", BASE_NETWORK, i);
		ping_me(buffer);
	}

	// see if the router is still set		
	full_dump();	
}

@tomlogic, there is still a path thru router_add that wouldn't set the ATE_ROUTER_ENT bit. When looping thru the _arp_gate_data for an applicable entry, if found there is a
goto _arp_set_entry;
that would jump over the setting of the ATE_ROUTER_ENT flag.

I'm guessing that if there happens to be a router entry in the router table, there isn't a circumstance where the arp table entry isn't set correctly?

Correct, I am assuming that with this bug fix, it isn't possible to have a router table entry that doesn't have the ATE_ROUTER_ENT bit set in its linked ARP table entry. If that is possible, I would want to fix the root bug.