slackhq/nebula

Domain name in static_host_map

Closed this issue · 10 comments

Using a domain name instead of an IP in static_host_map may leave nebula in a state where the process is still running, but no attempts to reconnect to the lighthouse will ever occur.

The situation happens during initial boot when the interface is already up and network-online.target seems to be fulfilled, but dhcpcd has yet to set a nameserver. If nebula starts in that time frame (which in my case it does 9 times out of 10), it fails to resolve the lighthouse IP, reports the lighthouse as unreachable due to a missing static_host_map entry (see #41), and continues to run while making no attempts to establish a connection.
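For reference, the triggering setup is a static_host_map entry that maps the lighthouse's nebula IP to a DNS name rather than a literal address. A rough sketch of such a config (the hostname, addresses, and port below are placeholders, not values from this issue):

```yaml
static_host_map:
  # A DNS name here instead of a literal IP is what exposes the race:
  # if resolution fails at startup, the entry is silently dropped.
  "192.168.100.1": ["lighthouse.example.com:4242"]

lighthouse:
  am_lighthouse: false
  hosts:
    - "192.168.100.1"
```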

In my opinion, if we consider #41 to be a configuration issue, the error should be fatal. If on the other hand we consider it an operational issue, we should resolve the host at connect time rather than during config parsing.

I have the same problem on my Ubuntu laptop. I need to restart nebula service to get it working every time after system boots up.

Nebula v1.2.0 is still affected.

I confirm this is still not working 100% on 1.2.0. Seen on Ubuntu and Debian 9.

Confirmed not working on CentOS 7 with Nebula 1.1.0.

This is still happening with the current 1.3.0 release.

I have reproduced this issue on several Windows versions (10, Server 2016, Server 2019). On Windows, as a workaround, I manually set the startup type of the Nebula Network Service to Automatic (Delayed Start). With this set, the connection establishes properly when using the FQDN of the lighthouse server rather than its IP address, but the connection may not be available until several minutes after the machine is otherwise up.

I have reproduced this issue on the latest Mac OS X revisions as well, but I haven't developed a workaround yet, so I am specifying the IP address of the lighthouse instead. I have not yet tested on macOS 11; I expect similar behavior, but will confirm once I have an opportunity to test.

The only useful solution here would probably involve regularly re-querying names. If we go down this path, we should also consider re-querying even after we get a successful answer, as this would allow us to migrate to a new IP if the underlying DNS entry changes.
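To make the re-querying idea concrete, here is a minimal Go sketch of a refresher for a single static host entry. This is not nebula's implementation; the `hostRefresher` type and its fields are hypothetical, and the lookup function is injectable so the refresh logic can be shown without a real DNS server (nebula itself would presumably wrap something like `net.ResolveUDPAddr`):

```go
package main

import (
	"fmt"
	"time"
)

// hostRefresher is a hypothetical sketch of periodic re-resolution for a
// static host entry. lookup is injectable for illustration purposes.
type hostRefresher struct {
	name   string
	addr   string
	lookup func(name string) (string, error)
}

// refresh re-queries the name and reports whether the address changed,
// which would let callers migrate tunnels to a lighthouse's new IP.
func (h *hostRefresher) refresh() (changed bool, err error) {
	addr, err := h.lookup(h.name)
	if err != nil {
		return false, err // keep the last known address on failure
	}
	if addr != h.addr {
		h.addr = addr
		return true, nil
	}
	return false, nil
}

func main() {
	// Simulate a lighthouse whose DNS record changes between queries.
	answers := []string{"203.0.113.10", "203.0.113.10", "198.51.100.7"}
	i := 0
	h := &hostRefresher{
		name: "lighthouse.example.com", // placeholder name
		lookup: func(string) (string, error) {
			a := answers[i]
			if i < len(answers)-1 {
				i++
			}
			return a, nil
		},
	}
	for range answers {
		changed, _ := h.refresh()
		fmt.Println(h.addr, changed)
		time.Sleep(10 * time.Millisecond) // stands in for a periodic cadence
	}
}
```

Run repeatedly on a timer, this would cover both the failed-at-startup case and the changed-record case in the same code path.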

I think I encountered the same problem after a power outage left my single lighthouse unreachable for a few hours. When it came back up on a new public IP, none of the nodes that were previously connected would ever reconnect without restarting nebula on them.

It would be good to re-resolve the DNS entry for a host in the static host map every now and then while that host is unreachable.

@enykeev @bartmichu @windwalker78 @keitme @SgtZapper If you're still encountering this error, would you try adding Wants=nss-lookup.target to your systemd unit file? This should cause systemd to wait for DNS resolution to be available before starting nebula.
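For anyone trying this, a drop-in override is probably the cleanest way to add it without editing the packaged unit (e.g. via `systemctl edit nebula`; the unit name may differ on your distro). Something along these lines, with After= added so ordering is actually enforced, since Wants= alone only pulls the target in:

```ini
# /etc/systemd/system/nebula.service.d/override.conf (path is illustrative)
[Unit]
Wants=nss-lookup.target
After=nss-lookup.target
```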

I'm closing this issue out as #791 has landed. We believe this should solve the startup race.

In addition to #791, #796 has been released in v1.7.1 and re-queries DNS even if the initial query fails. By default, we re-query on a 30s cadence, but this can be configured via static_map.cadence.
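If I'm reading the setting name above correctly, the knob would sit under a static_map block in the config; a minimal sketch, with the value shown being the stated default:

```yaml
static_map:
  # How often static_host_map DNS names are re-resolved (default per above: 30s).
  cadence: 30s
```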