oetiker/SmokePing

smokeping fails to start (timeout) if DNS temporarily unavailable

lelutin opened this issue · 36 comments

Hi there,

This was submitted by Michael Deegan as a bug report to debian, so I'm forwarding it here.

The original bug report is here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=996824


This issue occurs when my smokeping host comes up before my dnsmasq host
(which currently waits during boot for my VDSL modem to gain sync...).

The result is that smokeping takes longer to start up than systemd is willing to wait:

root@yakka:~# systemctl status smokeping.service
โ— smokeping.service - Latency Logging and Graphing System
     Loaded: loaded (/lib/systemd/system/smokeping.service; enabled; vendor preset: enabled)
     Active: failed (Result: timeout) since Tue 2021-10-19 06:35:59 AWST; 11h ago
       Docs: man:smokeping(1)
             file:/usr/share/doc/smokeping/examples/systemd/slave_mode.conf
    Process: 1187 ExecStart=/usr/sbin/smokeping --pid-dir=/run/smokeping (code=killed, signal=TERM)
        CPU: 418ms

Oct 19 06:34:28 yakka systemd[1]: Starting Latency Logging and Graphing System...
Oct 19 06:34:47 yakka smokeping[1187]: WARNING: Hostname 'mazikeen' does currently not resolve to an IPv6 or IPv4 address
Oct 19 06:34:59 yakka smokeping[1187]: WARNING: Hostname 'jackpc' does currently not resolve to an IPv6 or IPv4 address
Oct 19 06:35:12 yakka smokeping[1187]: WARNING: Hostname 'vins-pc' does currently not resolve to an IPv6 or IPv4 address
Oct 19 06:35:24 yakka smokeping[1187]: WARNING: Hostname 'hoffman' does currently not resolve to an IPv6 or IPv4 address
Oct 19 06:35:36 yakka smokeping[1187]: WARNING: Hostname 'sebastian.murdoch.edu.au' does currently not resolve to an IPv6 or IPv4 address
Oct 19 06:35:48 yakka smokeping[1187]: WARNING: Hostname 'dummy' does currently not resolve to an IPv6 or IPv4 address
Oct 19 06:35:59 yakka systemd[1]: smokeping.service: start operation timed out. Terminating.
Oct 19 06:35:59 yakka systemd[1]: smokeping.service: Failed with result 'timeout'.
Oct 19 06:35:59 yakka systemd[1]: Failed to start Latency Logging and Graphing System.

I think that in the interests of robustness, it would be better that startup
not involve attempting to resolve every target hostname. Perhaps DNS
activity could instead be deferred until after the daemon forks?

This issue has become stale and will be closed automatically within 7 days. Comment on the issue to keep it alive.

Comment.

Could any dev possibly weigh in on this? Frankly, this seems to be a major issue, nothing short of an opening for a wilful DoS.

Worth noting that in my setup I use a short timeout setting (2 sec, not the usual 15), so this initial probe is also ignoring that setting (otherwise my instance would start).

Have you tried adding a dependency on the dnsmasq service, so that the smokeping service starts after dnsmasq?

That would work except for two cases that I actually run into with smokeping:

  • when the subnet hosting the smokeping service has been cut off from its resolver (because of a switch port fault);
  • when smokeping tries to resolve an external hostname (e.g. google.com) while disconnected from the internet during an outage.

I agree with the initial reporter - there's no good reason to fail the start of the service when a single check of a single target happens to fail during the service restart.

Appreciate the suggestion though.

This is not a Smokeping issue, it is working as expected.
Run a light resolver like Dnsmasq on the same machine as Smokeping instead.

If the issue is just during startup, start the smokeping service after the name service:

[Unit]
After=network.target
Wants=network-online.target

I already run dnsmasq on the router PC on my LAN. Running a second instance on every workstation/server seems unnecessary and pointless. Also, no number of dnsmasq instances will help if the LAN is not (yet) connected to the internet (eg. immediately following a power outage, which is the case I had in mind in my original bugreport).

I disagree that startup should fail if internet connectivity is unavailable at the moment of startup. Please just fix your program to be friendly to dynamically changing network configuration.

@Strykar oh interesting, if that addition to the unit file works then I shall include it in the debian package (and thus it's indeed not an issue for upstream). the unit file already has the "After" line but not the "Wants" one.

@miiichael can you try the following on the computer where you run smokeping and see if it helps avoid the startup issue?

  1. mkdir -p /etc/systemd/system/smokeping.service.d
  2. printf '[Unit]\nWants=network-online.target\n' > /etc/systemd/system/smokeping.service.d/wait_for_network_online.conf
  3. systemctl daemon-reload
  4. now run tests to see if the dns resolving has issues in your setup

I'll try it, but the issue isn't that smokeping is starting before local networking - it's starting before internet connectivity exists (either due to an ISP outage, or we've just recovered from a power outage and the VDSL modem is still negotiating sync). My understanding is that adding network-online.target just makes it start after ifup, yes?

michael@yakka:~$ find /*/systemd/system/network-online.target* -ls
   264068      4 drwxr-xr-x   2 root     root         4096 Dec  8  2017 /etc/systemd/system/network-online.target.wants
   265016      0 lrwxrwxrwx   1 root     root           38 Dec  8  2017 /etc/systemd/system/network-online.target.wants/networking.service -> /lib/systemd/system/networking.service
   265715      4 -rw-r--r--   1 root     root          513 Feb  2  2021 /lib/systemd/system/network-online.target

My current workaround is to add some janky cron jobs. 😅

@reboot root sleep 300; systemctl status smokeping.service >/dev/null || (echo "Smokeping is bad. Let's try restarting it."; systemctl restart smokeping.service)
@reboot root sleep 360; systemctl status smokeping.service >/dev/null || (echo "Smokeping is still bad. Let's try restarting it."; systemctl restart smokeping.service)
@reboot root sleep 600; systemctl status smokeping.service >/dev/null || (echo "Smokeping is *still* bad. Let's try restarting it."; systemctl restart smokeping.service)

@lelutin There are multiple issues being conflated here, the other posters should create a separate issue for their own. Please test and share your results here. That is the systemd-way of service dependency.

@miiichael As a network stream latency grapher, how do you think smokeping should behave if no network or name service is available for the hosts it is configured to probe?
A majority of the Internet would stop functioning without DNS, do share this vision on how you see smokeping being the exception here. Please articulate your thoughts instead of just sharing the same link again.
Which, incidentally refers to network.target in relation to dynamic network interfaces, not streams.

I think it would be reasonable for smokeping to report that all internet hosts report as unreachable when the internet is unreachable, exactly as if I'd specified those hosts by IP (while still reporting on hosts within my LAN, which of course remain resolvable).

This issue has become stale and will be closed automatically within 7 days. Comment on the issue to keep it alive.

Comment.

I had the same issue and just added

/etc/systemd/system/smokeping.service.d/override.conf file with

[Service]
Restart=on-failure
RestartSec=5s

As I wanted it to keep running if possible.

knofte commented

Yeah, either use
Restart=always
or, since you already have a .service unit, add a matching .timer unit (here's an example https://documentation.suse.com/smart/systems-management/html/systemd-working-with-timers/index.html )
With smokeping.timer you can, for example, wait 60-90 seconds after reboot before starting smokeping, giving the VDSL link time to come back up.
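
A minimal sketch of such a timer, assuming the service unit is named smokeping.service as in the status output above. The function name `print_smokeping_timer` and the 90s default are illustrative only; `OnBootSec=`, `Unit=` and `WantedBy=timers.target` are standard systemd timer settings.

```shell
#!/bin/sh
# Hypothetical helper: print a timer unit that starts smokeping.service
# a fixed delay after boot. The 90s default is just the upper end of the
# 60-90 second range suggested in this comment, not a recommended value.
print_smokeping_timer() {
    delay="${1:-90s}"
    cat <<EOF
[Unit]
Description=Delayed start of Smokeping after boot

[Timer]
OnBootSec=$delay
Unit=smokeping.service

[Install]
WantedBy=timers.target
EOF
}
```

The output would be saved as /etc/systemd/system/smokeping.timer, followed by `systemctl daemon-reload`, `systemctl disable smokeping.service` and `systemctl enable smokeping.timer`, so that only the timer starts the service at boot. This delays startup but does not help if the outage lasts longer than the delay.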

Starting network monitoring software while your internet is down does not seem to be a problem to solve in the monitoring itself. Fix the underlying problem instead, i.e. the internet being down, with your ISP or a UPS. :)

Well yes, but the bug here is that smokeping fails to monitor the hosts it can reach (ie. inside my LAN, which is the overwhelming majority of the hosts I'm monitoring) if it starts while my upstream link is temporarily down. Periodically restarting smokeping at intervals is just a workaround.

This issue has become stale and will be closed automatically within 7 days. Comment on the issue to keep it alive.

Comment.

@miiichael FYI I found out that systemd has a target named nss-lookup.target that can delay service startup until hostname lookup should work. I've pushed a change to the unit file in debian in the git repository (not yet released). Could I ask you to test it out and see if that fixes your issue?

basically I've changed from:

After=network.target

to:

After=nss-lookup.target

in the unit file.

Thanks! This should help ensure LAN hosts are resolvable (in case my server boots somehow before the router), but there still remains the issue of smokeping not starting if any WAN hosts are unreachable/unresolvable (eg. because the router hasn't finished bringing up the WAN interface yet).

@miiichael I see. At least now in the debian unit file the service should wait until network is fully online locally.
After more looking into this and comparing to other unit files in other packages, I've also added what @Strykar was suggesting. so now the debian unit file has those lines:

After=network-online.target nss-lookup.target
Wants=network-online.target

However, I don't think we can fix, with changes in the systemd unit file, the issue of not starting up if actual internet connectivity is not complete because of an external service like the router. For that, changes to smokeping's code would be needed in order to avoid bailing out completely upon startup.

> However, I don't think we can fix, with changes in the systemd unit file, the issue of not starting up if actual internet connectivity is not complete because of an external service like the router. For that, changes to smokeping's code would be needed in order to avoid bailing out completely upon startup.

I am not sure it should be "fixed" in a network monitoring application like Smokeping.
Instead, host Smokeping on a $5 VM so it can log your monitored hosts being down.
What are your LAN clients doing worth monitoring if their WAN is down?

Smokeping (used by ISPs and network providers worldwide) changing to accommodate people using it on unreliable residential connection is going backwards IMO.

But said $5 VM isn't going to be able to reach my internal hosts. If WAN is down my NAS and reticulation controller still works. I also host my mail locally.

That smokeping wants to resolve all IPs on startup suggests that it would not notice if DNS entries change some time after startup...

> I am not sure it should be "fixed" in a network monitoring application like Smokeping. Instead, host Smokeping on a $5 VM so it can log your monitored hosts being down. What are your LAN clients doing worth monitoring if their WAN is down?

I'm not sure a latency/availability collector like Smokeping should be the judge of temporary DNS issues. It creates problems because someone decided that a temporary DNS issue is a critical fault for it.
But to say only that is to say nothing -- it WILL fail even if just ONE configured host fails. So you can have all the hosts in order, the whole network wonderful and working, but if your ISP happens to restart your residential connection and you were cheeky enough to try to monitor the other side of the connection, everything will fail.

I can't imagine a whole Nagios server refusing to start because one of the monitored servers/services is down.
Do you see how absurd it is?

Edit: also, the fact that Smokeping doesn't crash/exit when DNS problems happen later, once it has started, shows that this is just inconsistent behaviour. Then again, Smokeping crashing because of that would be quite a significant bug, right? You should request that this be added to the code though, so the "unreliable [] connection" owner is adequately treated.

> Smokeping (used by ISPs and network providers worldwide) changing to accommodate people using it on unreliable residential connection is going backwards IMO.

I'm sorry but this is just so awfully rude.

Are you a product manager for Smokeping?
If yes, amend this:

SmokePing is a deluxe latency measurement tool. It can measure, store and display latency, latency distribution
and packet loss. SmokePing uses RRDtool to maintain a longterm data-store and to draw pretty graphs, 
giving up to the minute information on the state of each network connection.

I don't see anything about enterprise, low latency, low loss networks.
You just invalidated all your previous comments with this one.

Good tools are transparent and agnostic. This tool still uses RRD, which shows it was not designed and built with huge numbers of datapoints or the enterprise in mind.
And I've had my share of DNS failures in well-managed enterprise networks as well, which further renders your comment a complete non sequitur.

You're doing a generally good piece of software a disservice "defending it" like this.

Umm, @lelutin What version are you using?
Because the code has had this (for the past year at least):

            unless ($addressfound) {
               # do not bomb, as this could be temporary
               my $tried = join " or ", @tried;
               warn "WARNING: Hostname '$_' does currently not resolve to an $tried address\n" unless $cgimode;
            }

Which seems to indicate that the issue shouldn't exist.
I cannot find your exact error message anywhere in the code or in the service file either.
Are you sure you've updated smokeping to latest version, so we're not troubleshooting something old here...?

I.e. please verify on latest version, as this could be related only to whatever version debian has packaged.

Is it possible that it was systemd itself giving up on smokeping because it took a while to start (on account of repeatedly trying to contact not-yet-available DNS)? 🤔

@knofte great question.. but I was just responding to one part of the issue, where systemd would make sure hostname resolution is available before starting the service. I was not the one experiencing the issue (i.e. I forwarded it here from debian; miiichael was the original reporter of the issue).

@miiichael which version of smokeping were/are you using? since you reported on debian and I failed to bump the version of smokeping for the package there for a while I'm wondering if you were using 2.7.3 from debian stable. fwiw there's now 2.8.2 in debian unstable if you'd like to test things out with this version. with enough luck, and judging by what @knofte said, if you're using 2.7.3 it's possible that your issue may go away with the newer version.

Yes, I was running Debian stable (2.7.3). I've updated to testing (2.8.2), and also disabled the bandaids I put in cron that restarted the service if it wasn't running.

After wading through old logs and some systemd manpages, I'm now wondering if my specific situation can be worked around by setting an unreasonably large (12 seconds per DNS name mentioned in config) TimeoutStartSec in the systemd unit...
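
A rough sketch of that workaround, assuming targets are declared with `host = <name>` lines as in a standard Smokeping Targets file. The function names, the drop-in file name and the Debian path in the usage note are assumptions for illustration; the 12-second figure is simply the gap between warnings in the log at the top of this issue.

```shell
#!/bin/sh
# Hypothetical helper: estimate a start timeout as <seconds per host> times
# the number of "host = ..." lines in a Smokeping targets file.
estimate_start_timeout() {
    config_file="$1"
    per_host="${2:-12}"
    # grep -c prints 0 (and exits non-zero) when nothing matches,
    # so "|| true" keeps this safe under "set -e".
    hosts=$(grep -c -E '^[[:space:]]*host[[:space:]]*=' "$config_file" || true)
    echo $(( hosts * per_host ))
}

# Print a [Service] drop-in carrying the estimated timeout.
print_timeout_dropin() {
    printf '[Service]\nTimeoutStartSec=%ss\n' "$(estimate_start_timeout "$1")"
}
```

For example, `print_timeout_dropin /etc/smokeping/config.d/Targets > /etc/systemd/system/smokeping.service.d/timeout.conf` followed by `systemctl daemon-reload`. The estimate is deliberately pessimistic: it also counts hosts given as IP addresses, which need no lookup.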

jlu5 commented

I'm running into this too. In my case, I don't think the issue is even the network being down or a majority of sites failing to resolve, but rather the volume of destinations...

My workaround so far has been to patch this bit of code out: https://github.com/oetiker/SmokePing/blob/master/lib/Smokeping.pm#L2491-L2509. This lets Smokeping start up way faster, and doesn't seem to affect my probes.

I do think warning about broken destinations is useful, but it really should happen in the background (instead of blocking startup) as none of these issues should be fatal.
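
To illustrate the idea of warning without blocking (this is a shell sketch of the concept, not Smokeping's Perl code): the check below can be launched in the background so the caller continues immediately. `warn_unresolvable` and the `RESOLVE_CMD` variable are hypothetical names; `getent hosts` is assumed to be available as the default resolver command.

```shell
#!/bin/sh
# Hypothetical: command used to test whether a name resolves.
# Defaults to "getent hosts"; overridable for testing.
RESOLVE_CMD="${RESOLVE_CMD:-getent hosts}"

# Print a warning on stderr for every name that does not resolve.
# Nothing here is fatal: the function only reports.
warn_unresolvable() {
    for name in "$@"; do
        if ! $RESOLVE_CMD "$name" >/dev/null 2>&1; then
            echo "WARNING: Hostname '$name' does currently not resolve to an IPv6 or IPv4 address" >&2
        fi
    done
}
```

Calling it as `warn_unresolvable mazikeen jackpc dummy &` runs the lookups in a background job, so slow DNS delays only the warnings and not whatever the script does next.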