janeczku/go-dnsmasq

Timeouts, possibility to improve

Closed this issue · 10 comments

It seems I'm getting a bunch of timeouts of the sort:

level=error msg="Failure forwarding request \"read udp 172.17.0.3:57073->8.8.8.8:53: i/o timeout\""

While I realize this may not necessarily be the fault of go-dnsmasq, I'm wondering if we can somehow improve the resulting behaviour. I'm running it in a Hyper-V VM on a laptop I use for work, resolving some internal projects locally and forwarding the rest to 8.8.8.8, as seen above. Quite frequently some host won't resolve, and running ipconfig /flushdns on the Windows side usually solves the problem.

I didn't notice any timeout option in the settings; is this something set by a DNS RFC? Any suggestions on what I can do? Sure, I realize my internet is not the greatest, but on the other hand, a 100 ms ping to anywhere and a 50/20 Mbit LTE line don't seem like the ISDN speeds I was used to in the '90s.

P.S. I'm running go-dnsmasq via Docker, with the -f hostfile option and no other options.

Hi again. If I understood correctly, you are using just 8.8.8.8 as the upstream DNS server? Then the first step for mitigating the timeout issues is to add a second DNS server. Wireless connections are always prone to packet loss, no matter if it's 56 kbit/s over GPRS or 50 Mbit over LTE. Heck, even my 200 Mbit FTTH connection exhibits packet loss from time to time when there is a bad router along the path.
That's why one should pass at least two DNS servers to go-dnsmasq, so it has something to fall back to in case of a lost packet.
Regarding the read timeout: this is actually already a configuration parameter, but I didn't expose it in the CLI. Currently the UDP read timeout is set to 2 seconds. If adding a second upstream DNS server doesn't help with the timeout errors, we can add a CLI flag to make the timeout configurable.

FYI, the principle embraced by go-dnsmasq when forwarding a query is to move on to the next available nameserver (if any) when a connection error occurs, instead of retrying a possibly unavailable server. It's just more efficient.

FYI, maybe I'm remembering this incorrectly, but don't some DNS servers also provide resolving over TCP? A quick search suggests this appeared somewhere in the 2010+ era?

Edit: it seems you have it implemented — level=info msg="Ready for queries on tcp://0.0.0.0:53 [rcache 0] Nameservers: [8.8.8.8:53 8.8.4.4:53]". I passed the second Google server; hope it will improve behaviour. Also enabled --verbose. Still seeing some errors; it seems that a minimum cache TTL would be a good default option?

I have thought again about the read timeout option, and I don't think it will help in your case. The errors you see are caused by lost or dropped packets (Google's DNS servers do drop requests when you hit certain rate limits). If the packet is dropped or lost, a longer read timeout won't help. If these errors happen frequently even with two configured nameservers, then we should probably change the retry routine to resend a failed request at least once (I think this is how the Linux DNS client does it as well).

but some DNS servers also provide resolving over TCP

DNS over TCP is a fallback mechanism for when the DNS response exceeds the 512-byte packet limit specified for DNS over UDP.

seems that a minimum cache TTL would be a good option as a default?

You mean a minimum value the TTL of a response must have before we store it in the cache?

No, I'm actually suggesting that a minimum cache expiry greater than 0 seconds after a resolve would be good default behaviour. I've accrued several examples of successfully resolving a hostname which, because I don't have any caching set explicitly, then fails to resolve on some later try. The examples in the logs are slightly repetitive, but it comes to about 10 queries for the same FQDN, of which the first 4 resolve OK. If the DNS response were cached out of the box for, say, 1 minute, it might be fine? Or even a shorter TTL for the dnsmasq cache; it's hard to gauge since the logs don't have any timestamps...

Well, for one, there seems to be no logging of successful queries to DNS hosts and what they reply with even with --verbose.

level=debug msg="Querying nameserver 8.8.8.8:53 question live.github.com."
level=debug msg="Received DNS query for \"live.github.com.\" from \"10.1.1.1:52060\" with type 1"
level=debug msg="Querying nameserver 8.8.8.8:53 question live.github.com."
level=debug msg="Received DNS query for \"live.github.com.\" from \"10.1.1.1:52060\" with type 1"
level=debug msg="Querying nameserver 8.8.8.8:53 question live.github.com."
level=debug msg="Error querying nameserver 8.8.8.8:53: \"read udp 172.17.0.3:40810->8.8.8.8:53: i/o timeout\""
level=debug msg="Querying nameserver 8.8.4.4:53 question live.github.com."
level=debug msg="Error querying nameserver 8.8.8.8:53: \"read udp 172.17.0.3:40810->8.8.8.8:53: i/o timeout\""
level=debug msg="Querying nameserver 8.8.4.4:53 question live.github.com."
level=debug msg="Error querying nameserver 8.8.8.8:53: \"read udp 172.17.0.3:40810->8.8.8.8:53: i/o timeout\""
level=debug msg="Querying nameserver 8.8.4.4:53 question live.github.com."
level=debug msg="Received DNS query for \"github.com.\" from \"10.1.1.1:59022\" with type 1"
level=debug msg="Querying nameserver 8.8.8.8:53 question github.com."

This is the example I'm talking about. I assume the first few queries resolve fine, and the rest fall into some kind of blacklist/rate limit and are ignored. Caching the first successful response for some N seconds (30, 60?) would probably solve it. But I'm not experiencing issues after adding Google's second nameserver, so I'm not sure it's even an issue anymore...

Well, for one, there seems to be no logging of successful queries to DNS hosts and what they reply with even with --verbose.

Since we blindly return whatever reply we get from the upstream server, that information is available in the client. If you need this kind of information in the logs, though (for example the Rcode of the reply), go ahead and send a PR!

No, I'm actually considering that a minimum cache expiry > 0 seconds after
resolve would be good behaviour.

Ah, I see; that should be possible right now. Caching is completely disabled by default, though.
You need to set --rcache to some positive value to enable caching, and if you want to control the TTL of cache entries you can do so with --rcache-ttl (set to 60 seconds by default).
By the way: --rcache-ttl sets both the minimum and maximum TTL 😆 each response will be cached for exactly 60 seconds. While looking at the code I discovered a possible bug that might prevent re-caching of previously expired entries, so I would be glad if you could give it a spin and check if it's working for you.
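Putting the flags from this thread together, an invocation might look like the following (a sketch; I'm assuming --rcache takes the cache capacity, matching the "[rcache 0]" shown in the startup log above):

```shell
# Enable the response cache (assumed capacity of 100 entries) with the
# default 60-second per-entry TTL, alongside the hostfile setup from above:
go-dnsmasq -f hostfile --rcache 100 --rcache-ttl 60 --verbose
```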

it's hard to gauge since the logs don't have any timestamps...

Yes that needs to be fixed.

FYI, logs are now printed with timestamps (when there is no terminal attached). Closing this issue for now.