
Inconsistent and unexpected return behavior from QueryMultiple()


retryabledns version:


Current Behavior:

I am getting varying results from QueryMultiple() depending on the configured resolvers and retry count. I get either the most recent error or no error at all, depending on which resolver was used last and whether that resolver is error-prone. I am working with a slow network and possibly not the best resolvers, but that shouldn't produce inconsistent behavior in the retryabledns library.
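To make the symptom concrete, here is a minimal, self-contained simulation of the pattern I believe I am seeing. It is not the retryabledns code: `fakeResolver`, `timeoutResolver`, `nxdomainResolver`, and `lastErrorWins` are hypothetical names, and the strict round-robin resolver order is an assumption made only so the result is deterministic.

```go
package main

import (
	"errors"
	"fmt"
)

// DNS response codes relevant to this report.
const (
	rcodeSuccess  = 0
	rcodeNXDomain = 3
)

// fakeResolver is a hypothetical stand-in for a real resolver.
type fakeResolver func(host string) (rcode int, err error)

var errTimeout = errors.New("read udp: i/o timeout")

// timeoutResolver simulates the local resolver that drops the packet.
func timeoutResolver(string) (int, error) { return 0, errTimeout }

// nxdomainResolver simulates a healthy resolver answering NXDOMAIN.
func nxdomainResolver(string) (int, error) { return rcodeNXDomain, nil }

// lastErrorWins models the loop described in this issue: an error moves on
// to the next attempt, a non-SUCCESS rcode also moves on, and whatever err
// the final attempt produced is what the caller receives.
func lastErrorWins(host string, resolvers []fakeResolver, retries int) (int, error) {
	var rcode int
	var err error
	for i := 0; i < retries; i++ {
		var rc int
		// Assumption: resolvers are tried in strict round-robin order.
		rc, err = resolvers[i%len(resolvers)](host)
		if err != nil {
			continue
		}
		rcode = rc
		if rcode == rcodeSuccess {
			break
		}
	}
	return rcode, err
}

func main() {
	resolvers := []fakeResolver{timeoutResolver, nxdomainResolver}
	for _, retries := range []int{5, 4} {
		rcode, err := lastErrorWins("example.invalid", resolvers, retries)
		fmt.Printf("retries=%d rcode=%d err=%v\n", retries, rcode, err)
	}
}
```

Under these assumptions, the same lookup reports the timeout with 5 retries and no error with 4, purely because of which resolver happened to be tried last.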

Expected Behavior:

For the same query type and host, I expect the same return behavior from DNSX and retryabledns irrespective of how the retries and base resolvers are configured. Returning a new error stating that the max retries were exceeded would be saner than the current behavior.

Steps To Reproduce:

Please use the code and test using CoreDNS that I have put together to demonstrate this issue at . It's a simplified version of the code that pinpoints the issue.

  1. Run CoreDNS in a Docker image with the supplied Corefile. Feel free to use the start/stop/restart scripts. This has an entry that will drop the packet for a lookup on to simulate a slow network or resolver that returns a read timeout. This resolver is on localhost via Docker.
  2. Step through the code with dlv debug main.go
  3. The call stack will be (d *DNSX) QueryOne(hostname string) -> (c *Client) Query(host string, requestType uint16) -> (c *Client) QueryMultiple(host string, requestTypes []uint16) -> (c *Client) queryMultiple(host string, requestTypes []uint16, resolver Resolver) .
  4. The first function is a dnsx function which wraps retryabledns. All the others are retryabledns calls and where I see the issue so I am logging it here.
  5. Set a breakpoint on line 327 of .
  6. Observe that on the first iteration of the loop, an error is returned on line 361 when the DNS query is made over UDP. This is the i/o timeout. Stepping through the code, the loop will continue.
  7. Next iteration of the loop - a different resolver is attempted ( This time a resp and no err is returned on line 361. Stepping through the code, the dnsdata struct is populated, but the loop continues because the Rcode is NXDOMAIN and not SUCCESS.
  8. Step through the loop 3 more times and either err or resp will be returned depending on which was the last resolver.
  9. If localhost was the last resolver, err is returned and percolated up the call chain such that DNSX returns the MOST_RECENTLY_IDENTIFIED_ERROR in the loop.
  10. Rerun the code, but set retries to 4 instead of 5. Observe that err is not returned.
  11. Line 190 inside (c *Client) Do(msg *dns.Msg) returns a new error stating that the max retries were exceeded. Replicating that here could be a reasonable fix.
  12. I am happy to submit a PR with the above fix if that is agreed as the right way to resolve this issue.
  13. The end result was that I ended up coding my own retry loop around the dnsx call because I thought the lib was erroring out early. It turns out it's just propagating the error incorrectly.

Anything else: