Inconsistent and unexpected return behavior from QueryMultiple()
retryabledns version:
v1.0.73
Current Behavior:
I get varying results from QueryMultiple() depending on which resolvers and how many retries are configured. It returns either the most recent error or no error, depending on which resolver was used last and whether that resolver is error-prone. I am working with a slow network and possibly not the best resolvers, but that shouldn't produce inconsistent behavior in the retryabledns library.
Expected Behavior:
For the same query type and host, I should get the same return behavior from DNSX and retryabledns regardless of how the retries and base resolvers are configured. Returning a new error stating that the max retries were exceeded would be more sensible than the current behavior.
Steps To Reproduce:
Please use the code and CoreDNS test setup that I have put together at https://github.com/calab33p/dnsx_bug to reproduce this issue. It is a simplified version of the code that pinpoints the problem.
- Run CoreDNS in a Docker container with the supplied Corefile. Feel free to use the start/stop/restart scripts. The Corefile has an entry that drops the packet for lookups of timeout.example.com, which simulates a slow network or a resolver that returns a read timeout. This resolver runs on localhost via Docker.
- Step through the code with dlv debug main.go
- The call stack will be (d *DNSX) QueryOne(hostname string) -> (c *Client) Query(host string, requestType uint16) -> (c *Client) QueryMultiple(host string, requestTypes []uint16) -> (c *Client) queryMultiple(host string, requestTypes []uint16, resolver Resolver).
- The first function is a dnsx function that wraps retryabledns. All the others are retryabledns calls, which is where I see the issue, so I am logging it here.
- Set a breakpoint on line 327 of github.com/projectdiscovery/retryabledns@v1.0.73/client.go.
- Observe that on the first iteration of the loop, line 361 returns an error when the DNS query is made over UDP. This is the i/o timeout. The loop then continues.
- On the next iteration, a different resolver (8.8.8.8) is tried. This time line 361 returns a resp and no err. The dnsdata struct is populated, but the loop continues because the Rcode is NXDOMAIN and not SUCCESS.
- Step through the loop 3 more times. Either err or resp will be returned, depending on which resolver was used last.
- If localhost was the last resolver, err is returned and propagated up the call chain, so DNSX returns the MOST_RECENTLY_IDENTIFIED_ERROR in the loop.
- Rerun the code, but set retries to 4 instead of 5. Observe that err is not returned.
- Line 190 inside (c *Client) Do(msg *dns.Msg) returns a new error stating that the max retries were exceeded. Replicating that here could be a reasonable fix.
- I am happy to submit a PR with the above fix if we agree that it is the right way to resolve this issue.
- In the end, I coded my own retry loop around the dnsx call because I thought the library was erroring out early. It turns out it is just propagating the error incorrectly.
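The behavior observed in the steps above can be simulated without a network. The following is a rough sketch, not the retryabledns source: `resolver`, `localTimeout`, `publicNXDomain`, and `lastAttemptWins` are hypothetical names, and it assumes the two resolvers are tried in strict alternation starting with localhost, which may differ from how the library actually picks resolvers.

```go
package main

import (
	"errors"
	"fmt"
)

// resolver is a hypothetical stand-in for one DNS exchange; it returns an
// Rcode string on success or a transport error.
type resolver func(host string) (rcode string, err error)

// localTimeout simulates the local CoreDNS instance that drops the packet.
func localTimeout(host string) (string, error) {
	return "", errors.New("i/o timeout")
}

// publicNXDomain simulates a resolver that answers, but with NXDOMAIN.
func publicNXDomain(host string) (string, error) {
	return "NXDOMAIN", nil
}

// lastAttemptWins mimics the reported behavior: err is overwritten on every
// iteration, so the caller only sees the outcome of the final attempt.
func lastAttemptWins(host string, resolvers []resolver, maxRetries int) (string, error) {
	var (
		rcode string
		err   error
	)
	for i := 0; i < maxRetries; i++ {
		var rc string
		rc, err = resolvers[i%len(resolvers)](host)
		if err != nil {
			continue // transport error: move on to the next resolver
		}
		rcode = rc
		if rcode == "NOERROR" {
			break // usable answer: stop retrying
		}
	}
	return rcode, err
}

func main() {
	resolvers := []resolver{localTimeout, publicNXDomain}
	for _, retries := range []int{4, 5} {
		rcode, err := lastAttemptWins("timeout.example.com", resolvers, retries)
		fmt.Printf("retries=%d rcode=%q err=%v\n", retries, rcode, err)
	}
}
```

Under these assumptions, 4 retries end on the NXDOMAIN resolver and return a nil error, while 5 retries end on the timing-out resolver and return the i/o timeout, which matches the inconsistency described in this report.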