AdguardTeam/dnsproxy

Frequent "failed to exchange" errors

timkgh opened this issue · 12 comments

dnsproxy[1303399]: 2024/05/06 00:40:59 [error] dnsproxy: upstream https://1.1.1.2:443/dns-query failed to exchange

dnsproxy v0.71.1

Requests also take a long time to time out when this happens, around 6s.

It happens for both DoH (h3://) and DoT (tls://) upstreams, on Linux both amd64 and arm64. Observed for both Quad9 and Cloudflare upstreams.

We'll need more information about this, could you please record a full log?

Does this help?

2024/05/05 22:01:59 [error] dnsproxy: upstream https://1.1.1.2:443/dns-query failed to exc
hange ;play.google.com.        IN         A in 26.008209ms: requesting https://1.1.1.2:443/dns-query: Get_0rtt "https://1.1.1.2:443/dns-query?dns=AAABAAA...": use of closed network connection

2024/05/06 00:40:59 [error] dnsproxy: upstream https://1.1.1.2:443/dns-query failed to exc
hange ;addons-pa.clients6.google.com.        IN         A in 5.979076ms: requesting https://1.1.1.2:443/dns-query: Get_0rtt "https://1.1.1.2:443/dns-query?dns=AAAB...": H3_NO_ERROR

2024/05/06 11:10:59 [error] dnsproxy: upstream https://1.1.1.2:443/dns-query failed to exchange ;waa-pa.clients6.google.com.        IN         A in 31.069075ms: requesting https://1.1.1.2:443/dns-query: Get_0rtt "https://1.1.1.2:443/dns-query?dns=AAABAAA...": use of closed network connection

2024/05/06 13:05:59 [error] dnsproxy: upstream https://1.1.1.2:443/dns-query failed to exchange ;addons-pa.clients6.google.com.        IN         A in 26.284076ms: requesting https://1.1.1.2:443/dns-query: Get_0rtt "https://1.1.1.2:443/dns-query?dns=AAABAAA...": use of closed network connection

Not really:(

We'll need to see the whole lifetime of a connection to figure out what's going on with them.

Could you please share the log in private via devteam@adguard.com?

H3_NO_ERROR looks strange, doesn't it?

Can this not be reproduced/debugged on your side by using --all-servers with h3:// and tls:// for both 9.9.9.11 and 1.1.1.2? I'm sure you'll get to see these errors after a while.

All those related to H3/Quic may be caused by different quic-go library issues, I am tbh more interested in what you had with DoT.

Errors in the log may be reproduced, at some point the server closes the connection and we handle this gracefully, just repeating the request. I don't yet understand where the 6 seconds timeout is getting from. If it happens with DoT than this is very strange. If it happens with H3/DoQ then it's understandable since quic-go does not handle GOAWAY frames from the server properly at the moment.

The timeouts (or errors that take a long time) happen with TLS too:

2024/05/09 10:38:24 [error] dnsproxy: upstream tls://149.112.112.11:853 failed to exchange ;fireoscaptiveportal.com.        IN         A in 6.112033ms: reading response from tls://149.112.112.11:853: EOF

2024/05/09 14:01:15 [error] dnsproxy: upstream tls://[2620:fe::fe:11]:853 failed to exchange ;vscode-sync.trafficmanager.net.        IN         A in 5.283909ms: reading response from tls://[2620:fe::fe:11]:853: EOF

I inspected the DoT upstream code and tbh I believe this is some network issue and not a bug in dnsproxy.

DoT retries establishing a new connection when there's any error with the cached one but according to your log even though the new connection was established it's getting closed right away. I don't see how dnsproxy itself could do that in the code.

The timeouts (or errors that take a long time) happen with TLS too:

I was suddenly seeing timeouts too with DoT using v0.71.1, so removed all DoT from upstream-servers last week and only using Quic/H3 now without any issues. Strange...

@iJorgen the same DNS upstream?

@iJorgen the same DNS upstream?

Sorry @ameshkov, only keeping logs for a few days and not using DoT upstreams at the moment.
Smells like a new release of DNSProxy soon, so might try DoT again. 😃

I inspected the DoT upstream code and tbh I believe this is some network issue and not a bug in dnsproxy.

DoT retries establishing a new connection when there's any error with the cached one but according to your log even though the new connection was established it's getting closed right away. I don't see how dnsproxy itself could do that in the code.

I observe the same issues on 2 different cloud provider networks, both IPv4 and IPv6, 2 different architectures (Linux amd64 and arm64), 2 different upstreams (Quad9, Cloudflare) and for both DoT and DoH. Maybe there are no actual issues and it's just log noise.

I'm also seeing this with multiple different udp upstreams, v0.71.2 . All requests fail

2024/06/01 00:44:23 [error] dnsproxy: x.x.x.x:53: response received over udp: "exchanging with x.x.x.x:53 over udp: read udp y.y.y.y:55286->x.x.x.x:53: i/o timeout"
2024/06/01 00:44:23 [error] dnsproxy: x.x.x.x:53: response received over udp: "exchanging with x.x.x.x:53 over udp: read udp y.y.y.y:51487->x.x.x.x:53: i/o timeout"
2024/06/01 00:44:23 [error] dnsproxy: x.x.x.x:53: response received over udp: "exchanging with x.x.x.x:53 over udp: read udp y.y.y.y:51189->x.x.x.x:53: i/o timeout"
2024/06/01 00:44:23 [error] dnsproxy: upstream x.x.x.x:53 failed to exchange ;example.com.	IN	 A in 4.000258183s: exchanging with x.x.x.x:53 over udp: read udp y.y.y.y:51189->x.x.x.x:53: i/o timeout
2024/06/01 00:44:23 [error] dnsproxy: upstream x.x.x.x:53 failed to exchange ;example.com.	IN	 AAAA in 4.000582435s: exchanging with x.x.x.x:53 over udp: read udp y.y.y.y:51487->x.x.x.x:53: i/o timeout
2024/06/01 00:44:23 [error] dnsproxy: upstream x.x.x.x:53 failed to exchange ;example.com.	IN	 HTTPS in 4.000550868s: exchanging with x.x.x.x:53 over udp: read udp y.y.y.y:55286->x.x.x.x:53: i/o timeout

Edit: nevermind, it was my firewall issue