dns: overflowing header size
I have two RKE2 clusters. Internally, RKE2 uses a CoreDNS-based service (rke2-coredns-rke2-coredns, image: rancher/hardened-coredns:v1.11.1-build20240305) to provide name resolution for the pods. This service, in turn, forwards queries to the hosts' name resolution for upstream resolving.
Yesterday I switched the hosts' name resolution over to Gravity, and this morning we had quite a few problems in most of the RKE2 pods. The rke2-coredns-rke2-coredns pods were logging a lot of errors of this exact format:
plugin/errors: 2 login.microsoftonline.com. A: dns: overflowing header size
Unfortunately I had to switch back from Gravity to the old resolution service. After I restarted the pods, the error messages immediately disappeared.
It seems this host (login.microsoftonline.com) returns quite a lot of answers; maybe the overflow is related to this?
For the moment I can't replicate the issue: the clusters are semi-production and I cannot experiment with them further. But I'd imagine that setting up a chain like coredns -> gravity -> upstream and then querying coredns for login.microsoftonline.com might replicate the issue.
I spun up a coredns instance (here: 192.168.210.5, port 54) that forwards all queries to my Gravity instance (here: 192.168.210.8). So when querying the coredns instance, the path is client -> coredns -> Gravity -> internet.
If I query coredns for a name that produces a very large response, the nslookup output begins with "Truncated, retrying in TCP mode.", for example with the login.microsoftonline.com lookup (I have yet to find another name that results in such large output).
$ nslookup -port=54 login.microsoftonline.com 192.168.210.5
;; Truncated, retrying in TCP mode.
Server: 192.168.210.5
Address: 192.168.210.5#54
Non-authoritative answer:
login.microsoftonline.com canonical name = login.mso.msidentity.com.
login.mso.msidentity.com canonical name = ak.privatelink.msidentity.com.
ak.privatelink.msidentity.com canonical name = www.tm.ak.prd.aadg.trafficmanager.net.
Name: www.tm.ak.prd.aadg.trafficmanager.net
Address: 20.190.181.5
Name: www.tm.ak.prd.aadg.trafficmanager.net
Address: 40.126.53.9
Name: www.tm.ak.prd.aadg.trafficmanager.net
Address: 20.190.181.4
Name: www.tm.ak.prd.aadg.trafficmanager.net
Address: 40.126.53.17
Name: www.tm.ak.prd.aadg.trafficmanager.net
Address: 40.126.53.8
Name: www.tm.ak.prd.aadg.trafficmanager.net
Address: 40.126.53.10
Name: www.tm.ak.prd.aadg.trafficmanager.net
Address: 40.126.53.7
Name: www.tm.ak.prd.aadg.trafficmanager.net
Address: 20.231.128.67
Querying my Gravity instance directly does not result in truncation or a TCP retry; the output has the same shape, just without the truncation notice.
$ nslookup login.microsoftonline.com 192.168.210.8
Server: 192.168.210.8
Address: 192.168.210.8#53
Non-authoritative answer:
login.microsoftonline.com canonical name = login.mso.msidentity.com.
login.mso.msidentity.com canonical name = ak.privatelink.msidentity.com.
ak.privatelink.msidentity.com canonical name = www.tm.ak.prd.aadg.akadns.net.
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.149
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.83
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.22
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.147.10
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.85
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.19
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.147.5
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.147.9
Now, if I alter the path to swap out Gravity for our old (and working) DNS forwarder:
client -> coredns -> dnsmasq -> internet
even the long login.microsoftonline.com query works without truncation or a retry over TCP.
$ nslookup -port=54 login.microsoftonline.com 192.168.210.5
Server: 192.168.210.5
Address: 192.168.210.5#54
Non-authoritative answer:
login.microsoftonline.com canonical name = login.mso.msidentity.com.
login.mso.msidentity.com canonical name = ak.privatelink.msidentity.com.
ak.privatelink.msidentity.com canonical name = www.tm.ak.prd.aadg.akadns.net.
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.147.10
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.85
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.19
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.147.5
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.147.9
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.149
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.83
Name: www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.22
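nslookup retries over TCP on its own, which hides what each resolver actually returned over UDP. Below is a minimal sketch for comparing the paths above at the UDP level, using only the Python standard library: it sends a single A query and reports the response size, the TC (truncated) flag and the answer count. The function names (build_query, summarize, probe), the fixed transaction ID and the optional EDNS payload size are illustrative assumptions and are not taken from Gravity or CoreDNS.

```python
import socket
import struct


def build_query(name, qtype=1, txid=0x1234, edns_size=None):
    """Build a DNS query for `name` (qtype 1 = A).

    If edns_size is given, an EDNS0 OPT record advertising that UDP
    payload size is appended; otherwise the query is plain DNS.
    """
    arcount = 1 if edns_size else 0
    # Flags 0x0100 = standard query with the RD (recursion desired) bit set.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, arcount)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # class IN
    opt = b""
    if edns_size:
        # OPT pseudo-record: root name, type 41, class = payload size,
        # TTL 0 (no extended flags), RDLENGTH 0.
        opt = b"\x00" + struct.pack("!HHIH", 41, edns_size, 0, 0)
    return header + question + opt


def summarize(response):
    """Extract size, TC flag, RCODE and answer count from a raw response."""
    _txid, flags, _qd, an, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return {
        "size": len(response),
        "truncated": bool(flags & 0x0200),  # TC bit
        "rcode": flags & 0x000F,
        "answers": an,
    }


def probe(server, port, name, edns_size=None, timeout=3.0):
    """Send one UDP query (no TCP fallback) and summarize the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, edns_size=edns_size), (server, port))
        data, _ = sock.recvfrom(65535)
    return summarize(data)


# Example calls (addresses from the setup described above):
#   probe("192.168.210.5", 54, "login.microsoftonline.com")
#   probe("192.168.210.8", 53, "login.microsoftonline.com", edns_size=1232)
```

Running probe against coredns and against Gravity directly, with and without edns_size, should show whether the forwarder sets the TC flag or returns an oversized UDP reply; which of these actually happens here is not something the output above settles.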
So there appears to be something that does not work correctly with Gravity but does work when Gravity is replaced by dnsmasq. Let me know if I can provide logs, pcaps, or anything else.
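One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that "overflowing header size" is reported when a record's declared RDLENGTH runs past the end of the packet, for instance if a reply was cut at a byte limit in the middle of a record. If a pcap is captured, the raw DNS payload of a suspicious reply could be checked with a sketch like the following. The helper names (skip_name, check_records) are hypothetical, and the walk follows the RFC 1035 wire format without decoding record data.

```python
import struct


def skip_name(msg, off):
    """Return the offset just past the (possibly compressed) name at `off`."""
    while True:
        if off >= len(msg):
            raise ValueError("name runs past end of message")
        length = msg[off]
        if length & 0xC0 == 0xC0:
            # Compression pointer: two bytes, always terminates the name.
            if off + 2 > len(msg):
                raise ValueError("compression pointer runs past end of message")
            return off + 2
        if length == 0:
            return off + 1  # root label ends the name
        off += 1 + length


def check_records(msg):
    """Walk all records in a raw DNS message.

    Returns None if every record fits inside the message, otherwise a
    short description of the first inconsistency found.
    """
    if len(msg) < 12:
        return "message shorter than the 12-byte DNS header"
    _txid, _flags, qd, an, ns, ar = struct.unpack("!HHHHHH", msg[:12])
    off = 12
    try:
        for _ in range(qd):
            off = skip_name(msg, off) + 4  # QTYPE + QCLASS
        for i in range(an + ns + ar):
            off = skip_name(msg, off)
            if off + 10 > len(msg):
                return f"record {i}: fixed fields run past end of message"
            # TYPE, CLASS, TTL, RDLENGTH
            _rtype, _rclass, _ttl, rdlength = struct.unpack(
                "!HHIH", msg[off:off + 10]
            )
            off += 10
            if off + rdlength > len(msg):
                over = off + rdlength - len(msg)
                return f"record {i}: rdlength {rdlength} overruns message by {over} bytes"
            off += rdlength
    except ValueError as exc:
        return str(exc)
    return None
```

A result of None for replies from dnsmasq but an overrun for replies from Gravity would support the assumption above; any other outcome would point elsewhere.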