NLnetLabs/unbound

Priming the root key fails when RPZ is enabled in Unbound 1.20.0 or later

Opened this issue · 11 comments

Describe the bug
After updating Unbound from 1.19.1 to 1.20.0, trust anchor auto-update fails.
The first time Unbound starts, it loads the trust anchor and resolves the domain name.
However, when the trust anchor expires after one day, it is not automatically renewed and domain resolution fails.

To reproduce
Steps to reproduce the behavior:

  1. Start Unbound.
    →Trust anchor is loaded.
  2. The trust anchor expires after one day.
    →It is not automatically renewed and domain resolution fails.
    At this time, the contents of the root.key file were also not updated.

Expected behavior
Trust anchor is automatically updated.

System:

  • Unbound version: 1.20.0 , 1.21.0
  • OS: RHEL8

Additional information
Execute dig(failure case)

[root@xxx ~]# dig @(unbound server) www.google.com

; <<>> DiG 9.11.36-RedHat-9.11.36-11.el8_9.1 <<>> @(unbound server) www.google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 34544
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;www.google.com.                        IN      A

;; Query time: 286 msec
;; SERVER: xxx.xx.xx.xx#53(xxx.xx.xx.xx)
;; WHEN: Thu Sep 12 16:03:38 JST 2024
;; MSG SIZE  rcvd: 43

Execute dig(dnnsec)

[root@xxx ~]# dig +dnssec @(unbound server) jprs.jp

; <<>> DiG 9.11.36-RedHat-9.11.36-11.el8_9.1 <<>> +dnssec @(unbound server) jprs.jp
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14244
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
;; QUESTION SECTION:
;jprs.jp.                       IN      A

;; ANSWER SECTION:
jprs.jp.                300     IN      A       117.104.133.164
jprs.jp.                300     IN      RRSIG   A 8 2 300 20241005023003 20240905023003 61491 jprs.jp. Bmdphm9M3RVFGlrmO2Kt4bSKTvyMzr2/S/rbgjCAd4ImgTo7eAlZAEWd fZ8rkhv70sdCfWYNKPLewUqV8ofUTsssP9NxXy+dh8SJaA6Ur8ADxVre EhITCTrcg1Gv36yrw1Z6T2wHaiu3kjXjpB38d6W4dpcEVDnr6iAo8AZ+ OJI=

;; Query time: 17 msec
;; SERVER: xxx.xx.xx.xx#53(xxx.xx.xx.xx)
;; WHEN: Thu Sep 12 16:05:05 JST 2024
;; MSG SIZE  rcvd: 219

Error Log

unbound[1386]: [1386:1] info: failed to prime trust anchor -- could not fetch DNSKEY rrset . DNSKEY IN 

If Unbound is restarted before the trust anchor expires, it will not be able to resolve domain.
The condition for this event to occur was when RPZ was enabled. The update was possible when RPZ was disabled.
In addition to RPZ, the following settings are in place where DNSSEC is concerned.

	module-config: "respip validator iterator"
	auto-trust-anchor-file: "/var/unbound/root.key" 

Also, I have verified that the root.key permissions are set appropriately.

It seems this is not really about trust anchor update, but about priming the root key? Because that is what the error says. But the issue talks about trust anchor update failure, where is that update happening, or was that a terminology mistake?

Of course unbound needs a working trust anchor, and for the root that means for all domain names, that are not covered by other trust anchors or domain insecure points. The error says validation of the trust anchor fails, saying it could not fetch the root DNSKEY RRset. And the issue is that this fails when RPZ is enabled. Is RPZ blocking the lookup of the root DNSKEY? That could depend on the contents of the RPZ file. The contents of it with respect to the root DNSKEY lookup, what is that contents?

It may be useful to enable verbosity at a level of 4 or 5, and then log to file, and then have unbound start with RPZ enabled, so that it can catch the error with debug output enabled. That could show in the debug logs, where the lookup of the root DNSKEY RRset is failing. Set rpz-log: yes and perhaps also rpz-log-name: "myrpz" so that unbound logs rpz applications in detail.

There have been fixes in the unbound 1.20.0 release, and recent releases, for RPZ processing, in particular in the iterator. So it seems likely that the update causes the RPZ zone to be applied at a different point in the code.

That was a terminological mistake.
Set verbosity to 5 and run dig on www.google.com.
Excerpts of large differences from logs with successful name resolution.

unbound[3543179]: [3543179:0] info: prime trust anchor
unbound[3543179]: [3543179:0] info: generate request . DNSKEY IN
unbound[3543179]: [3543179:0] info: generate keytag query _ta-4f66. NULL IN
unbound[3543179]: [3543179:0] info: generate request _ta-4f66. NULL IN
unbound[3543179]: [3543179:0] info: respip operate: query . DNSKEY IN
unbound[3543179]: [3543179:0] info: validator operate: query . DNSKEY IN
unbound[3543179]: [3543179:0] info: resolving . DNSKEY IN
unbound[3543179]: [3543179:0] info: rpz: applied [XXX.XXX] . rpz-local-data  . DNSKEY IN
unbound[3543179]: [3543179:0] info: validator operate: query . DNSKEY IN
unbound[3543179]: [3543179:0] info: respip operate: query . DNSKEY IN
unbound[3543179]: [3543179:0] info: validator: inform_super, sub is . DNSKEY IN
unbound[3543179]: [3543179:0] info: super is www.google.com. A IN
unbound[3543179]: [3543179:0] info: autotrust process for . DNSKEY IN
unbound[3543179]: [3543179:0] info: failed to prime trust anchor -- could not fetch DNSKEY rrset . DNSKEY IN
unbound[3543179]: [3543179:0] info: validator operate: query www.google.com. A IN
unbound[3543179]: [3543179:0] info: no signer, using www.google.com. TYPE0 CLASS0
unbound[3543179]: [3543179:0] info: prime trust anchor
unbound[3543179]: [3543179:0] info: generate request . DNSKEY IN
unbound[3543179]: [3543179:0] info: generate keytag query _ta-4f66. NULL IN
unbound[3543179]: [3543179:0] info: generate request _ta-4f66. NULL IN
unbound[3543179]: [3543179:0] info: respip operate: query . DNSKEY IN
unbound[3543179]: [3543179:0] info: validator operate: query . DNSKEY IN
unbound[3543179]: [3543179:0] info: resolving . DNSKEY IN
unbound[3543179]: [3543179:0] info: rpz: applied [XXX.XXX] . rpz-local-data  . DNSKEY IN
unbound[3543179]: [3543179:0] info: validator operate: query . DNSKEY IN
unbound[3543179]: [3543179:0] info: respip operate: query . DNSKEY IN
unbound[3543179]: [3543179:0] info: validator: inform_super, sub is . DNSKEY IN
unbound[3543179]: [3543179:0] info: super is www.google.com. A IN
unbound[3543179]: [3543179:0] info: autotrust process for . DNSKEY IN
unbound[3543179]: [3543179:0] info: failed to prime trust anchor -- could not fetch DNSKEY rrset . DNSKEY IN
unbound[3543179]: [3543179:0] info: validator operate: query www.google.com. A IN
unbound[3543179]: [3543179:0] info: no signer, using www.google.com. TYPE0 CLASS0

After the above is repeated several times, the following log is recorded.

unbound[3543179]: [3543179:0] info: Could not establish a chain of trust to keys for . DNSKEY IN

Also, when the unbound service was restarted and the root.key update failed, the following log was recorded.

unbound[28383]: [28383:0] info: autotrust probe . DNSKEY IN
unbound[28383]: [28383:0] info: respip operate: query . DNSKEY IN
unbound[28383]: [28383:0] info: validator operate: query . DNSKEY IN
unbound[28383]: [28383:0] info: resolving . DNSKEY IN
unbound[28383]: [28383:0] info: rpz: applied [XXX.XXX] . rpz-local-data  . DNSKEY IN
unbound[28383]: [28383:0] info: validator operate: query . DNSKEY IN
unbound[28383]: [28383:0] info: respip operate: query . DNSKEY IN
unbound[28383]: [28383:0] info: mesh_run: end 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 1 recursion replies sent, 0 replies dropped, 0 states jostled out
unbound[28383]: [28383:0] info: average recursion processing time 0.328093 sec
unbound[28383]: [28383:0] info: histogram of recursion processing times
unbound[28383]: [28383:0] info: [25%]=0 median[50%]=0 [75%]=0
unbound[28383]: [28383:0] info: lower(secs) upper(secs) recursions
unbound[28383]: [28383:0] info:    0.262144    0.524288 1

What settings in the contents of the rpz file would cause this behavior?

unbound[3543179]: [3543179:0] info: rpz: applied [XXX.XXX] . rpz-local-data  . DNSKEY IN
unbound[3543179]: [3543179:0] info: failed to prime trust anchor -- could not fetch DNSKEY rrset . DNSKEY IN

From the first line we see that Unbound is not allowed to resolve and a local-data reply is coming from the RPZ.
In the second line we see a complain that the validator does not have a DNSKEY record to operate on.

I believe an entry in the RPZ zone matches and returns a local record to the . DNSKEY query.

Do you get a different behavior with the same RPZ file (with the exact same contents) and Unbound 1.19.1?

Using the same RPZ file (with exactly the same contents), the root key priming was successful up to Unbound 1.19.3.
On the other hand, Unbound 1.20.0 or later failed, and I have not been able to figure out the cause.

Is it possible to share the RPZ file? You can find my email at https://nlnetlabs.nl/people/.

@gthess
Thank you for your support.
I shared the RPZ file via email.

I believe the problem lies with the TXT record you have at the apex. This means that there are local data for . and then RPZ blocks all queries for . except for the TXT record. Can you remove that record and try again?

You can easily see if you block . by a quick dig @(unbound server) . NS. That should return the root delegation.

yukjn4 commented

The TXT record for the apex domain has been deleted and the problem has been confirmed resolved.
Thank you for your support.
Before we close this issue, I would like to ask,
is this behavior originally expected and was incomplete in versions prior to 1.20.0?

This behavior is according to the RPZ specification. 1.20.0 introduced numerous RPZ fixes and some of them were about local data and passthru.