Homas/ioc2rpz

pdns-recursor lost connection

mk-git opened this issue · 12 comments

Hi

Now another observation:
I have two BIND servers (9.10.3-P4-Debian) and two PowerDNS Recursors (Debian 4.1.11) that query the ioc2rpz server.

I have set the SOA refresh time for a test RPZ zone to 60 seconds, and I can see all four servers cleanly requesting an IXFR transfer every minute. But as soon as I reload the ioc2rpz configuration with

curl -i -u "tkey_mgmt_1:XXXXXXXXXXXXX==" --insecure -H "Accept: text/plain" https://127.0.0.1:8443/api/mgmt/reload_cfg

I see at most one more IXFR query from the PDNS servers, and after that there is no more communication with ioc2rpz.
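To reproduce the reload step repeatedly from a script, the curl call above can be expressed with Python's standard library. This is a minimal sketch that assumes the endpoint, port and basic-auth credentials work exactly as in the curl command; `build_reload_request` and `reload_config` are hypothetical helper names, not part of ioc2rpz.

```python
import base64
import ssl
import urllib.request


def build_reload_request(user: str, secret: str,
                         base_url: str = "https://127.0.0.1:8443") -> urllib.request.Request:
    """Build the same GET request as the curl command: basic auth plus Accept: text/plain."""
    token = base64.b64encode(f"{user}:{secret}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/api/mgmt/reload_cfg",
        headers={"Authorization": f"Basic {token}", "Accept": "text/plain"},
        method="GET",
    )


def reload_config(user: str, secret: str) -> str:
    """Send the reload request and return the plain-text response body."""
    # Equivalent of curl --insecure: skip certificate verification.
    # Only reasonable for a self-signed certificate on localhost.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(build_reload_request(user, secret), context=ctx) as resp:
        return resp.read().decode()
```

Usage would be `reload_config("tkey_mgmt_1", "<secret>")`, with the real management key substituted for the placeholder.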

After that I have to restart the pdns-recursor service on both PDNS servers to restore communication. The two BIND servers continue transferring the zone without any problem.

Do you have an idea what this could be?

Here are the recursor.lua config and the BIND config:

rpzMaster("xxx.xx.1xx.123", "sXXXXX-test.ioc2rpz", { tsigname="querykey", tsigalgo="hmac-md5", tsigsecret="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX=="})

zone "sXXXX-test.ioc2rpz" {
        type slave;
        file "/etc/bind/zones/black/sXXXX-test.ioc2rpz";
        masters {xxx.xx.1xx.123 key "querykey";};
        allow-query { any; };

        masterfile format text;
};
Homas commented
  1. Was the zone serial updated? Did you observe SOA requests before the IXFR requests?
  2. What PowerDNS version are you using?
    pdns recently (in version 4.3.1) changed how it checks for zone updates and introduced a new parameter, "refresh". I've observed some weird behaviour on the community website's DNS but haven't had a chance to validate it in my lab yet.

On the two PowerDNS Recursor servers I use the Debian package, version 4.1.11.

The zone serial number should change when the entries (blacklist.txt) are changed, correct? --> I checked; it does.

I just adjusted the blacklist and now have to wait 900 s until ioc2rpz reloads the file to see whether the serial number changes. Again, the two BIND servers have always applied the changes. --> I checked: the serial number has changed, and at least the BIND servers picked it up.

--> After adjusting blacklist.txt I did not reload ioc2rpz or change anything else; I only waited the 900 seconds.

Could it also be that once the recursor loses its connection to the originally opened DNS thread of the ioc2rpz server (because ioc2rpz was reloaded), it never tries to establish a new connection because the target socket is closed?

There was an issue with pdns (PowerDNS/pdns#4678), but it seems to have been fixed in 4.1.x.

A bit later: as soon as the zone's serial number increases, the zone is updated on the BIND servers. The two PowerDNS Recursors do NOT perform any more IXFR transfers for the zone from the moment the serial number changes. I also don't see any other log entry in pdns that points to anything. There is simply no more communication with ioc2rpz.

Should I see the SOA requests in the log?
--> Ah, I saw something in the ioc2rpz logs. The BIND servers make an SOA request to ioc2rpz every 2 minutes.

CEF:0|ioc2rpz|ioc2rpz|1.1.2.1-2020072101|000202|DNS Query|3|src=8x.xxx.xx.xx spt=37323 proto=udp qname="sXXXXX.ioc2rpz" qtype="SOA" qclass="IN" tsigkey="querykey.

From the two PDNS servers I see NO SOA requests.
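The BIND behaviour seen in the ioc2rpz log (periodic SOA queries, then a transfer only when the serial has grown) is the standard refresh logic of a secondary. The sketch below models one refresh tick using RFC 1982 serial arithmetic. It illustrates generic secondary behaviour only and is not ioc2rpz, BIND or pdns code; `serial_gt` and `refresh_step` are hypothetical names.

```python
SERIAL_BITS = 32
HALF = 2 ** (SERIAL_BITS - 1)


def serial_gt(s1: int, s2: int) -> bool:
    """RFC 1982 serial arithmetic: True if s1 is 'greater than' (newer than) s2."""
    return (s1 < s2 and s2 - s1 > HALF) or (s1 > s2 and s1 - s2 < HALF)


def refresh_step(local_serial: int, primary_serial: int) -> str:
    """One refresh-timer tick of a typical secondary: after querying the
    primary's SOA, request a transfer only if the primary's serial is newer."""
    if serial_gt(primary_serial, local_serial):
        return "request IXFR"
    return "wait for next refresh"
```

With the serials from the recursor log later in this thread, `refresh_step(1598243640, 1598244600)` asks for a transfer, while equal serials mean waiting for the next refresh. The modular comparison also keeps working if a serial ever wraps past 2^32.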

Homas commented

Thanks for the extensive details. I'll try to reproduce it.

Homas commented

I was able to reproduce the issue. I'll investigate.

By the way, I saw on your GitHub Sponsors page that you need server capacity. How can I contact you outside GitHub? By email?

Homas commented

A short update.

  1. I've identified the function on the pdns side that appears to be stuck in an infinite loop. I need to spend more time to understand and debug their code. Zone transfers from BIND work without issues. There are slight differences between the ioc2rpz and BIND responses, but both comply with the RFC.
    As a temporary workaround, pdns can pull the RPZ feeds from BIND; ioc2rpz can send notifications on zone updates.

  2. There is one more issue in pdns: it does not request the SOA record before an IXFR. In response to an IXFR request, a DNS server can return the full zone (AXFR format), which can be pretty big.
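The point about full-zone replies can be made concrete with the answer formats RFC 1995 (section 4) defines for IXFR: a lone SOA, a whole zone in AXFR format, or incremental difference sequences. The sketch below classifies an answer section modelled as a hypothetical list of `(rrtype, serial)` pairs. It is a simplified reading of the RFC, not the pdns or ioc2rpz implementation, and `classify_ixfr_response` is an illustrative name.

```python
from typing import List, Optional, Tuple

# Hypothetical model of one answer record: (rrtype, SOA serial or None).
Record = Tuple[str, Optional[int]]

HALF = 2 ** 31  # RFC 1982 with SERIAL_BITS = 32


def serial_gt(s1: int, s2: int) -> bool:
    """RFC 1982 comparison: True if serial s1 is newer than s2."""
    return (s1 < s2 and s2 - s1 > HALF) or (s1 > s2 and s1 - s2 < HALF)


def classify_ixfr_response(records: List[Record], client_serial: int) -> str:
    """Classify an IXFR answer section along the lines of RFC 1995, section 4."""
    if not records or records[0][0] != "SOA":
        raise ValueError("an IXFR answer must start with the server's current SOA")
    server_serial = records[0][1]
    if len(records) == 1:
        # A lone SOA: the client is already current, or (for a UDP query) the
        # reply did not fit and the client should retry over TCP.
        return "retry-tcp" if serial_gt(server_serial, client_serial) else "up-to-date"
    if records[1][0] != "SOA":
        # Second record is not an SOA: the whole zone was sent in AXFR format.
        # This appears to match the pdns log line "IXFR update is a whole new zone".
        return "full-zone"
    # SOA(new), SOA(old), deletions, SOA(new), additions, ...: difference sequences.
    return "incremental"
```

A client that first compares the SOA serial can skip the IXFR entirely when nothing changed, which is why the missing SOA query matters for large feeds.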

You can contact me by email: feedback (at) ioc2rpz [.] net

Homas commented

I've found the issue on the pdns side. I'll report it to the pdns team, but first I need to check whether I can adjust the behaviour on the ioc2rpz side.

Homas commented

I've fixed the issue on the ioc2rpz side in the dev branch. I need to test it in production for a while before merging to master.
I'll follow up with pdns folks regarding a bug on their side.
I'll also open one more bug to track why ioc2rpz responds with the full zone transfer on IXFR requests from pdns.

It seems to be working. I made a new entry in the test blacklist.txt, and after 900 seconds the following log messages appeared (which look correct). The recursor has now recorded the new serial number and keeps making IXFR requests. Cool!

Do I need to worry about the problem you found in the recursor?

Small side question: wouldn't it make sense to have the "HotCacheTime" option in the config file, so you don't have to rebuild the container every time you want to change it? Or should I create a new issue for it?

Aug 24 06:50:35 nscX pdns_recursor[112649]: Getting IXFR deltas for XXXXX-test.ioc2rpz from 123.123.123.123:53, our serial: 1598243640
Aug 24 06:50:35 nscX pdns_recursor[112649]: Processing 1 delta for RPZ XXXXXX-test.ioc2rpz
Aug 24 06:50:35 nscX pdns_recursor[112649]: IXFR update is a whole new zone
Aug 24 06:50:35 nscX pdns_recursor[112649]: Had 0 RPZ removals, 18 additions for XXXXX-test.ioc2rpz New serial: 1598244600
Aug 24 06:51:35 nscX pdns_recursor[112649]: Getting IXFR deltas for XXXXX-test.ioc2rpz from 123.123.123.123:53, our serial: 1598244600
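To check from the recursor's syslog that the serial keeps advancing (rather than stalling after a reload, as in the original report), a small parser over lines like the ones above can help. This is a sketch: the regular expressions are written against these exact pdns_recursor 4.1.11 messages and are an assumption for other versions; `serial_history` is a hypothetical helper.

```python
import re
from typing import Iterable, List

# Written against the 4.1.11 log lines quoted above; other versions may differ.
OUR_SERIAL = re.compile(r"Getting IXFR deltas for (\S+) from \S+, our serial: (\d+)")
NEW_SERIAL = re.compile(r"RPZ removals, \d+ additions for (\S+) New serial: (\d+)")


def serial_history(lines: Iterable[str]) -> List[int]:
    """Return the serials the recursor reports, in log order,
    collapsing consecutive repeats of the same serial."""
    serials: List[int] = []
    for line in lines:
        m = OUR_SERIAL.search(line) or NEW_SERIAL.search(line)
        if m:
            serial = int(m.group(2))
            if not serials or serials[-1] != serial:
                serials.append(serial)
    return serials
```

Fed the five log lines above, this yields the old and new serials in order. If the "Getting IXFR deltas" lines stop appearing altogether, that is the stalled state described at the start of the issue.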
Homas commented

I was considering that.
What is the use case for reducing HotCacheTime? Do you need full zone updates more often than every 900 seconds?
There is a HotCacheTimeIXFR parameter, which is used for incremental updates.

Previous discussion: #10

In one of our previous discussions you told me that if I want a changed file (blacklist.txt) to be re-read faster than every 900 seconds (instead of being served from the cache), I have to change the HotCacheTime setting in ioc2rpz.hrl (and then rebuild the container).

Currently I add a new test entry (exampleX.com) and then have to wait until ioc2rpz picks up the new entry and increases the serial number before I can check on the PDNS and BIND servers whether the change was accepted.

Especially when testing the bugs mentioned above, it would help if I didn't have to wait 900 seconds for the changed file to be read in. So I thought it would be easier if the setting could be configured in the normal config.

Otherwise, I set the values for my test zone to 60/60/120/60/60.

Another use case: we receive information about a phishing campaign targeting our customers and want to block the target domain as soon as possible. If I add it to our blacklist, it takes about 900 seconds until the change reaches ioc2rpz and is distributed. Or have I misunderstood something?
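The "about 900 seconds" estimate can be written out as simple arithmetic. This is a rough model under stated assumptions, not documented ioc2rpz behaviour: it assumes ioc2rpz re-reads a source at most every `hot_cache_time` seconds, that a secondary without NOTIFY notices the new serial only on its next SOA refresh, and that transfer time is negligible. `worst_case_delay` is a hypothetical name.

```python
def worst_case_delay(hot_cache_time: int, soa_refresh: int, notify: bool = False) -> int:
    """Rough upper bound, in seconds, before a new blocklist entry reaches a secondary.

    Assumes the source is re-read at most every hot_cache_time seconds and that,
    without NOTIFY, the secondary only sees the new serial on its next SOA
    refresh. With NOTIFY the polling wait is assumed to drop out.
    Transfer time is ignored.
    """
    return hot_cache_time + (0 if notify else soa_refresh)
```

With HotCacheTime at 900 s and the 60 s refresh of the test zone, this gives about 960 s without NOTIFY and about 900 s with it. For incremental updates the same arithmetic would apply with HotCacheTimeIXFR in place of HotCacheTime, although its value is not stated in this thread.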

Homas commented

No, you understood everything; I was just wondering why a 15-minute hot cache would not work. OSINT feeds are not updated frequently, and I don't think paid TI is updated more often than every 15 minutes (usually a few times a day).

Anyway, if you only care about adding new indicators (every 60 seconds), then incremental zone updates should work. A full zone update is required when indicators without an expiration date need to be removed.
With a full zone update, the entire history of incremental changes is lost and secondary servers receive the full zone.

I've considered adding the options per server or per source. I'll add it to the TODO list. It will also require changes in the UX/UI.

I created an enhancement request: #32.