nsd 4.8 stats with implausible spikes
and0x000 opened this issue ยท 8 comments
With nsd 4.8 we see unexplainable spikes from the num.rcode
and num.type
metrics.
The spikes seem to appear at a fixed frequency. The data is implausible in the context of other networking metrics for the system (such as the overall number of packets). Also, the spikes do not show up for nsd 4.7 on a comparable system.
(Graph from graphana, data comes from a custom crawler that queries nsd-control stats
every 30 seconds or so)
I'd assume that this has something to do with the changes introduced in #305 but I lack the understanding of C
to pinpoint anything.
NSD version 4.8.0
Written by NLnet Labs.
Configure line: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-configdir=/etc/nsd --with-nsd_conf_file=/etc/nsd/nsd.conf --with-pidfile=/run/nsd/nsd.pid --with-dbfile=/var/lib/nsd/nsd.db --with-zonesdir=/etc/nsd --with-xfrdfile=/var/lib/nsd/xfrd.state --disable-largefile --disable-recvmmsg --enable-root-server --enable-mmap --enable-ratelimit --enable-zone-stats --enable-systemd --enable-checking --enable-dnstap
Event loop: libevent 2.1.12-stable (uses epoll)
Linked with OpenSSL 1.1.1w 11 Sep 2023
Copyright (C) 2001-2020 NLnet Labs. This is free software.
There is NO warranty; not even for MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.
Hi @and0x000. It's certainly suspicious that these started appearing with 4.8.0, but I'll have to investigate this first before I can say anything useful. I'll have a look.
@and0x000, I trust you're using the exact same scripts with NSD 4.7? Would it be possible to post the correcsponding graph for the number of queries? It would be useful to know if this problem is limited to the type
and rcode
counters.
Hi @k0ekk0ek , thank you for looking into it.
Yes, the crawler script is identical for both NSD 4.7 and 4.8. So is the rest of the system. Only the NSD packet was updated.
It doesn't seem to be limited to the type
and rcode
counters, they were only the ones for which I had immediate graphs at hand.
From what I can see, the following metrics are affected:
num.queries
server<#>.queries
num.truncated
num.dropped
num.type
num.rcode
num.class
num.[udp|tcp|tls]
My hypothesis is that the copying of statistics is not an atomic operation. So when memcpy
'ing the statistics for reporting an update might occur that corrupts the values in memory. I could reimplement the logic, I'd need to do that anyway, but it might be worth it to implement a stress-test. That doesn't guarantee we'll hit the issue though... I need to give this some more thought
@and0x000, would it be possible to collect the raw numbers for a couple of days in text files? I've been through the code, the logic isn't all that complicated, so if the numbers really don't add up (as in, they wrap around or similar) I don't see how it's anything other than my non-atomic hypothesis and we'll have to add somekind of synchronization mechanism. It's just that we'll have to insert extra logic for each query that's processed, which is not desirable(?)
@and0x000, would it be possible to collect the raw numbers for a couple of days in text files?
Yea, I'll write the raw stats output to a file. With the frequency of the spikes occurring, I believe there should be enough data to compare against within a few hours. Is there a way that I can provide you with the raw data? 'Cause I don't necessarily want it to end up public.
Thanks! Yes, you may send it via email to jeroen -at- nlnetlabs -dot- nl. Alternatively, if it's to much data for email, we could use some file sharing service.
The commit fixes it, thanks for the report! The detailed logs revealed that the statistic values themselves are fine, and do not show the spikes. But the time.elapsed printout is wrong. It turns out that the change in 4.8 causes the stats_noreset lookup to reset the timer for the time.elapsed printout. This was not the behaviour before that. When the elapsed time is very small, because another stats lookup just happened, then the numqueries divided by time becomes the very large spike value. The commit fixes it, the other stats lookup does not reset the time, and thus the elapsed time should be printed out correctly.