munin-monitoring/contrib

/proc/pagetypeinfo: Cannot convert '>100000' to float

dwreski opened this issue · 13 comments

Hi,
I'm having an issue with what I believe is the meminfo plugin on Fedora 33 (although this error has been around forever).

2020/12/14 12:15:18 [ERROR] In RRD: Error updating /var/lib/munin/bwimail03/bwimail03-pagetypeinfo-n0_zNormal_tMovable-fp_n0_zNormal_tMovable_o0-g.rrd: /var/lib/munin/bwimail03/bwimail03-pagetypeinfo-n0_zNormal_tMovable-fp_n0_zNormal_tMovable_o0-g.rrd: Function update_pdp_prep, case DST_GAUGE - Cannot convert '>100000' to float

Maybe it needs to be cast as a "long float" somewhere?

I believe it's with the load_pagetypeinfo function, but I haven't attempted to troubleshoot it fully.

I can paste my /proc/pagetypeinfo contents here, but it's unlikely it would format correctly. I also don't think my file is unique where it would even make a difference.

There are also many lines like this, going back for many years. I don't know if it's related or if I should open another ticket, but I also have no idea how to troubleshoot this or how to obtain more info to troubleshoot it.

2020/12/14 12:35:12 [WARNING] 20 lines had errors while 2483 lines were correct in data from 'config meminfo' on cipher/209.216.11.60/4949

There's a huge amount of output from "munin-run meminfo config" but no obvious errors.

Ideas greatly appreciated, and I'll help to provide as much info as I can.

Interesting!

Could you show the output of munin-run meminfo, please?

I suspect that one of the fields contains the literal string >100000.
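One way to test this suspicion is to filter the plugin's fetch output for values that are not plain numbers. The following is only a minimal sketch: the function name `find_non_numeric_values` is hypothetical, and it assumes the simple `<field>.value <data>` line format (no `timestamp:value` pairs).

```sh
# Hypothetical helper: read plugin fetch output on stdin and print every
# "<field>.value <data>" line whose data is neither a plain number nor
# "U" (munin's marker for an unknown value). Empty values are printed too.
find_non_numeric_values() {
    awk '
        $1 ~ /\.value$/ {
            v = $2
            if (v != "U" && v !~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/)
                print
        }
    '
}
```

It could be used as `munin-run meminfo | find_non_numeric_values`. No output would mean that every value at least looks numeric.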

Here is the output from one system attached here, although it happens on every system. A literal "100000" doesn't appear anywhere in the output or in the meminfo plugin. Also, "phisical" is spelled incorrectly throughout - it should be "physical".

munin-meminfo.txt

After taking a look at your data, I suspect that the master cannot handle 64-bit integer values. At least I noticed that around 20 values in your output are bigger than 2^32. Is the master a 32-bit host?

This would help my understanding of the problem. But it should of course be fixed, even if it is a 32-bit system.
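The observation about values above 2^32 can be reproduced with a small filter. This is a sketch under assumptions: the function name `find_values_over_32bit` is hypothetical, and only plain non-negative decimal values are considered.

```sh
# Hypothetical helper: read plugin fetch output on stdin and print the
# lines whose numeric value is bigger than 2^32 (4294967296).
find_values_over_32bit() {
    awk '
        $1 ~ /\.value$/ && $2 ~ /^[0-9]+(\.[0-9]+)?$/ && $2 + 0 > 4294967296 {
            print
        }
    '
}
```

For example, `munin-run meminfo | find_values_over_32bit | wc -l` counts the affected fields. On the master, `uname -m` reports the machine architecture (for example `x86_64`), which answers the 32-bit question.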

Also, "phisical" is spelled incorrectly throughout - it should be "physical"

Yes, that is an annoyance, but I am hesitant to fix this typo (being visible only in the filenames), since it would break the history for these graphs.

Regarding the line in your log:

2020/12/14 12:35:12 [WARNING] 20 lines had errors while 2483 lines were correct in data from 'config meminfo' on cipher/209.216.11.60/4949

Did you notice other errors or warnings just above this one? I assume that munin should emit a warning for each problematic line.

If there are no error messages, then please find the file Munin/Master/Node.pm and replace DEBUG with WARN in the following two lines:

  • DEBUG "[DEBUG] Protocol exception: unrecognized line '$line' from $plugin on $nodedesignation.\n";
  • DEBUG "[DEBUG] Protocol exception while fetching '$service' from $plugin on $nodedesignation: unrecognized line '$line'";

The next run of munin-update (every five minutes) should expose these interesting log messages.
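The replacement described above could also be scripted. The sketch below is not an official procedure: the helper name `promote_protocol_exceptions` is hypothetical, and the location of Munin/Master/Node.pm differs between distributions, so the path has to be supplied by hand. It writes a modified copy instead of editing the file in place.

```sh
# Hypothetical helper: write a copy of Node.pm in which the two
# "Protocol exception" DEBUG calls are switched to WARN.
#   $1 = path to Munin/Master/Node.pm (assumption: supplied by the caller)
#   $2 = output file for the modified copy
promote_protocol_exceptions() {
    sed 's/DEBUG "\[DEBUG] Protocol exception/WARN "[DEBUG] Protocol exception/' "$1" > "$2"
}
```

After reviewing the result with `diff`, the copy can replace the original (keep a backup). Only the logging call changes, so the messages still carry the `[DEBUG]` prefix in the log.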

This is a 64-bit host running Fedora 33.

Yes, each host which has the meminfo plugin running reports the same problems. Only the number of errors varies slightly between each host.

We have not noticed any other errors or warnings related to the meminfo plugin.

Here are the results for one host after making the DEBUG/WARN changes above.

2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_Acpi.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_anon.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_biovec.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_btrfs.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_caches.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_dma.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_dmaengine.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_ext4.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_jbd2.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_kmalloc.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_kmem.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_network.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_other.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_proc.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_request.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_skbuff.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_task.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [DEBUG] Protocol exception: unrecognized line 'slab_size_summ_xfs.info ' from meminfo on arcade/107.155.66.2/4949.
2020/12/16 14:29:53 [WARNING] 18 lines had errors while 2035 lines were correct in data from 'config meminfo' on arcade/107.155.66.2/4949
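All of the rejected lines above share one shape: a field attribute followed by a space and no value. A rough way to list such lines directly from the plugin output is sketched below. The function name `find_empty_fields` is hypothetical and the pattern is only an approximation of munin's field syntax.

```sh
# Hypothetical helper: read plugin config (or fetch) output on stdin and
# print every "<field>.<attribute>" that is not followed by a value.
find_empty_fields() {
    awk 'NF == 1 && $1 ~ /^[A-Za-z_][A-Za-z0-9_]*\.[a-z_]+$/ { print $1 }'
}
```

For example, `munin-run meminfo config | find_empty_fields` should print names such as `slab_size_summ_Acpi.info` if the plugin emits them without a value.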

Thanks for the result!

The line errors above are probably not related to the numeric problem, but I would like to fix them anyway. Please share the output of munin-run meminfo config. Then I should be able to fix them.

Regarding the number conversion problem: the above config dump will also help me to reproduce this.

Requested info attached.
meminfo-config.txt

I also have the following output from the ntp_kernel_pll_freq and ntp_kernel_err plugins. Neither of these plugins works at all with ntp-4.2.8p15.

2020/12/16 16:35:15 [DEBUG] Protocol exception while fetching 'ntp_kernel_err' from ntp_kernel_err on arcade/107.155.66.2:4949: unrecognized line 'ntp_err.value '
2020/12/16 16:35:15 [WARNING] 1 lines had errors while 0 lines were correct (100.00%) in data from 'fetch ntp_kernel_err' on arcade/107.155.66.2:4949

2020/12/16 16:35:13 [DEBUG] Protocol exception while fetching 'ntp_kernel_pll_freq' from ntp_kernel_pll_freq on arcade/107.155.66.2:4949: unrecognized line 'ntp_pll_freq.value '
2020/12/16 16:35:13 [WARNING] 1 lines had errors while 0 lines were correct (100.00%) in data from 'fetch ntp_kernel_pll_freq' on arcade/107.155.66.2:4949
2020/12/16 16:35:15 [DEBUG] Protocol exception while fetching 'ntp_kernel_err' from ntp_kernel_err on arcade/107.155.66.2:4949: unrecognized line 'ntp_err.value '

Still experiencing this issue. Does anyone have any ideas?

2021/02/11 19:30:18 [ERROR] In RRD: Error updating /var/lib/munin/xavier/xavier-pagetypeinfo-n0_zNormal_tMovable-fp_n0_zNormal_tMovable_o0-g.rrd: /var/lib/munin/xavier/xavier-pagetypeinfo-n0_zNormal_tMovable-fp_n0_zNormal_tMovable_o0-g.rrd: Function update_pdp_prep, case DST_GAUGE - Cannot convert '>100000' to float

I took another look at the data you provided (the output of fetch and config).
I prepared a dummy plugin emitting this content locally:

#!/bin/sh

case "${1:-fetch}" in
    fetch)
        cat /root/munin-meminfo.txt
        ;;
    config)
        cat /root/meminfo-config.txt
        ;;
esac

Here on my system munin was happy to digest this input (emitting the [WARNING] x lines had errors while y lines were correct log message - just as it did for you).
By the way: I have now fixed the handling of empty fields (4759e06179d9cf4569448bd8b114b61b551705ed), so the log noise will go down in the future.

The munin-update procedure ran successfully, the rrd files were created/updated and the graphs were drawn.
I could not find error messages in /var/log/munin/munin-cgi-graph.log or /var/log/munin/munin-graph.log.

Thus it seems that the same input leads to errors on your side and is handled without issues on my side.
Maybe we should compare our environments?

$ uname -a
Linux foo 5.10.0-3-amd64 #1 SMP Debian 5.10.13-1 (2021-02-06) x86_64 GNU/Linux

$ dpkg -l | grep -E "(munin|rrd)" | awk '{print($1, $2, $3)}'
ii librrd8:amd64 1.7.2-3+b7
ii librrds-perl:amd64 1.7.2-3+b7
ii munin 2.0.66-1
ii munin-async 2.0.66-1
ii munin-common 2.0.66-1
ii munin-doc 2.0.66-1
ii munin-node 2.0.66-1
ii munin-plugins-core 2.0.66-1
ii munin-plugins-extra 2.0.66-1
ii rrdtool 1.7.2-3+b7

Hi Lars, thanks for your help. The most obvious difference is that this is on Fedora, not Debian. Nearly the same kernel, though.

# uname -a
Linux foo 5.9.12-200.fc33.x86_64 #1 SMP Wed Dec 2 15:16:37 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qva | grep -E "(munin|rrd)"
rrdtool-perl-1.7.2-14.fc33.x86_64
rrdtool-1.7.2-14.fc33.x86_64
munin-common-2.0.65-1.fc33.noarch
munin-node-2.0.65-1.fc33.noarch
munin-apache-2.0.65-1.fc33.noarch
munin-2.0.65-1.fc33.noarch

This is from the munin node on which the plugin runs. On the munin server that collects from the munin nodes:

# uname -a
Linux foo 5.8.18-200.fc32.x86_64 #1 SMP Mon Nov 2 19:49:11 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qva | grep -E "(munin|rrd)"
munin-node-2.0.65-1.fc32.noarch
rrdtool-1.7.2-7.fc32.x86_64
munin-2.0.65-1.fc32.noarch
munin-common-2.0.65-1.fc32.noarch
rrdtool-perl-1.7.2-7.fc32.x86_64
munin-apache-2.0.65-1.fc32.noarch

Thanks for your information.
I was hoping for an old version of rrdtool :(

Maybe you could try deleting the rrd files causing these errors and see whether the problem appears again?
(I am a bit lost as to where the problem could be, so I am just guessing.)

Sadly, that didn't fix it. There also aren't any fields in /proc/pagetypeinfo greater than 100,000 that would present a problem converting to float. I do see other references to "Function update_pdp_prep, case DST_GAUGE - Cannot convert '' to float" with regard to rrdtool, so perhaps it is an rrdtool bug?
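To double-check whether the file itself ever contains a token such as `>100000`, it can be scanned for fields that start with `>`. This is a minimal sketch: the function name `find_capped_counts` is hypothetical, and reading /proc/pagetypeinfo may require root on some systems.

```sh
# Hypothetical helper: print every line of the given file (default:
# /proc/pagetypeinfo) that contains a field of the form ">digits".
find_capped_counts() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^>[0-9]+$/) { print; next }
    }' "${1:-/proc/pagetypeinfo}"
}
```

Because the counts change over time, running it repeatedly (or at the moment munin-node polls the plugin) gives a more reliable answer than a single snapshot.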

I've attached my /proc/pagetypeinfo from one of the systems here.

pagetypeinfo.txt
