riemann/riemann

[regression] InfluxDB lib should be fed consistent data type

Closed this issue · 12 comments

XANi commented

After upgrading 0.2.11 -> 0.2.14 we've started getting errors:

java.lang.RuntimeException: {"error":"partial write: field type conflict: input field \"value\" on measurement \"riemann\" is type integer, already exists as type float dropped=3"}

I've noticed that when some of the counters are zero, they are sent to InfluxDB as integers (the influxdb lib generates the value with i at the end of it); then, when they increase, they are sent as floats:

riemann,host=hal1,plugin_instance=longterm,type=gauge,type_instance=accepted value=0.199541055572184 1503596351000000000
riemann,host=hal1,plugin_instance=longterm,type=gauge,type_instance=completed value=0.199541055572184 1503596351000000000
riemann,host=hal1,plugin_instance=longterm,type=gauge,type_instance=rejected value=0i 1503596351000000000

Those are just internal Riemann counters wrapped in (fixed-time-window 10 (smap folds/mean) graph-default), where graph-default is just my helper that calls influxdb with the right options.
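To make the conflict concrete: in InfluxDB line protocol an integer field value carries a trailing i and a float field value does not, and the error above shows an integer write being rejected because the field already exists as a float. A minimal sketch of that rendering rule follows. render-field is a hypothetical helper written for illustration; it is not the code used by Riemann or by the influxdb-java client.

```clojure
;; Hypothetical helper, for illustration only: renders a numeric field value
;; the way InfluxDB line protocol distinguishes integers from floats.
(defn render-field
  [v]
  (cond
    (integer? v) (str v "i")       ; Long, Integer, BigInt -> integer field
    (number? v)  (str (double v))  ; any other number -> float field
    :else        (throw (IllegalArgumentException. (str "not a number: " v)))))

(render-field 0N)   ; => "0i"  (a BigInt zero becomes an integer field)
(render-field 0.0)  ; => "0.0" (a double zero stays a float field)
```

With a rule like this, a series that is usually a double but occasionally a BigInt zero alternates between value=0.199... and value=0i, which matches the lines shown above.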

We've also had a different error (although that one doesn't show up as reliably):

unable to parse 'protocols,aggregate=max,host=efikom116.non.3dart.com,plugin_instance=Tcp,type=protocol_counter,type_instance=CurrEstab value=� 1503596468000000000': invalid boolean\n

which seems to be related to influxdata/influxdb-java#39

but on the wire (tcpdump) it looks like this:

 protocols,aggregate=mean,host=efikom116.non.3dart.com,plugin_instance=TcpExt,type=protocol_counter,type_instance=TCPTimeouts value=... 1503597973000000000

value morphed to ... somehow...

Curiously enough, that one happened only when I restarted the source of events (collectd) and it sent all plugin output at once. It stopped when I (re-)added batching (via (batch 10000 10 graph)).

@mcorbin Any thoughts?

edit : i was wrong, new explanations soon

(I was wrong in my previous message.)

Hi,

I completely refactored the influxdb stream in Riemann 0.2.13. Before 0.2.13, Riemann used its own method to construct the influxdb messages. Now, Riemann uses the official influxdb java client.

I was able to reproduce the first error (but inverted, float instead of int).

Riemann uses this function (deprecated btw) to construct the Influxdb Point object.

As you can see, the value is always converted to double.

Before Riemann 0.2.13, it was the same thing (cf here):

riemann.bin> (clojure.pprint/cl-format nil "~F" 0)
"0.0"

In Riemann, the :metric field for async-queue rejected rate is calculated like this:

:metric  (/ drejected dtime)

With basically drejected = 0 when you don't have rejected events.

But...

riemann.service> (/ 0 (unix-time))
0N
riemann.service> (type (/ 0 (unix-time)))
clojure.lang.BigInt
riemann.service> (instance? BigInteger (/ 0 (unix-time)))
false

=> The field is not converted to float.

We should probably convert this field to something else in Riemann (double ?).
As a workaround, you can convert it yourself using smap on rejected rate$ events.
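The workaround could look roughly like the following Riemann config fragment. It is a sketch under assumptions: metric->double is a hypothetical helper name, graph-default is the reporter's own helper from earlier in this thread, and the service pattern is taken from the "rejected rate$" hint above.

```clojure
;; Hypothetical helper: coerce a numeric :metric (including clojure.lang.BigInt)
;; to a double; events without a numeric metric pass through unchanged.
(defn metric->double
  [event]
  (if (number? (:metric event))
    (assoc event :metric (double (:metric event)))
    event))

(streams
  ;; Only the internal "... rejected rate" events need the coercion;
  ;; everything else goes to graph-default untouched.
  (where (service #"rejected rate$")
    (smap metric->double graph-default)
    (else graph-default)))
```

Since (double 0N) is 0.0, the rejected rate is then always written as a float field, whatever its value.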

Regarding the second error, that's strange. Do you have the exact Riemann event generating the tcpdump line ?

XANi commented

Hmm, interesting. I did wonder why it only occurred with that one event, not dozens of other 0 metrics received from collectd.

Regarding the second error, that's strange. Do you have the exact Riemann event generating the tcpdump line ?

It's not an event but a series of events. I will try to compile a test case for it. Should I make a separate ticket for it?

Hmm, interesting. I did wonder why it only occurred with that one event, not dozens of other 0 metrics received from collectd.

It's because 0 metrics from collectd are Long or Integer, and not clojure.lang.BigInt.

Should I make a separate ticket for it?

No, you can use this issue, I think ;)

XANi commented

Got it, seems to be what happens if it gets NaN as metric:

INFO [2017-08-29 07:26:51,499] defaultEventExecutorGroup-2-1 - riemann.config - {:description nil, :tags [collectd], :service protocols-TcpExt/protocol_counter-TCPOrigDataSent, :time 1503984399, :type protocol_counter, :host nuc-efikom116, :ttl 30.0, :plugin_instance TcpExt, :aggregate mean, :type_instance TCPOrigDataSent, :plugin protocols, :metric NaN}
WARN [2017-08-29 07:26:51,502] defaultEventExecutorGroup-2-1 - riemann.streams - riemann.influxdb$influxdb_deprecated$streams__9726@afdc1c8 threw
java.lang.RuntimeException: {"error":"unable to parse 'protocols-TcpExt/protocol_counter-TCPOrigDataSent,aggregate=mean,host=nuc-efikom116,plugin=protocols,plugin_instance=TcpExt,type=protocol_counter,type_instance=TCPOrigDataSent value=� 1503984399000000000': invalid boolean"}

generated by

(def graph-default
  (with {:ds_type nil :ds_index nil :state nil}
        #(info %)
        (influxdb influxdb-creds-default)))

InfluxDB doesn't seem to handle NaNs in any way.

Just dropping metrics with Double/NaN seems to fix it, as does switching collectd's StoreRates to false (StoreRates is the option that makes collectd calculate the rate of counters before sending them to Riemann). It seems that collectd, when configured to calculate rates, sends NaN for every counter it sees for the first time, because it doesn't yet have a second datapoint to calculate the rate from.
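For reference, dropping NaN metrics before they reach InfluxDB can be sketched as the Riemann config fragment below. nan-metric? is a hypothetical helper name and graph-default is the helper shown above; this is one possible way to filter, not necessarily the exact config used here.

```clojure
;; Hypothetical helper: true when the event's :metric is a numeric NaN.
(defn nan-metric?
  [event]
  (let [m (:metric event)]
    (and (number? m)
         (Double/isNaN (double m)))))

(streams
  ;; where* takes a plain predicate on the event; only events whose metric
  ;; is not NaN are passed on to graph-default, the rest are dropped.
  (where* (complement nan-metric?)
    graph-default))
```

Events with a nil metric still pass this filter, so they would need separate handling if they should not reach InfluxDB either.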

Thank you for investigating ;)
My PR #849 should fix your first issue (I convert BigInt to double in the influxdb stream).

XANi commented

Thanks. The second one seems to be an issue with InfluxDB, so I doubt it is worth fixing here (except maybe a warning in the docs that InfluxDB doesn't like NaNs).

If it's OK for you, can we close this issue ?

XANi commented

yeah sure, thanks for help :)

Uploading 16548534334235266531599330469347.jpg…
Could you help me with this? It's very important to me.

Hi @Sravan0124 whatever the image is - it did not get correctly uploaded.