[regression] InfluxDB lib should be fed a consistent data type
After upgrading Riemann from 0.2.11 to 0.2.14, we started getting errors:
java.lang.RuntimeException: {"error":"partial write: field type conflict: input field \"value\" on measurement \"riemann\" is type integer, already exists as type float dropped=3"}
I've noticed that when some of the counters are zero, they are sent to InfluxDB as integers (the InfluxDB lib generates the value with an i at the end of it). Then, when they increase, they are sent as floats:
riemann,host=hal1,plugin_instance=longterm,type=gauge,type_instance=accepted value=0.199541055572184 1503596351000000000
riemann,host=hal1,plugin_instance=longterm,type=gauge,type_instance=completed value=0.199541055572184 1503596351000000000
riemann,host=hal1,plugin_instance=longterm,type=gauge,type_instance=rejected value=0i 1503596351000000000
Those are just internal Riemann counters wrapped in (fixed-time-window 10 (smap folds/mean) graph-default)
(graph-default is just my helper that calls influxdb with the right options).
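To make the type conflict concrete, here is a minimal sketch of how a numeric field value is distinguished in InfluxDB line protocol: integer values carry a trailing i, float values do not. A series that receives a zero integer first and a fractional value later therefore ends up with two field types. format-field-value is a hypothetical helper written for illustration only; it is not the influxdb-java client's code, and it deliberately ignores strings, booleans and NaN.

```clojure
;; Hypothetical helper, for illustration only (not Riemann or influxdb-java code):
;; renders a numeric field value using line protocol's integer/float distinction.
(defn format-field-value
  [v]
  (cond
    ;; integer? is true for Long, Integer, BigInteger and clojure.lang.BigInt (e.g. 0N);
    ;; line protocol marks integers with a trailing "i"
    (integer? v) (str v "i")
    ;; float? is true for Double and Float; floats are written as plain decimals
    (float? v) (str (double v))
    ;; ratios, BigDecimals, strings, booleans etc. are out of scope for this sketch
    :else (throw (IllegalArgumentException.
                  (str "unsupported field value type: " (type v))))))

;; The same series can end up with two different field types:
(println (format-field-value 0N))                 ; prints 0i
(println (format-field-value 0.199541055572184))  ; a float, written without "i"
```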
We've also had a different error (although that one doesn't show up as reliably):
unable to parse 'protocols,aggregate=max,host=efikom116.non.3dart.com,plugin_instance=Tcp,type=protocol_counter,type_instance=CurrEstab value=� 1503596468000000000': invalid boolean\n
which seems to be related to influxdata/influxdb-java#39
but on the wire (tcpdump) it looks like this:
protocols,aggregate=mean,host=efikom116.non.3dart.com,plugin_instance=TcpExt,type=protocol_counter,type_instance=TCPTimeouts value=... 1503597973000000000
The value somehow morphed into ...
Curiously enough, that one happened only when I restarted the source of events (collectd) and it sent all plugin output at once, and it stopped when I (re-)added batching (via (batch 10000 10 graph)).
Edit: I was wrong; new explanations soon.
(I was wrong in my previous message.)
Hi,
I completely refactored the influxdb stream in Riemann 0.2.13. Before 0.2.13, Riemann used its own method to construct the InfluxDB messages. Now, Riemann uses the official InfluxDB Java client.
I was able to reproduce the first error (but inverted, float instead of int).
Riemann uses this function (deprecated, by the way) to construct the InfluxDB Point object.
As you can see, the value is always converted to double.
Before Riemann 0.2.13, it was the same thing (cf. here):
```clojure
riemann.bin> (clojure.pprint/cl-format nil "~F" 0)
"0.0"
```
In Riemann, the :metric field for the async-queue rejected rate is calculated like this:
:metric (/ drejected dtime)
Basically, drejected = 0 when you don't have any rejected events.
But...
```clojure
riemann.service> (/ 0 (unix-time))
0N
riemann.service> (type (/ 0 (unix-time)))
clojure.lang.BigInt
riemann.service> (instance? BigInteger (/ 0 (unix-time)))
false
```
=> The field is not converted to float: a clojure.lang.BigInt is not a java.math.BigInteger.
We should probably convert this field to something else in Riemann (a double?).
As a workaround, you can convert it yourself using smap on the rejected rate$ events.
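A minimal sketch of that workaround, assuming events are Clojure maps or Riemann event records with a :metric key. metric->double is a hypothetical name, and coercing every numeric metric (not only the rejected rate) is a design choice that matches the intent that the influxdb stream always send doubles.

```clojure
;; Hypothetical helper: coerce a numeric :metric to a double so every point of a
;; series reaches InfluxDB with the same field type (0N becomes 0.0).
(defn metric->double
  [event]
  (let [m (:metric event)]
    (if (number? m)
      (assoc event :metric (double m))
      ;; events without a numeric metric (e.g. nil) pass through unchanged
      event)))

;; Possible wiring in riemann.config (assumption), placed in front of the
;; graph-default helper from this thread:
;; (smap metric->double graph-default)
```

Note that this only fixes the type; a NaN metric stays NaN after (double ...).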
Regarding the second error, that's strange. Do you have the exact Riemann event generating the tcpdump line?
Hmm, interesting. I did wonder why it only occurred with that one event and not with dozens of other 0 metrics received from collectd.
> Regarding the second error, that's strange. Do you have the exact Riemann event generating the tcpdump line?
It's not an event but a series of events. I will try to put together a test case for it. Should I make a separate ticket for it?
> Hmm, interesting. I did wonder why it only occurred with that one event and not with dozens of other 0 metrics received from collectd.
It's because 0 metrics from collectd are Long or Integer, and not clojure.lang.BigInt.
> Should I make a separate ticket for it?

No, you can use this issue, I think ;)
Got it: it seems to be what happens when it gets NaN as a metric:
INFO [2017-08-29 07:26:51,499] defaultEventExecutorGroup-2-1 - riemann.config - {:description nil, :tags [collectd], :service protocols-TcpExt/protocol_counter-TCPOrigDataSent, :time 1503984399, :type protocol_counter, :host nuc-efikom116, :ttl 30.0, :plugin_instance TcpExt, :aggregate mean, :type_instance TCPOrigDataSent, :plugin protocols, :metric NaN}
WARN [2017-08-29 07:26:51,502] defaultEventExecutorGroup-2-1 - riemann.streams - riemann.influxdb$influxdb_deprecated$streams__9726@afdc1c8 threw
java.lang.RuntimeException: {"error":"unable to parse 'protocols-TcpExt/protocol_counter-TCPOrigDataSent,aggregate=mean,host=nuc-efikom116,plugin=protocols,plugin_instance=TcpExt,type=protocol_counter,type_instance=TCPOrigDataSent value=� 1503984399000000000': invalid boolean"}
generated by:
```clojure
(def graph-default
  (with {:ds_type nil :ds_index nil :state nil}
    #(info %)
    (influxdb influxdb-creds-default)))
```
InfluxDB doesn't seem to handle NaNs in any way.
Just dropping metrics with Double/NaN seems to fix it, as does setting collectd's StoreRates option to false (StoreRates makes collectd calculate the rate of counters before sending them to Riemann). It seems that collectd, when configured to calculate rates, sends NaN for every counter it sees for the first time and doesn't yet have a second data point to calculate a rate from.
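Dropping NaN metrics before they reach the influxdb stream can be sketched as a predicate for Riemann's where* stream. non-nan-metric? is a hypothetical name, and the sketch only rejects NaN and non-numeric (e.g. nil) metrics; it does not try to cover any other value InfluxDB might refuse.

```clojure
;; Hypothetical predicate: true only when :metric is a number other than NaN.
(defn non-nan-metric?
  [event]
  (let [m (:metric event)]
    (and (number? m)
         (not (Double/isNaN (double m))))))

;; Possible wiring in riemann.config (assumption): drop NaN metrics before the
;; graph-default helper from this thread sends them to InfluxDB.
;; (where* non-nan-metric? graph-default)
```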
Thank you for investigating ;)
My PR #849 should fix your first issue (I convert BigInt to double in the influxdb stream).
Thanks. The second one seems to be an issue with InfluxDB, so I doubt it's worth fixing here (except maybe with a warning in the docs that InfluxDB doesn't like NaNs).
If it's OK with you, can we close this issue?
Yeah, sure. Thanks for the help :)
Hi @Sravan0124, whatever the image is, it did not get uploaded correctly.