influxdata/flux

Add previousN and nextN to range

pauldix opened this issue ยท 12 comments

It's very common for users to want to get a range of data, but to also pull back some number of points before or after the time range specified. For example, query from 2018-06-15T11:47:00 to now and include the 3 points that occur before the start time. The default values for previousN and nextN should be 0. The function signature would look like:

range = (start, stop=now(), previousN=0, nextN=0)

This would be a great addition and it would solve issue#6878 as mentioned by @vdwpsmt. Without it storing data with irregular intervals (think: events) using influxdb makes no sense.

openHAB is using following as workaround:

from(bucket:"openhab_db/autogen")
	|> range(start:-100y, stop:2021-10-23T22:00:00.000000000Z)
	|> filter(fn: (r) => r["_measurement"] == "ElectricMeter_Total")
	|> sort(desc:true, columns:["_time"])
	|> limit(n:1, offset:0)

start:-100y ๐Ÿ‘Ž
You can imagine the performance on large databases -> timeout.

Using the following influxdb1 query was much more performant:

SELECT "value"::field,"item"::tag FROM autogen.ElectricMeter_Total WHERE time <= '2021-10-23T22:00:00Z' ORDER BY time DESC LIMIT 1;

Suppose the size of the "ElectricMeter_Total" measurement is 1GB for the last 100y. Would the query above indeed have to browse through and sort 1GB of data before getting to the result?? That doesn't sound very efficient. Surely there is some optimization build in InfluxDB that prevents such inefficiencies?

This feature request would also help to fix the values that difference/derivative return. See #5174

I have tried to solve this with:

prefix = from(bucket: "Home")
  |> range(start: 0, stop: v.timeRangeStart)
  |> filter(fn: (r) => r["_measurement"] == "Zigbee")
  |> filter(fn: (r) => r["_field"] == "Temperature")
  |> last()
  |> map(fn: (r) => ({r with _time: v.timeRangeStart}))
data_range = from(bucket: "Home")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "Zigbee")
  |> filter(fn: (r) => r["_field"] == "Temperature") 


union(tables: [prefix, data_range])   
  |> group(columns:["Device", "EndPoint","_field","_measurement"])

But notice the map(...) after last(), it's invalid because v.timeRangeStart is a time , and for some reason flux expects it to be a duration:

 type error @13:24-13:34: expected duration but found time

So asside of that it's wildly inefficient to go through the last 50+ years of data (1970 - 2202), it also doesn't work.
If i omit the map(), the value is timestamped the last day, month or even year the value changed before the selected range. and most visualization tools (grafana, flux dashboards), include it inside the view.

And if window fuctions were to be used after the group(...), they'd generate a lot of window segements in that inteval before timeRangeStart

So there really needs to be an efficient way to include the previous value in the query. And the proposed change here looks to be the most efficient one.

@SGStino
Your solution is not half bad I think.
If you supply a measurement AT LEAST every 24h then you can replace the start of "0" by "-24h".
That would bring the performance to an acceptable level.

I want to mention that I made a time-series database myself (before starting to use InfluxDB).
I ran into the same problem but the file structure was organized in a way that just by taking the previous record you could find the last stored value. I have no idea how InfluxDB data is stored, but could it be that simple in InfluxDB too?

It's a workaround that works for some cases in grafana: if you select absolute timestamps, and the range isn't to big, because window functions can't be used.

I guess this should work with the 24h:

import "date"

prefix = from(bucket: "Home")
  |> range(start: date.sub(d: 24h, from: v.timeRangeStart), stop: v.timeRangeStart)
  |> filter(fn: (r) => r["_measurement"] == "Zigbee")
  |> filter(fn: (r) => r["_field"] == "Temperature")
  |> last()
  |> map(fn: (r) => ({r with _time: v.timeRangeStart}))
data_range = from(bucket: "Home")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "Zigbee")
  |> filter(fn: (r) => r["_field"] == "Temperature") 

union(tables: [prefix, data_range])   
  |> group(columns:["Device", "EndPoint","_field","_measurement"])

however it seems to complain about "sub" from the package date:

"invalid: type error @4:19-4:23: record is missing label sub"

EDIT: nevermind, this works great (and fast) in 2.4, sub didn't exist in 2.0 and i didn't upgrade my test instance ...

I will definitely try that.
"...that works for some cases in grafana: if you select absolute timestamps, and the range isn't to big, because window functions can't be used." --> I don't understand this part. Can you explain this a bit more?

when you set the timerange to now-1d to now in the datepicker , the v.timeRangeStart contains -1d, which is a duration, not a time. and you can't set durations in time fields (e.g. _time = -1d)

But I seem to have been mistaken, Grafana doesn't do this, it's the Influx's built-in dashboard.
Grafana always uses absolute timestamps, even for relative ranges.

Is there a reason to map the last value to the beginning of the timerange?
Grafana seems to be able to consider points outside the graph and seems to draw the lines correct in my tests.

But maybe there's a limitation how far outside a value can be to be still used for the drawing engine ...

Btw. this is also useful with date.add on the timeRangeStop when looking at time ranges that doesn't stop at now().

One last point that needs to be solved to satisfy my perfect self:
Any ideas how to add a value at "now"( if the timespan is including it) with the last value?

This was quite easy in SQL and I start regretting my switch over to Flux. :P

Ok, that way it works for me.

For the sake of simplicity, I load the complete data of a few hours before and after the current timespan instead using multiple queries.
How long this is has to be adjusted on the sensor data. My temperature sensors report quite often, so I used 6h in this example.

I also requested data after the timespan in case you scrolled back in Grafana, to fill the end of the graph.

And at the end, I look for the last value that the sensor sent, and move it to "now()".
This way the lines get extended to the current moment if it is in the visible range.

import "date"

history = from(bucket: "homeassistant")
  |> range(start: date.sub(d: 6h, from: v.timeRangeStart), stop: date.sub(d: -6h, from: v.timeRangeStop))
  |> filter(fn: (r) => r["friendly_name"] == "Temperatur Wohnzimmer")
  |> filter(fn: (r) => r["_field"] == "value")

extend = from(bucket: "homeassistant")
  |> range(start: date.sub(d: 6h, from: now()), stop: now())
  |> filter(fn: (r) => r["friendly_name"] == "Temperatur Wohnzimmer")
  |> filter(fn: (r) => r["_field"] == "value")
  |> last()
  |> map(fn: (r) => ({r with _time: now()}))

union(tables: [history, extend])
  |> drop(columns: ["_start", "_stop"])