Internal aggregation is using incorrect from and until unix timestamps when maxDataPoint is used
iankosen opened this issue · 7 comments
I am noticing a potential problem with the values that are used for from and until when maxdatapoints is provided. Below is the query I am running and my analysis of what is happening.
I am sending the query from carbon-api. The query is:
http://localhost:8081/render?format=json&target=metric_name&from=-30d&maxDataPoints=1
The From and Until values resolve to 1694837820(Sat Sep 16 2023 04:17:00 GMT+0000)
and 1697429820(Mon Oct 16 2023 04:17:00 GMT+0000)
respectively.
In this section we recompute from and until so that it's divisible by step
graphite-clickhouse/render/data/query.go
Line 349 in 1db4b3d
After the calculation is done for the graphite query above the from and until values are updated to 1695168000(Wed Sep 20 2023 00:00:00 GMT+0000)
and 1697759999(Thu Oct 19 2023 23:59:59 GMT+0000)
respectively. The value of step is 2592000(one month)
.
From the above we can see that the from and until values are off by about 4days. This leads to incorrect values being aggregated. The problem gets worse as the range between from and until increases.
The only reason for this I see it's the logical error with the storage retention. Data point precision must be 1day or less
precision – How precisely to define the age of the data in seconds. Should be a divisor for 86400 (seconds in a day).
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/graphitemergetree#patterns
The retention policy I am using is:
<clickhouse>
<pattern>
<regexp>regex_exp</regexp>
<function>sum</function>
<retention>
<age>0</age>
<precision>15</precision>
</retention>
<retention>
<age>108000</age>
<precision>30</precision>
</retention>
<retention>
<age>3456000</age>
<precision>300</precision>
</retention>
<retention>
<age>8640000</age>
<precision>1800</precision>
</retention>
<retention>
<age>34128000</age>
<precision>3600</precision>
</retention>
</pattern>
</clickhouse>
All the precision values are less than a day and are a divisor for 86400.
Then we need to investigate this place https://github.com/go-graphite/graphite-clickhouse/blob/1db4b3d740c0f3e72ed8a3449603f4a016b5dda8/render/data/query.go#L147C9-L147C9
It calculates the common step before the from
/until
is set, so let's add some trace logging there.
Also, what does the query in CH return?
SELECT * FROM system.graphite_retentions
Do you use auto
for rollup-config?
I've totally missed the initial query, sorry.
http://localhost:8081/render?format=json&target=metric_name&from=-30d&maxDataPoints=1
You get what you requested. Using maxDataPoints=100
will change it to another step.
Then we need to investigate this place https://github.com/go-graphite/graphite-clickhouse/blob/1db4b3d740c0f3e72ed8a3449603f4a016b5dda8/render/data/query.go#L147C9-L147C9
It calculates the common step before the
from
/until
is set, so let's add some trace logging there.Also, what does the query in CH return?
SELECT * FROM system.graphite_retentions
Do you use
auto
for rollup-config?
I use rollup-config = auto. I dont think the rollup config has an issue. The rStep it gets here is always correct.
I've totally missed the initial query, sorry.
http://localhost:8081/render?format=json&target=metric_name&from=-30d&maxDataPoints=1
You get what you requested. Using
maxDataPoints=100
will change it to another step.
That is correct. Calculating the step works as expected. The only issue I have is the update it does to the from and until dates when maxDataPoints is used. You can see from the first message the from and until values for the query are 4 days ahead after setFromUntil is executed. The issue is more obvious when Until - From is large (recommend at least 30days) and maxDataPoints is small e.g 1
c.From = 1694837820
c.Until = 1697429820
c.step = 2592000
# computed from and until in setFromUntil function will be:
c.from = 1695168000 # (four days after original c.From)
c.until = 1697759999 # (four days after original c.Until)
This causes the clickhouse to do an aggregation over the wrong time range.
It's aligned as expected. Totally correct result.
In [4]: 1694837820 // 2592000
Out[4]: 653 # the timestamp 1692576000 is out of the range originalFrom:originalUntil
In [5]: 2592000 * 654 # to get the ANY point after the original From
Out[5]: 1695168000
In [6]: 2592000 * 655
Out[6]: 1697760000 # that's Until+1, to not get the second point
The purpose of the maxDataPoints is to limit the points. So the step, from, and until are calculated accordingly.
If you feel it's broken, it's not.
I can continue to describe how it works and why, but the behavior itself is correct and aligned between graphite-clickhouse and carbonapi
The calculation of step/from/until
worked like this for 3 years and was reviewed by other maintainers. And it's tested by many people and companies. Changing it will break the world.