jeffpierce/cassabon

Metric Manager: Cassabon does not return nulls when a stat is missing from a sequence

Closed this issue · 10 comments

This, more than anything else, may be the cause of the graph issues we're seeing. Improve the query to grab timestamps as well as stats, and insert nulls at steps where there is no stat.
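For illustration, here's a minimal sketch of the gap-filling idea in Go, using hypothetical types (row, fillNulls) rather than Cassabon's actual query path: bucket the rows the query returns into fixed steps, and leave nil wherever no stat landed.

package main

import (
	"fmt"
	"time"
)

type row struct {
	ts    time.Time
	value float64
}

// fillNulls buckets rows into [start, end) at the given step. Slots with
// no row stay nil, which the API layer can serialize as null. If two rows
// land in the same slot (the partial-entry problem below), the later one
// silently wins here.
func fillNulls(rows []row, start, end time.Time, step time.Duration) []*float64 {
	out := make([]*float64, int(end.Sub(start)/step))
	for _, r := range rows {
		if r.ts.Before(start) || !r.ts.Before(end) {
			continue
		}
		v := r.value
		out[int(r.ts.Sub(start)/step)] = &v
	}
	return out
}

func main() {
	start := time.Date(2015, 10, 15, 22, 25, 0, 0, time.UTC)
	rows := []row{{start, 45.41}, {start.Add(2 * time.Minute), 58.61}}
	for i, v := range fillNulls(rows, start, start.Add(3*time.Minute), time.Minute) {
		if v == nil {
			fmt.Printf("step %d: null\n", i)
		} else {
			fmt.Printf("step %d: %.2f\n", i, *v)
		}
	}
}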

I have a PR in progress that fixes this issue, but there is a related issue that needs resolution.

Given the 60-second series:

row:    45.40740741 2015-10-15 22:25:00 +0000 UTC
row:    77.40000000 2015-10-15 22:25:12.526 +0000 UTC
row:    40.00000000 2015-10-15 22:25:31.654 +0000 UTC
row:    58.60629921 2015-10-15 22:26:00 +0000 UTC

Cassabon was hard restarted twice in the interval between 22:25:00 and 22:26:00. A partial entry for 22:26:00 was written at each restart, as well as the expected entry on the minute mark.

If we leave these extra entries, we skew the time. If we remove them, we drop data.

Possible solutions:

  • Stop writing out partial entries on restarts
  • Drop partial entries after the first one seen, including the end-of-interval partial entry
  • Make an attempt to coalesce the data, using the rollup definitions

Any preferences?

If we can coalesce the data from existing data + rollup definitions, that's probably the cleanest way to handle it.

I'm also okay with just not writing out partial entries on restarts, and "re-bootstrapping" the longer rollup windows on startup from the shorter ones we've already written to the database.

Heck, we could just dump everything to a state file on a restart/SIGHUP, and reload from that if we had to.
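As a rough sketch of what that could look like, with hypothetical names throughout (bucket, saveState, loadState) and encoding/gob as just one convenient serializer:

package state

import (
	"encoding/gob"
	"os"
)

// bucket is a stand-in for an in-flight rollup accumulator.
type bucket struct {
	Sum, Min, Max float64
	Count         uint64
}

// saveState dumps every in-flight bucket on restart/SIGHUP.
func saveState(path string, buckets map[string]bucket) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return gob.NewEncoder(f).Encode(buckets)
}

// loadState restores the buckets on startup, so no partial entries ever
// need to be written to the database.
func loadState(path string) (map[string]bucket, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var buckets map[string]bucket
	err = gob.NewDecoder(f).Decode(&buckets)
	return buckets, err
}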

To be clear, we only generate partial entries when we do a hard restart, or the peer list changes. SIGHUP does not cause partial entries.

Sum, min, max, and count are easy to coalesce.

Average is hard, because we no longer have the number of entries averaged. Taking the mean of two averages could be wildly wrong.
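A minimal sketch of why, assuming a hypothetical Bucket accumulator: the first four aggregations merge exactly, but an average can only be recomputed if the count survived.

package rollup

import "math"

// Bucket holds the aggregations for one rollup interval. Both inputs to
// Merge are assumed non-empty.
type Bucket struct {
	Sum, Min, Max float64
	Count         uint64
}

// Merge combines two partial buckets into what one uninterrupted bucket
// would have held; each field has an exact, order-independent merge.
func Merge(a, b Bucket) Bucket {
	return Bucket{
		Sum:   a.Sum + b.Sum,
		Min:   math.Min(a.Min, b.Min),
		Max:   math.Max(a.Max, b.Max),
		Count: a.Count + b.Count,
	}
}

// Avg is exact only because Count survived; from two stored averages
// alone, no correct merge exists.
func (b Bucket) Avg() float64 {
	if b.Count == 0 {
		return 0
	}
	return b.Sum / float64(b.Count)
}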

For the smallest rollup in a series, I think we just have to accept that we lose that data. This should be seconds anyway, and insignificant in the long run.

For the rollups after that, we can always rebuild the buckets from the smaller rollups, which would give us a count and the stats needed to compute an average, along with the other aggregations.

Reading from the smaller rollups still loses the count. Assume you have entries accumulated from the following data:

Avg(1, 1, 1, 1, 1, 20, 1) = 3.71
Avg(1, 20) = 10.5
Avg(3.71, 10.5) = 7.11

But
Avg(1, 1, 1, 1, 1, 20, 1, 1, 20) = 5.22

No matter what, we get an anomaly. I suppose we could special-case averages, and throw out partial entries (show zero instead of garbage). The rest we can deal with more or less accurately.
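To make that concrete, here's a tiny sketch (hypothetical helper only) contrasting the naive mean of means with the weighted mean we could compute if the counts had survived:

package main

import "fmt"

func mean(xs []float64) float64 {
	var sum float64
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	a := []float64{1, 1, 1, 1, 1, 20, 1} // first partial entry
	b := []float64{1, 20}                // second partial entry

	naive := (mean(a) + mean(b)) / 2 // counts lost: ~7.11, garbage
	weighted := (mean(a)*float64(len(a)) + mean(b)*float64(len(b))) /
		float64(len(a)+len(b)) // counts kept: ~5.22, exact

	fmt.Printf("naive=%.2f weighted=%.2f true=%.2f\n",
		naive, weighted, mean(append(a, b...)))
}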

That works for me.

This is still outstanding today, Tuesday Oct 20.

For reference while reasoning about this, here's the current rollup configuration:

rollups:
    default:
        retention:
            - 6s:6h
            - 1m:7d
            - 1h:30d
            - 6h:365d
        aggregation: average

I think it's safe to say that managing partial entries for six-second intervals (retained for 6 hours) is both difficult and meaningless, as six seconds is pretty close to how long a restart might take anyway (incurring the loss of one interval for sure). So for short intervals, the right thing to do is to refrain from writing out the partial entries.

At the one-minute interval, it would be reasonable to start combining partial entries made before and after the hard restart, and end up with reasonably representative values. The exception is the calculation of averages, so the best rule might be to never write partial entries for averages.
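A sketch of that rule, assuming hypothetical names (Method, shouldWritePartial) and an assumed one-minute cutoff:

package rollup

import "time"

type Method int

const (
	Average Method = iota
	Sum
	Min
	Max
	Count
)

// shouldWritePartial decides whether a partial entry produced by a hard
// restart or peer-list change is worth persisting: never for very short
// intervals (a restart loses the whole window anyway), and never for
// averages (the count is gone, so the stored value would be garbage).
func shouldWritePartial(interval time.Duration, method Method) bool {
	if interval < time.Minute {
		return false
	}
	return method != Average
}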

Average calculation note: The code should be changed to do the average calculation only once, at write time.
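Something like this minimal sketch (hypothetical names): the accumulator keeps only a running sum and count, and the division happens exactly once, at flush.

package rollup

type avgAccumulator struct {
	sum   float64
	count uint64
}

// Add records one sample; no division happens here.
func (a *avgAccumulator) Add(v float64) {
	a.sum += v
	a.count++
}

// Flush computes the average exactly once, at write time; the second
// return value is false when the interval saw no samples.
func (a *avgAccumulator) Flush() (float64, bool) {
	if a.count == 0 {
		return 0, false
	}
	return a.sum / float64(a.count), true
}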

Resolved by ad8e30e