go-graphite/carbon-clickhouse

Change to suggested schema?

mzealey opened this issue · 15 comments

For graphite table, switching to the following moves us from 5 bytes/row to 0.6 bytes/row for a large dataset:

    `Path` LowCardinality(String),
    `Value` Float64,
    `Time` UInt32 Codec(DoubleDelta, LZ4),
    `Date` Date Codec(DoubleDelta, LZ4),
    `Timestamp` UInt32 Codec(DoubleDelta, LZ4)

toYYYYMMDD() might be a better partitioning key for large volumes of data

ClickHouse/ClickHouse#12144 (comment) also suggests removing the Date col - perhaps this could be an option (although when compressed as above only takes a very small amount of space)

I also got similar results when experimenting with compression and low cardinality, but I do not think it's universal enough and need to be recommended as generic solution.
Same for partitioning key, it's really depends on use case.
For Data column it's also very confusing, theoretically you need do exact opposite thing - get rid of Timestamp with zero-timestamp=true and use Date/Time for queries.

I'd have thought other than partitioning the suggested changes above should benefit everyone as Path will be repeated a lot and Time/Date/Timestamp are all usually increasing and with lots of duplicates

I mean, I think it's not possible to remove Date, it's used by graphite-clickhouse. You can zero Timestamp and have mostly same effect with is documented and recommended way. You can't remove Timestamp because it's required (maybe not anymore?) by Graphite Merge Tree engine.

See this extract from one of my servers:

SELECT
    column,
    any(type),
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    sum(rows)
FROM system.parts_columns
WHERE (table = 'graphite_data') AND active
GROUP BY column
ORDER BY column ASC

┌─column────┬─any(type)─┬─compressed─┬─uncompressed─┬──sum(rows)─┐
│ DateDate69.84 MiB  │ 12.24 GiB    │ 6569285540 │
│ Path      │ String    │ 4.60 GiB   │ 1.04 TiB     │ 6569285540 │
│ Time      │ UInt32    │ 17.00 GiB  │ 24.47 GiB    │ 6569285540 │
│ Timestamp │ UInt32    │ 19.38 GiB  │ 24.47 GiB    │ 6569285540 │
│ Value     │ Float64   │ 6.40 GiB   │ 48.94 GiB    │ 6569285540 │
└───────────┴───────────┴────────────┴──────────────┴────────────┘

The Date column has not much impact. The Path might be useful to have as LowCardinality, although it only has big impact in uncompressed format.

I just found out the setting for zero-timestamp, I will change it so that Timestamp will only contain zero value. And will check and compare again later. There is not much documentation on this setting, so I'm wondering why it is enabled by default and why it is needed?

Be aware, that LowCardinality impacts the size of the marks. Since it doubles it for Path, I've had a lot of OOMs during experiments with it. In ClickHouse telegram chat people have suggested me to use ZSTD, and it works quite well too. Here's my own experiments' final state:

ATTACH TABLE data_lr
(
    `Path` String CODEC(ZSTD(3)),
    `Value` Float64 CODEC(Gorilla, LZ4),
    `Time` UInt32 CODEC(DoubleDelta, LZ4),
    `Date` Date CODEC(DoubleDelta, LZ4),
    `Timestamp` UInt32 CODEC(DoubleDelta, LZ4)
)
ENGINE = ReplicatedGraphiteMergeTree('/clickhouse/tables/graphite.data_lr/{shard}', '{replica}', 'graphite_rollup')
PARTITION BY toYYYYMMDD(Date)
ORDER BY (Path, Time)
SETTINGS index_granularity = 256

And here's the data on the host result:

SELECT
    column,
    any(type),
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    sum(rows)
FROM system.parts_columns
WHERE (table = 'data_lr') AND active
GROUP BY column
ORDER BY column ASC

┌─column────┬─any(type)─┬─compressed─┬─uncompressed─┬───sum(rows)─┐
│ DateDate63.59 MiB  │ 48.97 GiB    │ 26289415098 │
│ Path      │ String    │ 9.01 GiB   │ 1.54 TiB     │ 26289415098 │
│ Time      │ UInt32    │ 980.84 MiB │ 97.94 GiB    │ 26289415098 │
│ Timestamp │ UInt32    │ 4.29 GiB   │ 97.94 GiB    │ 26289415098 │
│ Value     │ Float64   │ 60.51 GiB  │ 195.87 GiB   │ 26289415098 │
└───────────┴───────────┴────────────┴──────────────┴─────────────┘

@Hipska what do you use for the Value column?

4 months later and I have seen that compress ratio for column Timestamp has gone from 4-5 up to 220+

So again, why is zero-timestamp not set by default? Why is it useful to have an actual timestamp in it? I don't see the use-case here.

@Felixoid I didn't do anything special to the datamodel, I only added a TTL based on the Date column.

The Timestamp column is used internally by GraphiteMergeTree as Version column during the rollup process. I'm afraid, I already don't remember details of how it's used, but the doc tells:

Version of the metric. Data type: any numeric. ClickHouse saves the rows with the highest version or the last written if versions are the same. Other rows are deleted during the merge of data parts.

Here's a comment from the code https://github.com/ClickHouse/ClickHouse/fc42851/master/src/Processors/Merges/Algorithms/GraphiteRollupSortedAlgorithm.h#L78-L90

    /* | path | time | rounded_time | version | value | unmodified |
     * -----------------------------------------------------------------------------------
     * | A    | 11   | 10           | 1       | 1     | a          |                     |
     * | A    | 11   | 10           | 3       | 2     | b          |> subgroup(A, 11)    |
     * | A    | 11   | 10           | 2       | 3     | c          |                     |> group(A, 10)
     * ----------------------------------------------------------------------------------|>
     * | A    | 12   | 10           | 0       | 4     | d          |                     |> Outputs (A, 10, avg(2, 5), a)
     * | A    | 12   | 10           | 1       | 5     | e          |> subgroup(A, 12)    |
     * -----------------------------------------------------------------------------------
     * | A    | 21   | 20           | 1       | 6     | f          |
     * | B    | 11   | 10           | 1       | 7     | g          |
     * ...
     */

I don't think setting zero-timestamp=true by default is a good idea. The much more valuable change is using proper codecs, like Timestamp UInt32 CODEC(DoubleDelta, LZ4). As you see, it helps significantly to store the data more effectively.

SELECT
    column,
    any(type),
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    sum(column_data_uncompressed_bytes) / sum(column_data_compressed_bytes) AS ratio,
    sum(rows)
FROM system.parts_columns
WHERE ((table = 'data_lr') AND (database = 'graphite')) AND active
GROUP BY column
ORDER BY column ASC

┌─column────┬─any(type)─┬─compressed─┬─uncompressed─┬──────────────ratio─┬───sum(rows)─┐
│ DateDate59.17 MiB  │ 45.58 GiB    │   788.81312542812124468984334 │
│ Path      │ String    │ 9.42 GiB   │ 1.41 TiB     │ 153.1514570816546324468984334 │
│ Time      │ UInt32    │ 961.97 MiB │ 91.15 GiB    │   97.031511678716924468984334 │
│ Timestamp │ UInt32    │ 4.33 GiB   │ 91.15 GiB    │  21.0371487498284524468984334 │
│ Value     │ Float64   │ 55.59 GiB  │ 182.31 GiB   │  3.27941942487834224468984334 │
└───────────┴───────────┴────────────┴──────────────┴────────────────────┴─────────────┘

And one additional note, when I've tested the LowCardinality(String) as a Path column, it did consume the double amount of memory for the marks. And my servers were killed by OOM quite often, especially during the wide Time+Path requests. Usage of String CODEC(ZSTD(3)) is more reliable. At least, it was the state of Sep 2019. Not sure how it's now.

So, I still don't see use case to have anything other than 0 in Timestamp col. Why would there be a case that there are multiple rows on the same Path and Time and different value? It seems to me those would be very specific use-case that should not be the reason for a default setting that uses more storage than needed?

Because neither I nor you know the internals of GraphiteMergeTree engine, I'd say. And I feel uncomfortable making a decision for everybody to change the default behavior.

Besides that, with the proper codec, it doesn't hurt so much. Plus, it just came to my mind, it would make sense to add a custom TTL to the Timestamp column. Something pretty big to be sure, that rollup is done at least once. Then it will be pretty insignificant.

bzed commented

But instead of changing the table schema, is there an issue in using toYYYYMMDD() for partitioning? Optimizing > 1.5TB partitions down to ~100G just needs an useless amount of empty space. (Does anybody know why clickhouse requires the 100% overhead?)

It's something I can answer

But instead of changing the table schema, is there an issue in using toYYYYMMDD() for partitioning?

It depends more on the amount of data. Partitions bring overhead. And if you have broad historical inserts (like over a year), then the memory consumption for such batches would be huge. On the other hand, the optimizing overhead is minimized, so one should find a balance for himself. I personally have used partitions for 3 days, but a half-year ago migrated to toYYYYMMDD() and don't see issues. So it sounds like a solution I can suggest.

Does anybody know why clickhouse requires the 100% overhead?

It's inherited from the MergeTree engine. The 100% space allocation required to be safe during merging all parts together. The server can't know in advance how much space may be really needed and how aggressive aggregation will be, so can't predict the necessary allocation more precisely than just taking an existing size as a constant rule of thumb.

bzed commented

It's something I can answer

But instead of changing the table schema, is there an issue in using toYYYYMMDD() for partitioning?

It depends more on the amount of data. Partitions bring overhead. And if you have broad historical inserts (like over a year), then the memory consumption for such batches would be huge.

Basically I'm inserting ~400k rows/s with data from the last 10..300 seconds. Old data is rarely inserted.

On the other hand, the optimizing overhead is minimized, so one should find a balance for himself. I personally have used partitions for 3 days, but a half-year ago migrated to toYYYYMMDD() and don't see issues. So it sounds like a solution I can suggest.

Thanks, appreciated.

Hi guys, would graphite-clickhouse benefit from any projections? https://www.youtube.com/watch?v=jJ5VuLr2k5k

Hi guys, would graphite-clickhouse benefit from any projections? https://www.youtube.com/watch?v=jJ5VuLr2k5k

Yes, it definitely looks interesting. Looks like it brings the feature to define the only one table for both direct and reversed points at once. Thank you for popping it up! 👍