Reduce `__sequence` field size in parquet files

Question

Opened this issue a month ago · 0 comments

What type of enhancement is this?

Refactor

What does the enhancement do?

In our analysis of a Parquet file, the `__sequence` field takes up a disproportionate share of the file: approximately 67% of the total size. This wastes storage and can become a performance bottleneck.

File: 9bc23ce8-7046-4ff8-a209-1245827a7a89.parquet

| Column name | Size (bytes) | Share of total |
| --- | ---: | --- |
| `__op_type` | 54,825 | 0.00016 (0.016%) |
| `greptime_value` | 39,894,514 | 0.117 (11.75%) |
| `__sequence` | 228,302,552 | 0.672 (67.23%) |
| `__primary_key` | 18,000,415 | 0.053 (5.30%) |
| `greptime_timestamp` | 53,318,216 | 0.157 (15.70%) |

The `__sequence` field clearly dominates the file size: it is about twice as large as all other columns combined, including `greptime_value` and `greptime_timestamp`.
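
For reference, the shares in the table can be reproduced from the raw byte counts. The snippet below is a minimal sketch that assumes each ratio is the column size divided by the sum of the five column sizes (not the on-disk file size including footer metadata); the helper name `size_ratios` is hypothetical and not part of any GreptimeDB tooling.

```python
# Byte counts copied from the table in this issue.
column_bytes = {
    "__op_type": 54_825,
    "greptime_value": 39_894_514,
    "__sequence": 228_302_552,
    "__primary_key": 18_000_415,
    "greptime_timestamp": 53_318_216,
}


def size_ratios(sizes):
    """Return each column's share of the summed column sizes.

    Assumption: the denominator is the sum of the listed columns,
    which matches the percentages reported in the table.
    """
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


ratios = size_ratios(column_bytes)

# Print columns from largest to smallest share.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {column_bytes[name]:>13,} {ratio:.2%}")
```

Running the same calculation on other files would show whether this distribution is typical or specific to this file.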

Implementation challenges

No response