Reduce `__sequence` field size in parquet files
WenyXu commented
What type of enhancement is this?
Refactor
What does the enhancement do?
In our Parquet file analysis, the `__sequence` field occupies a disproportionate amount of file size, accounting for approximately 67% of the total size. This results in inefficient storage usage and potential performance bottlenecks.
File: 9bc23ce8-7046-4ff8-a209-1245827a7a89.parquet
| Column Name | Size (Bytes) | Size (Ratio) |
|---|---|---|
| `__op_type` | 54,825 | 0.00016 (0.016%) |
| `greptime_value` | 39,894,514 | 0.117 (11.75%) |
| `__sequence` | 228,302,552 | 0.672 (67.23%) |
| `__primary_key` | 18,000,415 | 0.053 (5.30%) |
| `greptime_timestamp` | 53,318,216 | 0.157 (15.70%) |
The `__sequence` field clearly dominates the file size, overshadowing other important columns such as `greptime_value` and `greptime_timestamp`.
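For anyone who wants to re-check the ratios, here is a minimal sketch that recomputes them from the byte counts reported in this issue. It assumes the denominator is the sum of the five listed columns (footer and other file overhead are ignored), and the names `COLUMN_SIZES` and `size_ratios` are made up for illustration; they are not part of any GreptimeDB tooling.

```python
# Column sizes in bytes, copied from the table in this issue.
COLUMN_SIZES = {
    "__op_type": 54_825,
    "greptime_value": 39_894_514,
    "__sequence": 228_302_552,
    "__primary_key": 18_000_415,
    "greptime_timestamp": 53_318_216,
}


def size_ratios(sizes):
    """Return each column's share of the summed column sizes.

    Assumption: the total is the sum of the listed columns only.
    """
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


if __name__ == "__main__":
    ratios = size_ratios(COLUMN_SIZES)
    # Print columns from largest to smallest share.
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<20} {COLUMN_SIZES[name]:>12,} {ratio:7.2%}")
```

Under that assumption the five columns sum to 339,570,522 bytes, which agrees with the percentages listed in the table.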
Implementation challenges
No response