[Go][Parquet] Inconsistency in RowGroup 'total byte size' field values between implementations
Closed this issue · 5 comments
Describe the usage question you have. Please include as many useful details as possible.
parquet.thrift
in parquet-format repo describes RowGroup
total_byte_size
field meaning as
Total byte size of all the uncompressed column data in this row group
This is also the case for C++ implementation of parquet in arrow repo.
But in case of Go implementation total_byte_size
is described as
TotalByteSize is the total size of this rowgroup on disk
The difference between these values can be large, when compression is applied to column chunks.
My question is: Is that intentional inconsistency with format definition and other implementations? And if so, why does this distinction has been made?
Component(s)
Go, Parquet
P.S.:
This issue is duplicate of apache/arrow#44205, but as I see this repo is now main location of Arrow Go
Honestly, I think it was just a misunderstanding on my part when I wrote this and no one else noticed or caught it.
It seems like an easy fix honestly, aside from the comment, it should be fairly easy to fix the RowGroupMetaDataBuilder
so that it uses the correct value to write for it.
I'll try to get to this before the end of this week, unless you'd like to take a stab at it yourself 😄
Thanks for reply! I'll try to take a look, but no promises :)
So looking at the C++ implementation again, it looks like despite the comment saying that total_byte_size
is the uncompressed size of all the columns, it is actually populating it with the total_bytes_written, i.e. the total size on disk. Given that I based much of the Go implementation on the C++ implementation, it's not surprising that this bug exists in both.
We should file an equivalent issue on the Arrow repo for the C++ parquet implementation. In the meantime i'll fix the Go impl
This is strange, because from my experiments with different parquet implementations (including Go and C++ through Python from arrow repo) I've seen that C++ one writes correct total_byte_size
.
Also thank you for such quick fix!