apache/arrow-go

[Go][Parquet] Inconsistency in RowGroup 'total byte size' field values between implementations

Closed this issue · 5 comments

Describe the usage question you have. Please include as many useful details as possible.

parquet.thrift in parquet-format repo describes RowGroup total_byte_size field meaning as

Total byte size of all the uncompressed column data in this row group

This is also the case for C++ implementation of parquet in arrow repo.

But in case of Go implementation total_byte_size is described as

TotalByteSize is the total size of this rowgroup on disk

The difference between these values can be large, when compression is applied to column chunks.

My question is: Is that intentional inconsistency with format definition and other implementations? And if so, why does this distinction has been made?

Component(s)

Go, Parquet

P.S.:
This issue is duplicate of apache/arrow#44205, but as I see this repo is now main location of Arrow Go

Honestly, I think it was just a misunderstanding on my part when I wrote this and no one else noticed or caught it.

It seems like an easy fix honestly, aside from the comment, it should be fairly easy to fix the RowGroupMetaDataBuilder so that it uses the correct value to write for it.

I'll try to get to this before the end of this week, unless you'd like to take a stab at it yourself 😄

Thanks for reply! I'll try to take a look, but no promises :)

So looking at the C++ implementation again, it looks like despite the comment saying that total_byte_size is the uncompressed size of all the columns, it is actually populating it with the total_bytes_written, i.e. the total size on disk. Given that I based much of the Go implementation on the C++ implementation, it's not surprising that this bug exists in both.

We should file an equivalent issue on the Arrow repo for the C++ parquet implementation. In the meantime i'll fix the Go impl

This is strange, because from my experiments with different parquet implementations (including Go and C++ through Python from arrow repo) I've seen that C++ one writes correct total_byte_size.

Also thank you for such quick fix!

Closed by #144