segmentio/parquet-go

Control row group size

Closed this issue · 4 comments

Hello, I haven't found a way to control the row group size.

Yes, I can call Flush, but how do I know when a row group has reached a size limit (1GB, for example)?

Hello @yonesko

There is currently no control of the row group size in bytes. Since parquet columns are encoded and compressed, I would like to ask what size you would need to control: would it be the compressed size of the row group on disk, or the total decoded size?

The compressed size of the row group.

Do you mind providing a bit more context on the use case that would require controlling the size of a row group on disk?

We have a big (4GB) parquet file with a single row group, and Amazon Athena fails to read it with "GENERIC_INTERNAL_ERROR: integer overflow".
When we limit the row group by number of rows instead, the error disappears.
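
For anyone landing here with the same problem, the row-count workaround can be sketched roughly as below. This is only an illustration, not the library's API: `rowWriter`, `rowLimitWriter` and `fakeWriter` are hypothetical names, and the sketch assumes the real writer exposes `Write(row interface{}) error` and a `Flush() error` that closes the current row group, as the Flush call mentioned above suggests.

```go
package main

import "fmt"

// rowWriter is a hypothetical interface with the two methods this sketch
// relies on. A real parquet writer is assumed (not verified here) to satisfy it.
type rowWriter interface {
	Write(row interface{}) error
	Flush() error
}

// rowLimitWriter flushes the underlying writer every maxRows rows,
// so each row group holds at most maxRows rows.
type rowLimitWriter struct {
	w       rowWriter
	maxRows int
	pending int // rows written since the last flush
}

func (l *rowLimitWriter) Write(row interface{}) error {
	if err := l.w.Write(row); err != nil {
		return err
	}
	l.pending++
	if l.pending >= l.maxRows {
		return l.Flush()
	}
	return nil
}

// Flush closes the current row group if it contains any rows.
func (l *rowLimitWriter) Flush() error {
	if l.pending == 0 {
		return nil
	}
	if err := l.w.Flush(); err != nil {
		return err
	}
	l.pending = 0
	return nil
}

// fakeWriter stands in for a real parquet writer in this example and
// records how many rows ended up in each flushed group.
type fakeWriter struct {
	buffered int
	groups   []int
}

func (f *fakeWriter) Write(row interface{}) error {
	f.buffered++
	return nil
}

func (f *fakeWriter) Flush() error {
	f.groups = append(f.groups, f.buffered)
	f.buffered = 0
	return nil
}

func main() {
	fake := &fakeWriter{}
	w := &rowLimitWriter{w: fake, maxRows: 3}
	for i := 0; i < 7; i++ {
		if err := w.Write(i); err != nil {
			panic(err)
		}
	}
	// Flush the final, partially filled group.
	if err := w.Flush(); err != nil {
		panic(err)
	}
	fmt.Println(fake.groups) // prints [3 3 1]
}
```

A row limit only bounds the on-disk size indirectly, so `maxRows` has to be picked from the typical encoded size of a row in your data; it does not enforce a byte limit.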