dask/fastparquet

Parquet files can't exceed 2.14GB? Write throws overflow errors when filesize in bytes exceeds int32 limit...

sikanrong opened this issue · 4 comments

I'm having an issue storing a large dataset (around 40GB) in a single parquet file. I'm using fastparquet to append pandas.DataFrames to this parquet dataset file.

The following is a minimal example program that appends chunks to a parquet file until it crashes as the file size in bytes exceeds the int32 threshold of 2147483647 (2.1GB):

Link to minimum reproducible example code

Everything goes fine until the dataset hits 2.1GB, at which point I get the following errors:

OverflowError: value too large to convert to int
Exception ignored in: 'fastparquet.cencoding.write_thrift'

Because the exception is ignored internally, it's very hard to figure out which specific thrift structure fails on write, or to get a stack trace to provide here. However, it's very clear that the failure is directly linked to the file size in bytes exceeding the int32 range.
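
To make the arithmetic concrete, here is a minimal sketch (plain Python, not fastparquet code) of why a byte count past 2147483647 cannot be stored in a signed 32-bit field but fits easily in a 64-bit one. The helper names `fits_in_int32` and `fits_in_int64` are made up for illustration, and the sketch makes no claim about which thrift field actually overflows in `write_thrift`.

```python
import struct

# Largest value a signed 32-bit integer can hold: the threshold quoted above.
INT32_MAX = 2**31 - 1  # 2147483647


def fits_in_int32(value: int) -> bool:
    """Return True if `value` can be packed as a signed 32-bit integer."""
    try:
        struct.pack("<i", value)  # "<i" = little-endian signed 32-bit
        return True
    except struct.error:
        return False


def fits_in_int64(value: int) -> bool:
    """Return True if `value` can be packed as a signed 64-bit integer."""
    try:
        struct.pack("<q", value)  # "<q" = little-endian signed 64-bit
        return True
    except struct.error:
        return False


if __name__ == "__main__":
    # Hypothetical byte counts: just under, at, and just past the limit,
    # plus roughly 40 GB (the dataset size mentioned in this issue).
    for size in (INT32_MAX - 1, INT32_MAX, INT32_MAX + 1, 40 * 10**9):
        print(size, fits_in_int32(size), fits_in_int64(size))
```

Any size or offset that is tracked in a 32-bit integer would hit this wall at the same point where the writes start failing, which is consistent with the symptom but does not by itself show where the bug is.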

Also, these thrift definitions come from the parquet-format repo itself, so I wonder whether this is a limitation built into the design of the parquet format or an error in fastparquet...

I opened a related issue about this yesterday, but now I've been able to condense it into a very concise program that reproduces the issue. Everyone keeps assuring me that this program shouldn't crash when the file hits 2.14GB and that the parquet format has no inherent size limitations, and yet it does crash in a way that suggests exactly those limitations. Someone on the fastparquet team should run this code and confirm whether it represents an issue with the library.

I'm on Python 3.9.0 and fastparquet==2022.11.0

@martindurant Hi again! We talked about this yesterday. I've provided some dead-simple code so that you can reproduce this issue yourself; I really feel like the code I've provided ought to work without issue.

Thanks for the reproducer, that's very helpful. You may want to try with #824 before I get around to trying with your code.

@martindurant In the example (and this is also the case in my real code), all that's being written are float32 values. In the example code they're all set to 1, for simplicity.

Perfect - fixed in 7961b11 (note the one-liner!)