:nif_panicked "Chunk require all its arrays to have an equal number of rows"
mlineen opened this issue · 1 comment
I have a large (~771 MB, 7,421,520 rows, 78 columns) Parquet file from a vendor that I'm able to read with `Explorer.DataFrame.from_parquet`, but I am unable to write the loaded data frame back out with either `Explorer.DataFrame.dump_parquet` or `Explorer.DataFrame.to_parquet`. When I try, I get `(ErlangError) Erlang error: :nif_panicked` at `polars-arrow-0.38.3/src/chunk.rs:20:31`.
If I call `as_single_chunk` in Rust code, I am able to write the file. Likewise, if I call `set_rechunk` in Rust code when reading the file, I am then able to write the file.
Would the project be open to adding a `DataFrame.as_single_chunk` function and/or adding `set_rechunk` as an option to `DataFrame.read_parquet`? What would adding either of these mean in the context of the LazyFrame backend?

Would anyone have a good idea of how to generate synthetic data that exhibits this issue? I cannot pass along the file I have.
I think we should:

- Automatically rechunk the frame on `to_parquet`/`dump_parquet`
- Also add a `rechunk: true | false` option on `read_parquet`, as that may affect performance
PRs are definitely welcome although I am not sure we could test this trivially. :( PRs would be welcome regardless.