pola-rs/polars

Add ability to sink_parquet into io.BytesIO()

Opened this issue · 5 comments

s-b90 commented

Problem description

As I understand it, sink_parquet's initial purpose was to stream a large DataFrame to disk.
A compressed Parquet file may be much smaller in memory than the DataFrame itself, so it would be useful to be able to write the data directly into memory when we want to do something with the file afterwards.
e.g.:
df.sink_parquet(file: str | Path | BytesIO)

In such a case, just write collect().write_parquet().

s-b90 commented

collect() stores the whole DataFrame in memory, which is too expensive with large data; in my case it may consume up to 10 GB of memory, while the compressed Parquet file is only about 150 MB.

ghuls commented

On at least Linux, you should be able to use the following to write to shared memory:

df.lazy().sink_parquet("/dev/shm/temp.parquet")
s-b90 commented

Thanks, @ghuls, it may help!

@ghuls My use case for the above is in conjunction with fsspec where I have heterogeneous storage backends that I want to sink_parquet to. Do you have any suggestions on how to do this?