pola-rs/polars

Add ability to sink_parquet into io.BytesIO()

Opened this issue · 5 comments

s-b90 commented

Problem description

As I understand it, sink_parquet's initial purpose was to stream a large DataFrame to disk.
A compressed Parquet file may be much smaller in memory than the DataFrame itself, so it would be useful to be able to write the data directly into memory when we want to do something with the file afterwards.
e.g.:
df.sink_parquet(file: str | Path | BytesIO)

In such a case, just write collect().write_parquet().

s-b90 commented

collect() stores the whole DataFrame in memory, which is too expensive with large data; in my case it may consume up to 10 GB of memory, while the compressed Parquet file is only about 150 MB.

ghuls commented

On at least Linux, you should be able to use the following to write to shared memory:

df.lazy().sink_parquet("/dev/shm/temp.parquet")
s-b90 commented

Thanks, @ghuls, it may help!

@ghuls My use case for the above is in conjunction with fsspec where I have heterogeneous storage backends that I want to sink_parquet to. Do you have any suggestions on how to do this?