Add ability to sink_parquet into io.BytesIO()
Problem description
As I understand it, the initial purpose of sink_parquet was to stream large DataFrames to disk.
A compressed Parquet file can be much smaller than the in-memory DataFrame, so it would be useful to be able to write the data directly into memory when we want to do something with the file afterwards.
e.g.:
```python
df.sink_parquet(file: str | Path | BytesIO)
```
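To make the intent concrete, here is a sketch of how that might look once supported (hypothetical: `BytesIO` targets are exactly what is being requested here, and `large.csv` stands in for some large input):

```python
import io

import polars as pl

buf = io.BytesIO()

# Requested behavior: stream the query result directly into an
# in-memory buffer instead of a file on disk.
pl.scan_csv("large.csv").sink_parquet(buf)  # hypothetical until supported

# The compressed Parquet bytes could then be reused, e.g. uploaded
# to object storage.
data = buf.getvalue()
```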
In such a case, just use `collect().write_parquet()`.
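For reference, a minimal sketch of that workaround (`large.csv` is a stand-in input); `write_parquet` does accept a writable file-like object:

```python
import io

import polars as pl

buf = io.BytesIO()

# collect() materializes the full DataFrame in memory first;
# write_parquet() then serializes it into the buffer.
pl.scan_csv("large.csv").collect().write_parquet(buf)
```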
collect() stores the whole DataFrame in memory, which is too expensive with large data: in my case it can consume up to 10 GB of memory, while the compressed Parquet file is only about 150 MB.
On Linux, at least, you should be able to use the following to write to shared memory:

```python
df.lazy().sink_parquet("/dev/shm/temp.parquet")
```
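A fuller sketch of that approach, reading the compressed bytes back into a `BytesIO` afterwards (`large.csv` is a stand-in input):

```python
import io

import polars as pl

# /dev/shm is a tmpfs on most Linux systems, so the sink target
# lives in RAM rather than on disk.
shm_path = "/dev/shm/temp.parquet"
pl.scan_csv("large.csv").sink_parquet(shm_path)

# Load the (much smaller) compressed file into an in-memory buffer.
with open(shm_path, "rb") as f:
    buf = io.BytesIO(f.read())
```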
@ghuls My use case for the above is in conjunction with fsspec, where I have heterogeneous storage backends that I want to sink_parquet to. Do you have any suggestions on how to do this?
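One interim pattern would be to sink to a local path and then copy the bytes through fsspec (a sketch only; the `s3://` URL and `large.csv` are hypothetical):

```python
import fsspec
import polars as pl

# Sink to tmpfs first (Linux-only), then stream the compressed bytes
# to whatever backend fsspec resolves from the URL.
local_path = "/dev/shm/temp.parquet"
pl.scan_csv("large.csv").sink_parquet(local_path)

with open(local_path, "rb") as src, fsspec.open("s3://my-bucket/temp.parquet", "wb") as dst:
    dst.write(src.read())
```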