[Data] `num_rows_per_file` doesn't work with small values
Opened this issue · 2 comments
bveeramani commented
What happened + What you expected to happen
My dataset contains 28k rows. I tried writing this to Parquet with `num_rows_per_file=700`. I expected several Parquet files each with 700 rows, but instead got a single file with all the rows.
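To make the expectation concrete, here is a rough sketch of the arithmetic behind it. It assumes "28k" means exactly 28,000 rows and that each file would hold at most `num_rows_per_file` rows; `expected_file_count` is a hypothetical helper for illustration, not part of Ray, and it describes the behavior expected in this report rather than anything Ray guarantees.

```python
import math


def expected_file_count(total_rows: int, rows_per_file: int) -> int:
    """Hypothetical helper: number of files if each holds at most rows_per_file rows."""
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    # Round up so that a final, partially filled file is still counted.
    return math.ceil(total_rows / rows_per_file)


# Assumption: "28k rows" is treated as exactly 28,000 rows.
print(expected_file_count(28_000, 700))  # 40
# The case from the reproduction script: 100 rows, 10 rows per file.
print(expected_file_count(100, 10))  # 10
```

So under these assumptions I would expect roughly 40 files for my dataset and 10 files for the small reproduction, not one.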
Versions / Dependencies
Reproduction script
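import os

import ray

# Build a 100-row dataset in a single block, then request 10 rows per file.
ray.data.range(100, override_num_blocks=1).write_parquet(
    "/tmp/results", num_rows_per_file=10
)
# Expected: 10 files with 10 rows each; the report describes a single file instead.
print(os.listdir("/tmp/results"))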
Issue Severity
Medium: It is a significant difficulty but I can work around it.
raulchen commented
@bveeramani assuming this is a won't-fix, I'll close this issue
raulchen commented
Actually, this should be fixed.