[Data] `num_rows_per_file` doesn't work with small values

Question

Opened this issue a month ago · 2 comments

What happened + What you expected to happen

My dataset contains 28k rows. I tried writing it to Parquet with `num_rows_per_file=700`. I expected several Parquet files with 700 rows each, but instead got a single file containing all the rows.
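
To make that expectation concrete, here is a minimal sketch of the arithmetic. It assumes the dataset has exactly 28,000 rows and that `num_rows_per_file` means a strict fixed-size split. `expected_file_sizes` is a hypothetical helper written for illustration, not part of Ray, and Ray's actual semantics for this parameter may differ.

```python
from __future__ import annotations


def expected_file_sizes(total_rows: int, num_rows_per_file: int) -> list[int]:
    """Row counts per file under a naive fixed-size split (hypothetical helper, not a Ray API)."""
    if total_rows < 0 or num_rows_per_file <= 0:
        raise ValueError("total_rows must be >= 0 and num_rows_per_file must be > 0")
    # Number of completely filled files, plus any leftover rows for a final smaller file.
    full_files, remainder = divmod(total_rows, num_rows_per_file)
    sizes = [num_rows_per_file] * full_files
    if remainder:
        sizes.append(remainder)
    return sizes


# Assumption: "28k" is taken to mean exactly 28,000 rows.
sizes = expected_file_sizes(28_000, 700)
print(len(sizes))
```

Under these assumptions the write would produce 40 files, whereas the observed result was a single file.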

Versions / Dependencies

4d37e55

Reproduction script

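import os

import ray

# Write 100 rows held in a single block, requesting 10 rows per file.
ray.data.range(100, override_num_blocks=1).write_parquet(
    "/tmp/results", num_rows_per_file=10
)

# Expected: multiple files of 10 rows each, not a single file with all rows.
print(os.listdir("/tmp/results"))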

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Answer 1 · 2024-06-03T17:25:46.000Z

@bveeramani assuming this is a won't-fix, I'll close this issue

Answer 2 · 2024-06-03T17:35:06.000Z

Actually, this should be fixed.