NVIDIA/NeMo-Curator

[BUG] Jaccard Shuffle error if shuffled_docs.parquet data already exists and has been written.

ayushdg opened this issue · 0 comments

Describe the bug

Calling jaccard_shuffle on an output directory that already contains shuffled docs from a previous run fails with the following assertion error:

 assert bucket_part_start_offset % parts_per_bucket_batch == 0
AssertionError
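
A possible workaround until this is fixed is to clear the leftover output before re-running the shuffle. The sketch below assumes the stale data lives at shuffled_docs.parquet directly under the output directory (the name is taken from the issue title; the exact location is an assumption) and that deleting it is acceptable. The helper clear_stale_shuffle_output is hypothetical and not part of NeMo Curator, and this report does not confirm that removing the data avoids the assertion.

```python
import shutil
from pathlib import Path


def clear_stale_shuffle_output(output_dir, name="shuffled_docs.parquet"):
    """Remove leftover shuffle output so the next run starts from a clean directory.

    Hypothetical helper, not part of NeMo Curator. `name` defaults to the
    dataset name mentioned in the issue title; its location is an assumption.
    Returns True if something was removed, False if nothing was there.
    """
    target = Path(output_dir) / name
    if target.is_dir():
        # Parquet datasets are usually written as a directory of part files.
        shutil.rmtree(target)
        return True
    if target.exists():
        # Fall back to a single-file output.
        target.unlink()
        return True
    return False
```

Call it with the same output directory that is passed to jaccard_shuffle, immediately before starting the shuffle step again.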