NVIDIA/NeMo-Curator

Fuzzy dedup error if partition-wise indices do not start from 0

ayushdg opened this issue · 0 comments

Describe the bug

By default, when reading from JSON/Parquet files without an index specified, Curator typically reads in each partition with an index ranging from 0->len(partition). However, for dataframes where this is not the case, fuzzy dedup might fail.

Steps/Code to reproduce bug

A reproducer is in the #46 tests. The root cause seems to be coming from

left_df["_partitions"] = global_partitioning_index % parts_per_bucket_batch
where the left-hand-side dataframe might have different indices but the right-hand side starts from 0, resulting in a misaligned assignment.
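To illustrate the suspected mechanism, here is a minimal sketch that uses plain pandas rather than the cuDF/Dask dataframes Curator actually operates on, so the exact failure mode in Curator may differ. The data values and the `fixed_df` workaround are hypothetical and only for illustration; they are not taken from the Curator code base.

```python
import pandas as pd

# Hypothetical stand-in for one partition whose index does not start at 0.
left_df = pd.DataFrame({"id": [10, 11, 12]}, index=[5, 6, 7])

# Hypothetical stand-in for the right-hand side, indexed 0..len-1.
global_partitioning_index = pd.Series([0, 1, 2])
parts_per_bucket_batch = 2

# Column assignment from a Series aligns on index labels, not on position.
# None of the labels 5, 6, 7 exist on the right-hand side, so every row is NaN.
left_df["_partitions"] = global_partitioning_index % parts_per_bucket_batch
n_missing = int(left_df["_partitions"].isna().sum())

# Illustration only (not necessarily the right fix for Curator):
# resetting the index makes the labels match, so values land row by row.
fixed_df = left_df.drop(columns="_partitions").reset_index(drop=True)
fixed_df["_partitions"] = global_partitioning_index % parts_per_bucket_batch
fixed_values = fixed_df["_partitions"].tolist()

print(n_missing, fixed_values)
```

In this sketch the mismatch silently produces missing values instead of raising, which is one way a downstream step could then fail.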

Expected behavior

No errors