NVIDIA/NeMo-Curator

[BUG] Jaccard Shuffle error if merge result is empty

ayushdg opened this issue · 2 comments

Describe the bug

If the merge result between the text and bucket-mapping DataFrames is empty for any iteration, the logic fails.

Failure is observed here but originates from

being empty.
Still working on a minimal repro.

Additional context

The fix should be to skip to the next loop iteration (`continue`) when the merged DataFrame has zero rows.
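A minimal sketch of that guard, assuming pandas in place of the GPU DataFrames used in the real code path. The function names, column names (`doc_id`, `bucket`), and the downstream reduction are all hypothetical stand-ins, not NeMo-Curator's actual API:

```python
import pandas as pd


def merge_text_with_buckets(text_df, bucket_df, id_col="doc_id"):
    """Hypothetical stand-in for the per-iteration text/bucket merge."""
    return text_df.merge(bucket_df, on=id_col, how="inner")


def process_partitions(text_parts, bucket_df):
    """Process each text partition, skipping iterations whose merge is empty."""
    results = []
    for text_df in text_parts:
        merged = merge_text_with_buckets(text_df, bucket_df)
        if len(merged) == 0:
            # Nothing to shuffle for this iteration; move on instead of
            # handing an empty frame to the downstream reduction.
            continue
        # Placeholder for the downstream step: a NumPy max reduction,
        # which raises ValueError on a zero-size array.
        max_bucket = merged["bucket"].to_numpy().max()
        results.append((len(merged), int(max_bucket)))
    return results


text_parts = [
    pd.DataFrame({"doc_id": [1, 2], "text": ["a", "b"]}),
    pd.DataFrame({"doc_id": [10, 11], "text": ["c", "d"]}),  # no matching ids
]
bucket_df = pd.DataFrame({"doc_id": [1, 2, 3], "bucket": [0, 5, 7]})
results = process_partitions(text_parts, bucket_df)
```

The second partition shares no ids with the bucket mapping, so its merge is empty and the loop skips it rather than raising.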

The error looks like: `ValueError: zero-size array to reduction operation maximum which has no identity`
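For reference, this wording matches what NumPy raises when `max` is called on an empty array. The issue does not say which call in the shuffle path triggers it, so the snippet below only illustrates the error class and is not a repro of the bug:

```python
import numpy as np

# An empty array, as would result from a merge with no matching rows.
empty = np.array([], dtype=np.int64)

try:
    empty.max()
except ValueError as exc:
    message = str(exc)
    print(message)

# Checking the size first avoids the reduction on empty input.
safe_max = int(empty.max()) if empty.size > 0 else None
```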

Will be assigning to Vibhu to investigate.

@VibhuJawa has a fix that needs testing.