google-research/deduplicate-text-datasets

remove_ex in finish_dedup_wiki40b

Closed · 1 comment

Thanks for your excellent code.

I have successfully rerun the exact-dedup code in this repository. However, I have a question about the following code:

    remove_ex[i].append((max(int(remove[ptr][0] - byte_start - 6), 0),
                         min(int(remove[ptr][1] - byte_start), byte_end-byte_start)))

I understand what the "6" means, but why isn't "6" also subtracted on the right-hand side (the end offset)?

The 6-byte offset applies only to the start of the range (the left side). The end is still the last byte of the range, so it doesn't need to change or be shifted by 6 bytes.
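To make the asymmetry concrete, here is a minimal sketch that mirrors the expression from the question. The function name `to_local_range`, the `header_len` parameter, and the sample numbers are hypothetical and used only for illustration; the sketch assumes the "6" is a fixed-length offset applied to the start of each range, as described above.

```python
HEADER_LEN = 6  # assumed fixed offset applied to the start only


def to_local_range(global_start, global_end, byte_start, byte_end,
                   header_len=HEADER_LEN):
    """Convert a global byte range into offsets relative to one example.

    Hypothetical helper: only the start is shifted by header_len; the end
    is merely rebased to the example and clipped to its length.
    """
    local_start = max(int(global_start - byte_start - header_len), 0)
    local_end = min(int(global_end - byte_start), byte_end - byte_start)
    return local_start, local_end


# Example spanning global bytes [100, 200); range to remove is [110, 130).
print(to_local_range(110, 130, 100, 200))  # (4, 30)

# A range starting inside the first 6 bytes is clamped to 0 on the left.
print(to_local_range(103, 130, 100, 200))  # (0, 30)

# A range running past the example is clipped to its length on the right.
print(to_local_range(150, 250, 100, 200))  # (44, 100)
```

Subtracting 6 from the end as well would shorten every range by 6 bytes at its tail, leaving the last bytes of the range in place.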