ChenghaoMou/text-dedup

Suffix array clean up

KeremTurgutlu opened this issue · 3 comments

I think that here, instead of removing slices from the string, we should remove them from the byte array, since the suffix array deduplication code returns byte offset pairs. Because a UTF-8 character can be represented by 1-4 bytes, this can lead to errors on any text that contains at least one character encoded with more than one byte.
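For example (a standalone illustration, not code from the repo), the byte offset of a duplicate no longer matches its string index as soon as a multi-byte character appears before it:

text = "café café"
second = text.encode("utf-8").find(b"caf\xc3\xa9", 1)
print(second)         # 6 -- byte offset of the second "café" ("é" takes two bytes)
print(text[second:])  # "afé" -- slicing the *string* with a byte offset is off by one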

Maybe something like this

import numpy as np

def clean_up(text, remove_pairs):
    """
    Remove duplicate substrings from the text.

    Parameters
    ----------
    text : str
        Original document text.
    remove_pairs : list of (start, end) byte-offset pairs returned by the suffix array step.
    """
    byte_array = np.array(list(text.encode("utf-8")))
    for start, end in remove_pairs:
        byte_array[start:end] = -1  # valid bytes are 0-255, so -1 marks bytes to drop
    return bytearray([byte for byte in byte_array if byte != -1]).decode("utf-8", "ignore")
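For example, with a hypothetical remove pair covering the second occurrence (bytes 5-11 of the UTF-8 encoding, i.e. " café"):

text = "café café"
print(clean_up(text, [(5, 11)]))  # -> "café"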

Let me know if I am missing something regarding the suffix_array.py code, which might already handle this.

Good catch. Thanks for pointing it out; I have fixed it in the latest commit.

No worries, this is the least I can do to contribute back. Also, here is a minor addition if you like:

The motivation is that there might be cases where a duplicate span of length k starts in one document and finishes in another. To avoid that, I believe the original repo inserts a unique separator between documents.

A similar thing can be done in this repo:

with timer("Preprocessing"):
    offsets: List[slice] = []
    start = 0
    with open(Path(args.google_repo_path) / temp_text, "wb") as f:
        for UID, doc in enumerate(ds):
            SEP = b"\xff\xff" + struct.pack("<I", UID)
            doc_bytes = SEP + doc[args.column].encode("utf-8")
            end = start + len(doc_bytes)
            offsets.append(slice(start, end))
            start = end
            f.write(doc_bytes)
def clean_up(text, slices, UID):
    """
    Remove duplicate substrings from the text.

    Parameters
    ----------
    text : str
        Original document text (without the separator).
    slices : byte-offset slices into the separator-prefixed document.
    UID : document id, needed to recompute the separator length.
    """
    SEP_LENGTH = len(b"\xff\xff" + struct.pack("<I", UID))
    byte_array = np.array(list(text.encode("utf-8")))
    for s in slices:
        # Shift each slice back by the separator length before removing it.
        byte_array[s.start - SEP_LENGTH:s.stop - SEP_LENGTH] = -1  # valid bytes are 0-255
    return bytearray([byte for byte in byte_array if byte != -1]).decode("utf-8", "ignore")

If I read this correctly, this is actually not needed in my implementation. It keeps track of the documents' byte-level boundaries (offsets) at the beginning, before merging everything into one giant string, and restores them when generating those slices, so each generated slice respects document boundaries without any special token.
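Roughly, the idea is something like the following sketch (not the repo's actual code; restore is a made-up name): the global byte slices produced by the dedup step are intersected with each document's recorded byte range and shifted back into document-local coordinates.

from typing import Dict, List

def restore(global_slices: List[slice], offsets: List[slice]) -> Dict[int, List[slice]]:
    # Map global byte slices back to per-document byte slices.
    per_doc: Dict[int, List[slice]] = {i: [] for i in range(len(offsets))}
    for g in global_slices:
        for doc_id, off in enumerate(offsets):
            lo, hi = max(g.start, off.start), min(g.stop, off.stop)
            if lo < hi:  # the global slice overlaps this document
                per_doc[doc_id].append(slice(lo - off.start, hi - off.start))
    return per_doc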