google-research/deduplicate-text-datasets

Simple test

KeremTurgutlu opened this issue · 0 comments

Thanks a lot for open sourcing your amazing work!

I was just getting my hands dirty with the code and wrote a simple test:

from text_dedup.exact_dedup import GoogleSuffixArrayDeduplicator

k=3
deduplicator = GoogleSuffixArrayDeduplicator(k=k, google_repo_path="deduplicate-text-datasets")
texts = ['aaaaaaaaaaaabbbccccc', 'aaaaaaaaaaaaccccc']
slices = deduplicator.fit_predict(texts)

print(f"k:{k}")
print(slices)

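def remove_slice_list(text, slice_list):
    # Assumes the slices are sorted by start and do not overlap.
    offset = 0  # number of characters already removed from text
    for s in slice_list:
        text = text[:s.start - offset] + text[s.stop - offset:]
        offset += s.stop - s.start
    return text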

for slice_list,text in zip(slices, texts):
    if slice_list != []:
        print(f"{text} -> {remove_slice_list(text, slice_list)}")

# which prints:
[[slice(0, 12, None)], [slice(0, 12, None)]]
aaaaaaaaaaaabbbccccc -> bbbccccc
aaaaaaaaaaaaccccc -> ccccc

Shouldn't the c's be removed from both strings as well? This may just be my unfamiliarity with the algorithm; I'm simply curious.

# expected
aaaaaaaaaaaabbbccccc -> bbb
aaaaaaaaaaaaccccc -> 

For example, I imagine that in a real-world scenario we would want to remove both the repeated headers and the repeated footers of a website.
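To make the expectation concrete, here is a brute-force sketch of what I had in mind. It is not how the suffix-array code in the repo works; `naive_duplicate_slices` is a hypothetical helper, and it assumes "duplicate" means any substring of length k that occurs at least twice anywhere in the corpus, overlapping occurrences included.

```python
from collections import Counter


def naive_duplicate_slices(texts, k):
    """Per text, return merged slices covering every length-k window
    that occurs at least twice in the whole corpus (brute force)."""
    # Count every length-k window across all texts.
    counts = Counter(
        text[i:i + k]
        for text in texts
        for i in range(len(text) - k + 1)
    )
    result = []
    for text in texts:
        merged = []
        for i in range(len(text) - k + 1):
            if counts[text[i:i + k]] < 2:
                continue
            if merged and i <= merged[-1].stop:
                # Window touches or overlaps the previous slice: extend it.
                merged[-1] = slice(merged[-1].start, i + k)
            else:
                merged.append(slice(i, i + k))
        result.append(merged)
    return result


texts = ['aaaaaaaaaaaabbbccccc', 'aaaaaaaaaaaaccccc']
print(naive_duplicate_slices(texts, k=3))
# [[slice(0, 12, None), slice(15, 20, None)], [slice(0, 17, None)]]
```

Under that assumption the c's are covered in both strings, which is where the expected output above comes from.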

I also tried the example from the README:

k=4
deduplicator = GoogleSuffixArrayDeduplicator(k=k, google_repo_path="deduplicate-text-datasets")
# texts = ['eabcdfgh . efabcdgh']
texts = ['abcdefgh . efabcdgh']
slices = deduplicator.fit_predict(texts)

print(f"k:{k}")
print(slices)

clean_texts = []
for slice_list,text in zip(slices, texts):
    if slice_list != []:
        clean_text = remove_slice_list(text, slice_list)
        print(f"{text} -> {clean_text}")
        clean_texts.append(clean_text)
    else:
        clean_texts.append(text)

texts = clean_texts
slices = deduplicator.fit_predict(texts)
print(slices)
for slice_list,text in zip(slices, texts):
    if slice_list != []:
        clean_text = remove_slice_list(text, slice_list)
        print(f"{text} -> {clean_text}")

# prints
[[slice(0, 4, None), slice(13, 17, None)]]
abcdefgh . efabcdgh -> efgh . efgh

[[slice(0, 4, None)]]
efgh . efgh ->  . efgh

# expected
efgh . efgh ->  .
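Since the second pass surfaced new duplicates in the example above, the passes could also be repeated until nothing more is returned. This is a minimal sketch, assuming `find_slices` is any callable with the same shape as `deduplicator.fit_predict` (list of texts in, list of slice lists out). `remove_slices`, `dedup_until_stable`, and `max_passes` are hypothetical names, not part of the library.

```python
def remove_slices(text, slice_list):
    """Delete sorted, non-overlapping slices from text."""
    kept, cursor = [], 0
    for s in slice_list:
        kept.append(text[cursor:s.start])
        cursor = s.stop
    kept.append(text[cursor:])
    return "".join(kept)


def dedup_until_stable(texts, find_slices, max_passes=10):
    """Apply find_slices repeatedly until it reports no duplicates
    or max_passes is reached; return the cleaned texts."""
    for _ in range(max_passes):
        slices = find_slices(texts)
        if not any(slices):
            # Every per-text slice list is empty: nothing left to remove.
            break
        texts = [remove_slices(t, s) for t, s in zip(texts, slices)]
    return texts
```

With the deduplicator from the snippets above, the call would look like `dedup_until_stable(texts, deduplicator.fit_predict)`. I have not checked whether repeated passes are the intended way to use the tool.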