Simple test
KeremTurgutlu opened this issue · 0 comments
KeremTurgutlu commented
Thanks a lot for open sourcing your amazing work!
I was just getting my hands dirty with the code and wrote a simple test:
from text_dedup.exact_dedup import GoogleSuffixArrayDeduplicator
k=3
deduplicator = GoogleSuffixArrayDeduplicator(k=k, google_repo_path="deduplicate-text-datasets")
texts = ['aaaaaaaaaaaabbbccccc', 'aaaaaaaaaaaaccccc']
slices = deduplicator.fit_predict(texts)
print(f"k:{k}")
print(slices)
def remove_slice_list(text, slice_list):
    offset = 0
    for s in slice_list:
        text = text[:(s.start - offset)] + text[(s.stop - offset):]
        offset += s.stop - s.start
    return text

for slice_list, text in zip(slices, texts):
    if slice_list != []:
        print(f"{text} -> {remove_slice_list(text, slice_list)}")
# which prints:
[[slice(0, 12, None)], [slice(0, 12, None)]]
aaaaaaaaaaaabbbccccc -> bbbccccc
aaaaaaaaaaaaccccc -> ccccc
Shouldn't the c's be removed from both strings as well? This may just be my unfamiliarity with the algorithm; I'm mostly curious.
# expected
aaaaaaaaaaaabbbccccc -> bbb
aaaaaaaaaaaaccccc ->
For example, I imagine that in a real-world scenario we would want to remove both the repeated headers and the repeated footers of a website.
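To make the expectation above concrete, here is a minimal brute-force sketch of the behavior I had in mind. It is not how `GoogleSuffixArrayDeduplicator` works internally; `duplicated_spans` and `remove_spans` are hypothetical helpers written only for this comparison, and the sketch assumes that "duplicate" means any length-`k` substring that occurs at least twice anywhere in the corpus, overlapping occurrences included.

```python
from collections import Counter


def duplicated_spans(texts, k):
    """Return, per text, slices covering every character that lies inside
    a length-k substring occurring at least twice across all texts."""
    # Count every length-k window in the corpus (overlapping windows included).
    counts = Counter(
        text[i:i + k] for text in texts for i in range(len(text) - k + 1)
    )
    result = []
    for text in texts:
        covered = [False] * len(text)
        for i in range(len(text) - k + 1):
            if counts[text[i:i + k]] > 1:
                covered[i:i + k] = [True] * k
        # Collapse runs of covered characters into slices.
        slices, start = [], None
        for i, flag in enumerate(covered + [False]):
            if flag and start is None:
                start = i
            elif not flag and start is not None:
                slices.append(slice(start, i))
                start = None
        result.append(slices)
    return result


def remove_spans(text, slices):
    # Keep only the characters outside the given sorted, non-overlapping slices.
    kept, pos = [], 0
    for s in slices:
        kept.append(text[pos:s.start])
        pos = s.stop
    kept.append(text[pos:])
    return "".join(kept)


texts = ['aaaaaaaaaaaabbbccccc', 'aaaaaaaaaaaaccccc']
spans = duplicated_spans(texts, k=3)
cleaned = [remove_spans(t, s) for t, s in zip(texts, spans)]
print(spans)
print(cleaned)  # ['bbb', '']
```

Under that assumption the run of c's is flagged in both strings, which is the output I expected from the test above.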
I also tried the example from the README, running the deduplicator a second time on its own output:
k=4
deduplicator = GoogleSuffixArrayDeduplicator(k=k, google_repo_path="deduplicate-text-datasets")
# texts = ['eabcdfgh . efabcdgh']
texts = ['abcdefgh . efabcdgh']
slices = deduplicator.fit_predict(texts)
print(f"k:{k}")
print(slices)
clean_texts = []
for slice_list, text in zip(slices, texts):
    if slice_list != []:
        clean_text = remove_slice_list(text, slice_list)
        print(f"{text} -> {clean_text}")
        clean_texts.append(clean_text)
    else:
        clean_texts.append(text)
texts = clean_texts
slices = deduplicator.fit_predict(texts)
print(slices)
for slice_list, text in zip(slices, texts):
    if slice_list != []:
        clean_text = remove_slice_list(text, slice_list)
        print(f"{text} -> {clean_text}")
# prints
[[slice(0, 4, None), slice(13, 17, None)]]
abcdefgh . efabcdgh -> efgh . efgh
[[slice(0, 4, None)]]
efgh . efgh -> . efgh
# expected
efgh . efgh -> .
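The two manual passes above can be generalized into a loop that repeats until nothing more is flagged. The following is only a sketch: `dedup_until_stable` and `cut` are hypothetical helpers, and `naive_duplicate_slices` is a brute-force stand-in for `fit_predict` (it flags every occurrence of a repeated length-`k` substring), not the library's actual algorithm.

```python
from collections import Counter


def naive_duplicate_slices(texts, k):
    # Stand-in for fit_predict (an assumption, not the library's algorithm):
    # flag every length-k window that occurs at least twice in the corpus.
    counts = Counter(t[i:i + k] for t in texts for i in range(len(t) - k + 1))
    out = []
    for t in texts:
        hits = [i for i in range(len(t) - k + 1) if counts[t[i:i + k]] > 1]
        merged = []
        for i in hits:
            if merged and i <= merged[-1][1]:
                merged[-1][1] = i + k  # extend an overlapping or adjacent span
            else:
                merged.append([i, i + k])
        out.append([slice(a, b) for a, b in merged])
    return out


def cut(text, slices):
    # Remove sorted, non-overlapping slices by joining the kept segments.
    pos, kept = 0, []
    for s in slices:
        kept.append(text[pos:s.start])
        pos = s.stop
    kept.append(text[pos:])
    return "".join(kept)


def dedup_until_stable(texts, find_slices, max_rounds=10):
    # Re-run the detector on its own output until no slice is returned.
    for _ in range(max_rounds):
        slices = find_slices(texts)
        if not any(slices):
            break
        texts = [cut(t, s) for t, s in zip(texts, slices)]
    return texts


result = dedup_until_stable(
    ['abcdefgh . efabcdgh'],
    lambda ts: naive_duplicate_slices(ts, k=4),
)
print(result)  # [' . ']
```

With this stand-in, the first round removes both copies of `abcd`, the second removes both copies of `efgh`, and the third finds nothing, leaving only the period and its surrounding spaces. To try it against the real deduplicator, `find_slices` could be swapped for `deduplicator.fit_predict`, assuming it can be called repeatedly on new inputs as in the test above.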