ChenghaoMou/text-dedup

boundaries of sub-strings

MiladMolazadeh opened this issue · 2 comments

Hello!

I'm currently using the suffix array method on Persian-language text. In some examples the deduplication result is not ideal: the removed substrings begin or end in the middle of a word, so the remaining text is degraded and sometimes meaningless. How can I rectify this issue?

One example (translated to English):

ORIGINAL: According to BBC and quoted by Currency, the dollar to ruble rate increased by 0.32% to 55.19 rubles and the euro decreased by 0.36% to 56.09 rubles.

RESULT AFTER DEDUP: o ruble rate increased by 0.32% to 55.19 rubles and the euro decreased by 0.36% to 56.09 rubles.

This is somewhat expected behaviour from the algorithm: it matches duplicated substrings without regard to word boundaries. See #19. It will break the text flow, but that should be fine for a language modelling task on a large corpus.
