Additional filter suggestion: remove lines with repeated content
yvesscherrer opened this issue · 3 comments
Not sure how useful this is, but this is an idea that came to mind when filtering backtranslations.
Sentences like the following are probably low-quality and should be removed:
Ahora bien, el que quiera ser el primero entre ustedes deberá ser su servidor, diferentes plantas para ser un buen pescador y un buen pescador para ser un buen pescador y un buen pescador para ser un buen pescador y un buen pescador para ser un buen pescador
Parameters would be:
- minimum length of the repeated sequence (in characters or words)
- minimum number of repetitions
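To make the two parameters concrete, here is a minimal character-level sketch built on a regular-expression backreference. The function name `has_repeated_content` and its defaults are hypothetical and chosen only for illustration; it is one possible reading of the idea, not a finished filter.

```python
import re


def has_repeated_content(text, min_length=3, min_repetitions=2):
    """Return the repeated chunk, or None if no repetition is found.

    A chunk of at least `min_length` characters (starting with a
    non-space character) must be followed by at least `min_repetitions`
    further copies of itself, optionally separated by spaces.
    Both parameter names are assumptions made for this sketch.
    """
    if min_length < 1 or min_repetitions < 1:
        raise ValueError("min_length and min_repetitions must be >= 1")
    # (\S.{n,}?) captures the shortest chunk of at least n+1 characters;
    # (?: *\1){m,} requires m or more further copies of that chunk.
    pattern = re.compile(
        r"(\S.{%d,}?)(?: *\1){%d,}" % (min_length - 1, min_repetitions)
    )
    match = pattern.search(text)
    return match.group(1) if match else None
```

A line would then be dropped whenever the function returns something other than `None`. Counting the minimum length in words instead of characters would need a tokenized variant.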
I have made a prototype that seems to work well in practice:
def find_repeats(x, lengths_to_check=[1, 2, 3, 4], min_repeat_length=3):
    """
    Identifies repeated phrases, for use in identifying stuttering in NMT output

    Arguments:
    x (str, list(str), required) -- input to search for repeats in
    lengths_to_check (list(int), int, default [1,2,3,4]) -- lengths (in tokens) of the phrases to search for
    min_repeat_length (int, default 3) -- minimum number of times the repeat must occur

    Returns a dict describing the first match found, or None if there is none.
    """
    # Input validation
    if isinstance(x, str):
        # x is not tokenized; tokenize by whitespace
        x_split = x.split()
    elif isinstance(x, list):
        # x is already tokenized
        x_split = x
    else:
        raise TypeError("Input must be str or list(str)")
    # If lengths_to_check is an int, make it a list so it can be iterated over
    if isinstance(lengths_to_check, int):
        lengths_to_check = [lengths_to_check]
    # A phrase length below 1 would make the repeat loop below run forever
    if any(phrase_len < 1 for phrase_len in lengths_to_check):
        raise ValueError("All phrase lengths must be at least 1")
    # Loop over each token in the string, from left to right
    for ind in range(len(x_split)):
        # Check for phrase repeats of length phrase_len for each value in lengths_to_check
        for phrase_len in lengths_to_check:
            if ind + phrase_len < len(x_split):
                phrase = x_split[ind:(ind + phrase_len)]
                if phrase == x_split[(ind + phrase_len):(ind + 2 * phrase_len)]:
                    # We have a match - check to see how many times it repeats
                    num_repeats = 1
                    match_idx = ind + phrase_len
                    while phrase == x_split[match_idx:(match_idx + phrase_len)]:
                        num_repeats += 1
                        match_idx += phrase_len
                    # Return once a single match that is long enough is found; do not find all matches
                    if num_repeats >= min_repeat_length:
                        return {"match": ' '.join(phrase), "num_repeats": num_repeats, "repeat_length": phrase_len}
    # No matches were found
    return None
>>> find_repeats("Ahora bien, el que quiera ser el primero entre ustedes deberá ser su servidor, diferentes plantas para ser un buen pescador y un buen pescador para ser un buen pescador y un buen pescador para ser un buen pescador y un buen pescador para ser un buen pescador", lengths_to_check = [1,2,3,4,5,6,7,8,9,10])
{'match': 'para ser un buen pescador y un buen pescador',
 'num_repeats': 3,
 'repeat_length': 9}
Should I modify it to be an OpusFilter filter and submit a pull request?
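For discussion, here is a rough sketch of what such a wrapper might look like. The class name `RepetitionFilterSketch` and the split into `score` and `accept` methods are assumptions about how a filter of this kind could be organized. The class deliberately does not import or subclass anything from OpusFilter, so the real base class and signatures should be taken from the OpusFilter documentation.

```python
class RepetitionFilterSketch:
    """Hypothetical stand-alone filter; not OpusFilter's actual API."""

    def __init__(self, lengths_to_check=(1, 2, 3, 4), min_repetitions=3):
        self.lengths_to_check = list(lengths_to_check)
        if any(n < 1 for n in self.lengths_to_check):
            raise ValueError("All phrase lengths must be at least 1")
        self.min_repetitions = min_repetitions

    def _max_repeats(self, tokens):
        """Largest number of consecutive copies of any checked phrase."""
        best = 1 if tokens else 0
        for n in self.lengths_to_check:
            for start in range(len(tokens) - n + 1):
                phrase = tokens[start:start + n]
                count, pos = 1, start + n
                # Count back-to-back copies of the phrase
                while tokens[pos:pos + n] == phrase:
                    count += 1
                    pos += n
                best = max(best, count)
        return best

    def score(self, pairs):
        """Yield one repeat count per segment of each pair."""
        for pair in pairs:
            yield [self._max_repeats(segment.split()) for segment in pair]

    def accept(self, score):
        """Keep a pair only if no segment reaches the repetition threshold."""
        return all(count < self.min_repetitions for count in score)

    def filter(self, pairs):
        """Return the pairs that pass the filter."""
        pairs = list(pairs)
        return [p for p, s in zip(pairs, self.score(pairs)) if self.accept(s)]
```

Unlike the prototype, which returns on the first match, this sketch scores every segment with its maximum repeat count, so the threshold can be adjusted without rescanning the data.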
Thanks! I had some extra backtranslated data from Yves to test this, and it does seem to work nicely, at least in terms of precision (recall is of course harder to estimate). It found 4778 matches in 164725 segments.
So sure, go ahead and create a PR! (Some instructions here.) I can also help, but it is better that you make at least the first commit so that you get the credit 🙂