NnoeAlphaNumericFilter has impelemented wrong[BUG]
zahramahani opened this issue · 2 comments
zahramahani commented
Describe the bug
the score function of this filter should be
def score_document(self, text):
nchar = len(text)
if nchar > 0:
score = (nchar - len(" ".join(self._constants.regex_alphanum.findall(text)))) / nchar
else:
# Remove the document if it is empty
score = 1.0
return score
instead of
def score_document(self, text):
nchar = len(text)
if nchar > 0:
score = (nchar - len(self._constants.regex_alphanum.findall(text))) / nchar
else:
# Remove the document if it is empty
score = 1.0
return score
otherwise length of matched array will be considered instead of all matched characters
ayushdg commented
Thanks for raising the issue. @zahramahani.
@ryantwolf Does it make sense to change the filter behavior as mentioned above?
ryantwolf commented
I'm not sure I am following what your concern is @zahramahani . When I consider the following example, everything seems to be fine.
# For reference
class NonAlphaNumericFilter(DocumentFilter):
def __init__(self, max_non_alpha_numeric_to_text_ratio=0.25):
super().__init__()
self._cutoff = max_non_alpha_numeric_to_text_ratio
self._name = "alpha_numeric"
def score_document(self, text):
nchar = len(text)
if nchar > 0:
score = (nchar - len(regex_alphanum.findall(text))) / nchar
else:
# Remove the document if it is empty
score = 1.0
return score
regex_alphanum = re.compile("[a-zA-Z0-9\n?!,.]")
test_document = "This is a test case."
nchar = len(test_document)
matches = regex_alphanum.findall(text)
print(matches) # Outputs: ['T', 'h', 'i', 's', 'i', 's', 'a', 't', 'e', 's', 't', 'c', 'a', 's', 'e', '.']
score = (nchar - len(matches)) / nchar
print(score) # Outputs: 0.2
This all seems fine to me, meanwhile with your change the following would happen
regex_alphanum = re.compile("[a-zA-Z0-9\n?!,.]")
test_document = "This is a test case."
nchar = len(test_document)
matches = regex_alphanum.findall(text)
joined_matches = " ".join(matches)
print(joined_matches) # Outputs: 'T h i s i s a t e s t c a s e .'
score = (nchar - len(joined_matches)) / nchar
print(score) # Outputs: -0.55
This score seems to be incorrect too. Could you clarify what you think the bug is in the original implementation? Perhaps an example where it outputs the wrong score would be best.