NVIDIA/NeMo-Curator

NnoeAlphaNumericFilter has impelemented wrong[BUG]

zahramahani opened this issue · 2 comments

Describe the bug

the score function of this filter should be

    def score_document(self, text):
        nchar = len(text)
        if nchar > 0:
            score = (nchar - len(" ".join(self._constants.regex_alphanum.findall(text)))) / nchar
        else:
            # Remove the document if it is empty
            score = 1.0
        return score

instead of

  def score_document(self, text):
       nchar = len(text)
       if nchar > 0:
           score = (nchar - len(self._constants.regex_alphanum.findall(text))) / nchar
       else:
           # Remove the document if it is empty
           score = 1.0
       return score

otherwise length of matched array will be considered instead of all matched characters

Thanks for raising the issue. @zahramahani.
@ryantwolf Does it make sense to change the filter behavior as mentioned above?

I'm not sure I am following what your concern is @zahramahani . When I consider the following example, everything seems to be fine.

# For reference
class NonAlphaNumericFilter(DocumentFilter):

    def __init__(self, max_non_alpha_numeric_to_text_ratio=0.25):
        super().__init__()
        self._cutoff = max_non_alpha_numeric_to_text_ratio
        self._name = "alpha_numeric"

    def score_document(self, text):
        nchar = len(text)
        if nchar > 0:
            score = (nchar - len(regex_alphanum.findall(text))) / nchar
        else:
            # Remove the document if it is empty
            score = 1.0
        return score

regex_alphanum = re.compile("[a-zA-Z0-9\n?!,.]")
test_document = "This is a test case."
nchar = len(test_document)
matches = regex_alphanum.findall(text)
print(matches) # Outputs: ['T', 'h', 'i', 's', 'i', 's', 'a', 't', 'e', 's', 't', 'c', 'a', 's', 'e', '.']

score = (nchar - len(matches)) / nchar
print(score) # Outputs: 0.2

This all seems fine to me, meanwhile with your change the following would happen

regex_alphanum = re.compile("[a-zA-Z0-9\n?!,.]")
test_document = "This is a test case."
nchar = len(test_document)
matches = regex_alphanum.findall(text)
joined_matches = " ".join(matches)
print(joined_matches) # Outputs: 'T h i s i s a t e s t c a s e .'

score = (nchar - len(joined_matches)) / nchar
print(score) # Outputs: -0.55

This score seems to be incorrect too. Could you clarify what you think the bug is in the original implementation? Perhaps an example where it outputs the wrong score would be best.