microsoft/presidio

Context words with hash marks do not work as expected

claesmk opened this issue · 3 comments

Describe the bug
In the predefined US_SSN recognizer context, the ssn# and ss# context words do not work as expected because spaCy splits the hash mark off into a separate token:

[decision_process][INFO][None][nlp artifacts:{"entities": ["My ssn#"], "tokens": ["My", "ssn", "#", "is", "123", "-", "45", "-", "1234"], "lemmas": ["my", "ssn", "#", "be", "123", "-", "45", "-", "1234"], "tokens_indices": [0, 3, 6, 8, 11, 14, 15, 17, 18], "keywords": ["ssn", "123", "45", "1234"], "scores": [0.85]}]

ssn# works because it is tokenized as ["ssn", "#"] and ssn is also in the context list. However, ss# does not work, because it is tokenized as ["ss", "#"] and ss is not in the context list:

[decision_process][INFO][None][nlp artifacts:{"entities": [], "tokens": ["My", "ss", "#", "is", "123", "-", "45", "-", "1234"], "lemmas": ["my", "ss", "#", "be", "123", "-", "45", "-", "1234"], "tokens_indices": [0, 3, 5, 7, 10, 13, 14, 16, 17], "keywords": ["ss", "123", "45", "1234"], "scores": []}]
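The difference between the two logs can be illustrated with a minimal sketch in plain Python. It does not call Presidio or spaCy: the token lists are copied from the log output above, `CONTEXT` holds only the three context words named in this issue (the real list may contain more), and `matched_context` is a hypothetical helper that approximates context matching as a whole-token comparison.

```python
# Context words named in this issue (assumed subset of the recognizer's list).
CONTEXT = ["ssn", "ssn#", "ss#"]


def matched_context(tokens, context):
    """Return the context words that equal a whole token (case-insensitive).

    Simplified stand-in for token-based context matching; not Presidio's code.
    """
    lowered = {token.lower() for token in tokens}
    return [word for word in context if word.lower() in lowered]


# Token lists copied from the two log entries above.
ssn_tokens = ["My", "ssn", "#", "is", "123", "-", "45", "-", "1234"]
ss_tokens = ["My", "ss", "#", "is", "123", "-", "45", "-", "1234"]

print(matched_context(ssn_tokens, CONTEXT))  # ['ssn']
print(matched_context(ss_tokens, CONTEXT))   # []
```

In this model, neither ssn# nor ss# can ever match, because no token contains the hash mark. The ssn# input is only boosted through the separate ssn entry.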

To Reproduce
Steps to reproduce the behavior:

  1. Analyze a string containing ss# with the US_SSN entity
  2. See that ss# is not detected as context (the default score of 0.5 is not improved)
{'recognizer': 'UsSsnRecognizer', 'pattern_name': 'SSN5 (medium)', 'pattern': '\\b([0-9]{3})[- .]([0-9]{2})[- .]([0-9]{4})\\b'...edium)`', 'score_context_improvement': 0, 'supportive_context_word': '', 'validation_result': None, 'regex_flags': regex.I|M|S}

Expected behavior
All words in the predefined context should be detected correctly. If values with a hash mark cannot be detected correctly they should at least be removed from the list.

Screenshots
See logging and debugger output above

Additional context
n/a

Thanks, and apologies for the delay in answering. The context words are currently used by the existing context mechanism (LemmaContextAwareEnhancer), which matches them against tokens and lemmas. There are alternative implementations one can think of, for example comparing substrings of the raw text, where ssn# could be leveraged. If you or anyone else is interested in creating an alternative context approach, we'd be happy to review it and incorporate it into the package.
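As a rough illustration of the substring idea mentioned above, the sketch below searches the raw text before the match instead of comparing tokens. It is not an existing Presidio API: `substring_context`, its `window` parameter, and the 20-character default are all hypothetical, and only the three context words named in this issue are used.

```python
def substring_context(text, start, context, window=20):
    """Return context words found in the raw text just before the entity.

    `start` is the character offset of the matched entity; only the `window`
    characters preceding it are searched. Hypothetical helper, not Presidio code.
    """
    prefix = text[max(0, start - window):start].lower()
    return [word for word in context if word.lower() in prefix]


context = ["ssn", "ssn#", "ss#"]

text = "My ss# is 123-45-1234"
start = text.index("123")  # offset where the SSN pattern would match
print(substring_context(text, start, context))  # ['ss#']
```

Because the comparison runs on the untokenized string, the hash mark stays attached to the word. The trade-off is that plain substring matching can also fire inside longer unrelated words, so a real implementation would probably need word-boundary handling.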

@omri374 that's fine - however, in the meantime, do you think it makes sense to remove ss# and ssn# from the default CONTEXT, since they don't actually work?
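One way to find entries like these is to check which context words would survive tokenization as a single token. The sketch below uses a crude regex as a stand-in for the tokenizer (an assumption: spaCy's actual rules are more involved), and `split_like_tokenizer` and `prune_context` are hypothetical helpers, not part of Presidio.

```python
import re


def split_like_tokenizer(word):
    """Crude tokenizer stand-in: alphanumeric runs stay together,
    every other non-space character becomes its own token."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", word)


def prune_context(context):
    """Keep only the context words that remain a single token."""
    return [word for word in context if split_like_tokenizer(word) == [word]]


print(split_like_tokenizer("ss#"))            # ['ss', '#']
print(prune_context(["ssn", "ssn#", "ss#"]))  # ['ssn']
```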

Sure, that makes sense. Would you like to create a PR?