microsoft/presidio

Context words are used outside the suffix/prefix window

omri374 opened this issue · 5 comments

I'm new to Presidio (started working with the code yesterday) and I can't figure out why I'm getting the results I am. Code is below. The analyzer doesn't seem to recognize "cents" as a context word; however, if I change it to 'cent', everything works fine. But that brings up another question: if it's basing the suffix count on "dollars", why is 'Six' (in Sixty) tagged? I assume I'm misunderstanding something. Any help would be appreciated.

from presidio_analyzer import (
    AnalyzerEngine,
    PatternRecognizer,
    Pattern,
)
from presidio_analyzer.recognizer_registry import RecognizerRegistry
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer

text = "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents?"

regex = r"(zero|one|two|three|four|five|six|seven|eight|nine)"
currency_pattern = Pattern(name="currency_pattern (strong)", regex=regex, score=.01)

currency_recognizer_with_context = PatternRecognizer(
    supported_entity='CURRENCY',
    patterns=[currency_pattern],
    context=[
        'dollars',
        'cents',
    ]
)

context_aware_enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=1, 
    min_score_with_context_similarity=1,
    context_prefix_count=0,
    context_suffix_count=6,
)

registry = RecognizerRegistry()
registry.add_recognizer(currency_recognizer_with_context)
analyzer = AnalyzerEngine(registry=registry, context_aware_enhancer=context_aware_enhancer)

res = analyzer.analyze(text=text, language='en')
print(res)

Output:
[type: CURRENCY, start: 41, end: 45, score: 1, type: CURRENCY, start: 61, end: 65, score: 1, type: CURRENCY, start: 78, end: 81, score: 1, type: CURRENCY, start: 84, end: 89, score: 0.01]

Originally posted by @mmoody-vv in #1443

This looks like a bug.

To reproduce:

res = analyzer.analyze(text=text, language='en', return_decision_process=True)

for ress in res:
    print()
    print(
        f"text: {text[ress.start:ress.end]},"
        f"\nentity: {ress.entity_type}, "
        f"\nscore before: {ress.analysis_explanation.original_score}"
        f"\nscore context improvement: {ress.analysis_explanation.score_context_improvement}"
        f"\nsupporting context word: {ress.analysis_explanation.supportive_context_word}"
    )
text: Five,
entity: CURRENCY, 
score before: 0.01
score context improvement: 0.99
supporting context word: dollars

text: Nine,
entity: CURRENCY, 
score before: 0.01
score context improvement: 0.99
supporting context word: dollars

text: Six,
entity: CURRENCY, 
score before: 0.01
score context improvement: 0.99
supporting context word: dollars

text: Seven,
entity: CURRENCY, 
score before: 0.01
score context improvement: 0
supporting context word: 

Looks like this might be due to the model's part-of-speech tagging rather than a Presidio bug.

The above example uses the default spaCy NLP model en_core_web_lg with the text Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents.

  • When Dollars has a capital D, it is categorised as a Proper Noun and its lemma is "Dollars"
  • When dollars has a lowercase d, it is categorised as a Noun and its lemma is "dollar"
  • Due to the position of cents in the sentence, it is always categorised as a Noun with a lemma of "cent", whether or not the C is capitalised

This can be seen with the following code:

import spacy

nlp = spacy.load("en_core_web_lg")

texts = [
"Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents",
"Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven Cents",
"Will you be paying the entire balance of Five Hundred Thirty-Nine dollars and Sixty-Seven cents",
"Will you be paying the entire balance of Five Hundred Thirty-Nine dollars and Sixty-Seven Cents",
"Will you be paying the entire balance of Sixty-Seven Cents and Five Hundred Thirty-Nine dollars",
]

for text in texts:
    print("\n", text)

    print(
        "Text:",  # Text: The original word text.
        "Lemma:",  # Lemma: The base form of the word.
        "POS:",  # POS: The simple universal part-of-speech tag.
        "Tag:",  # Tag: The detailed part-of-speech tag.
        "Alpha:",  # Alpha: Is the token an alpha character?
        "Stop:",  # Stop: Is the token part of a stop list, i.e. the most common words of the language?
        sep="\t"
    )
    for token in nlp(text):
        print(token.text, token.lemma_, token.pos_, token.tag_, token.is_alpha, token.is_stop, sep="\t")

Interestingly, if the en_core_web_sm model is used, then dollars is always categorised as a noun. So @mmoody-vv you could look at using this model, as sketched below.
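
Here's a minimal sketch of pointing the analyzer at en_core_web_sm via NlpEngineProvider (assuming the model has been downloaded with python -m spacy download en_core_web_sm, and reusing the registry and context_aware_enhancer from the snippet above):

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Configure the analyzer to use the smaller spaCy model instead of the default en_core_web_lg
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
}
nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()

analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    registry=registry,
    context_aware_enhancer=context_aware_enhancer,
)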

As the LemmaContextAwareEnhancer compares the context words to lemmas rather than the actual words in the text, I think using the singular form of the words is best. I can't see this anywhere in the docs, happy to add it if this is correct @omri374? So in this case, using "dollar" and "cent" should give you the behavior you're expecting @mmoody-vv (see the sketch below).
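
For example, a minimal sketch of the same recognizer with singular context words (reusing currency_pattern from the original snippet):

currency_recognizer_with_context = PatternRecognizer(
    supported_entity='CURRENCY',
    patterns=[currency_pattern],
    context=['dollar', 'cent'],  # singular forms match the spaCy lemmas
)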

@hhobson thanks for this analysis! I found it surprising that the lemma of Dollars is Dollars; this could be what's causing the issue. Based on your analysis, it seems that a fix would be to lowercase the token prior to lemmatizing it, but that's not straightforward: spaCy runs lemmatization and NER together, and we wouldn't want to pass in a lowercased sentence as it would affect NER.
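
For illustration, a quick way to see that tension (assuming the same en_core_web_lg model as above) is to compare the lemmas and spaCy's NER output for the original sentence and a lowercased copy:

import spacy

nlp = spacy.load("en_core_web_lg")
text = "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents?"

for variant in (text, text.lower()):
    doc = nlp(variant)
    # The lemma of the "dollars" token differs between the two variants ...
    print([(t.text, t.lemma_) for t in doc if t.lower_ == "dollars"])
    # ... but lowercasing the whole sentence can also change what the NER component finds
    print([(ent.text, ent.label_) for ent in doc.ents])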

I agree, lowercasing the text doesn't feel like the right thing to do, especially as the different-sized spaCy models behaved differently in this case, so things might change in future versions.

I think the best approach is to recommend using singular-form context words, like dollar rather than dollars. When I tested this, it produced the expected behavior of boosting the score.

Would that solve the problem if the sentence has uppercase plurals to begin with? We would end up comparing dollar with Dollars.