microsoft/presidio

Configure AnalyzerEngine from file

omri374 opened this issue · 2 comments

Is your feature request related to a problem? Please describe.
In many use cases, especially around the Docker-based option, it is challenging to configure the AnalyzerEngine for a specific scenario. For example, to have an API that supports multiple languages, the code in app.py has to be changed:

self.engine = AnalyzerEngine()

Having a way to configure the initial parameters (languages, NLP engine, recognizers, default score threshold, etc.) would allow code-free configuration both for Docker-based use cases and for a more configurable Python pipeline.

Describe the solution you'd like

  • Have a configuration file with all the parameters, and a utility that reads it and creates the custom AnalyzerEngine instance (see the sketch after this list).
  • Add the ability to read this conf file in the Dockerfile.
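
A minimal sketch of what such a utility could look like, assuming a hypothetical YAML schema with supported_languages, default_score_threshold and an nlp_engine section in the existing NlpEngineProvider format; the field names are illustrative, not an existing Presidio format:

import yaml

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider


def analyzer_from_conf(conf_file: str) -> AnalyzerEngine:
    """Create an AnalyzerEngine from a YAML conf file (illustrative schema)."""
    with open(conf_file) as f:
        conf = yaml.safe_load(f)

    # The nlp_engine section reuses the existing NlpEngineProvider dict format.
    nlp_engine = NlpEngineProvider(nlp_configuration=conf["nlp_engine"]).create_engine()

    return AnalyzerEngine(
        nlp_engine=nlp_engine,
        supported_languages=conf.get("supported_languages", ["en"]),
        default_score_threshold=conf.get("default_score_threshold", 0),
    )

In the Docker-based flow, the path to such a file could be passed via an environment variable or mounted into the container, so no code change would be needed.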

Describe alternatives you've considered
An alternative would be to document how to change app.py, but code would still have to be changed.

Additional context
Presidio already has several conf files, e.g.:

Hey! Thanks for this PR.

Can I use it to load another transformer model? Like this one: https://huggingface.co/Jean-Baptiste/camembert-ner

I was thinking about using a YAML conf file like this:

nlp_engine_name: transformers
models:
  -
    lang_code: fr
    model_name:
      spacy: fr_core_news_lg
      transformers: Jean-Baptiste/camembert-ner

ner_model_configuration:
  labels_to_ignore:
  - O
  aggregation_strategy: simple # "simple", "first", "average", "max"
  stride: 16
  alignment_mode: strict # "strict", "contract", "expand"
  model_to_presidio_entity_mapping:
    PER: PERSON
    LOC: LOCATION
    ORG: ORGANIZATION
    AGE: AGE
    ID: ID
    EMAIL: EMAIL
    PATIENT: PERSON
    STAFF: PERSON
    HOSP: ORGANIZATION
    PATORG: ORGANIZATION
    DATE: DATE_TIME
    PHONE: PHONE_NUMBER
    HCW: PERSON
    HOSPITAL: ORGANIZATION

  low_confidence_score_multiplier: 0.4
  low_score_entity_names:
  - ID

Can this work? Without your PR, the fr language never seems to be available.

Thanks.

Hi @GautierT, are you looking to run this through a REST API?
If not, you can configure your model using the standard NlpEngineProvider logic; for example, see this documentation.
If so, the only additional change needed is in app.py, to pass the NlpEngine into the AnalyzerEngine. Instead of this:

self.engine = AnalyzerEngine()

Have this:

import logging
import os
from logging.config import fileConfig
from pathlib import Path

from flask import Flask
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# LOGGING_CONF_FILE and WELCOME_MESSAGE are existing app.py constants;
# PATH_TO_CONF is a placeholder for the path to the NLP engine conf file.


class Server:
    """HTTP Server for calling Presidio Analyzer."""

    def __init__(self):
        fileConfig(Path(Path(__file__).parent, LOGGING_CONF_FILE))
        self.logger = logging.getLogger("presidio-analyzer")
        self.logger.setLevel(os.environ.get("LOG_LEVEL", self.logger.level))
        self.app = Flask(__name__)
        self.logger.info("Starting analyzer engine")

        # Build the NLP engine from the conf file and pass it into the AnalyzerEngine
        provider = NlpEngineProvider(conf_file=PATH_TO_CONF)
        nlp_engine = provider.create_engine()
        self.engine = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["fr"])
        self.logger.info(WELCOME_MESSAGE)
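
For the non-REST path mentioned above, a minimal sketch along the same lines, assuming the YAML config from this thread is saved as languages-config.yml (an illustrative path) and the fr models it references are installed:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# languages-config.yml is an illustrative path for the conf shown earlier in the thread.
provider = NlpEngineProvider(conf_file="languages-config.yml")
nlp_engine = provider.create_engine()

analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["fr"])

results = analyzer.analyze(text="Je m'appelle Jean et j'habite à Paris.", language="fr")
for result in results:
    print(result)

This is essentially what the app.py change above does, minus the Flask server around it.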