synalp/jtrans

Anonymize "Pierre"

Closed this issue · 2 comments

If I understand correctly, anonymization is driven by a vocabulary of proper nouns: every occurrence of those words in the text is anonymized. It is not (?) done by marking each individual occurrence to anonymize. Also, everything is converted to lowercase when detecting the proper nouns. The issues I see:

  • Detection of occurrences to anonymize depends on the tokenization, so it is not guaranteed to be 100% correct.
  • "Sylvain Pierre" (ok to anonymize) vs. "la pierre" ("the stone") vs. "Jésus dit à Pierre" ("Jesus says to Peter") (not ok)
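
To make the problem concrete, here is a minimal Python sketch of vocabulary-based anonymization as I understand it. It is not jtrans code: the function name, the regex tokenizer and the "XXX" placeholder are all assumptions made for illustration.

```python
import re

def anonymize_by_vocabulary(text, proper_nouns, placeholder="XXX"):
    """Replace every token whose lowercase form is in the vocabulary.

    Hypothetical helper, not the jtrans implementation.
    """
    # Lowercasing the vocabulary mirrors the behaviour described above.
    vocab = {word.lower() for word in proper_nouns}
    # Naive tokenizer (an assumption): alternating runs of word
    # characters and non-word characters, so joining restores the text.
    tokens = re.findall(r"\w+|\W+", text)
    return "".join(placeholder if tok.lower() in vocab else tok
                   for tok in tokens)

vocab = {"Pierre"}
print(anonymize_by_vocabulary("Sylvain Pierre", vocab))      # Sylvain XXX
print(anonymize_by_vocabulary("la pierre", vocab))           # la XXX
print(anonymize_by_vocabulary("Jésus dit à Pierre", vocab))  # Jésus dit à XXX
```

With this approach the common noun "pierre" and any other "Pierre" are masked as well, because the match ignores both case and context.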

There are two kinds of anonymization:

  1. "Find-and-replace"-style anonymization (e.g. Pierre), used when the markup does not define which words should be anonymized.
  2. Words explicitly marked up for anonymization with a special syntax, e.g. *Machin*; this is how it's done in CRFP/CORALROM.
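
The second kind could be sketched roughly as follows. This is only an illustration: the regex assumes a marked word is a single token wrapped in asterisks, which may not match the exact CRFP/CORALROM conventions, and the function name and "XXX" placeholder are hypothetical.

```python
import re

# Assumed markup: one word with no spaces, wrapped in asterisks.
MARKED_WORD = re.compile(r"\*([^*\s]+)\*")

def anonymize_marked(text, placeholder="XXX"):
    """Replace only the words explicitly marked as *word*.

    Hypothetical helper, not the jtrans implementation.
    """
    # A function replacement avoids any escape processing
    # of the placeholder string.
    return MARKED_WORD.sub(lambda match: placeholder, text)

print(anonymize_marked("Jésus dit à Pierre, *Machin* arrive"))
# Jésus dit à Pierre, XXX arrive
```

Here only the marked occurrence is masked; the unmarked "Pierre" is left alone, since the transcriber decides occurrence by occurrence.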

OK, occurrences can be anonymized individually, so the issue is closed. Thanks!