Anonymize "Pierre"
Closed this issue · 2 comments
cerisara commented
If I understand correctly, anonymization relies on a vocabulary of proper nouns: every occurrence of a vocabulary word in the text is anonymized. It is not (?) done by identifying each individual occurrence of a word to anonymize. Also, everything is converted to lowercase when detecting the proper nouns. The issues I see:
- Detection of occurrences to anonymize depends on the tokenization, so it is not guaranteed to be 100% correct.
- "Sylvain Pierre" (ok) vs. "la pierre" vs. "Jésus dit à Pierre" (not ok)
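To make the concern concrete, here is a minimal sketch of vocabulary-based, lowercase matching. It is not the project's actual code: the function name `anonymize_by_vocabulary`, the `VOCABULARY` set, the `\w+` tokenizer, and the `*` mask are all assumptions made for illustration.

```python
import re

# Hypothetical vocabulary of proper nouns, stored in lowercase.
VOCABULARY = {"pierre", "sylvain"}


def anonymize_by_vocabulary(text, vocabulary=VOCABULARY, mask="*"):
    """Mask every token whose lowercase form appears in the vocabulary."""

    def replace(match):
        token = match.group(0)
        # Case is ignored, so "Pierre" and "pierre" are treated alike.
        return mask * len(token) if token.lower() in vocabulary else token

    # \w+ is a deliberately naive tokenizer (an assumption);
    # a different tokenizer may split the text differently.
    return re.sub(r"\w+", replace, text)


for sentence in ["Sylvain Pierre", "la pierre", "Jésus dit à Pierre"]:
    print(anonymize_by_vocabulary(sentence))
```

Under these assumptions all three examples are masked in the same way, because the lookup sees only the lowercased token and has no way to tell one occurrence from another.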
jorio commented
There are two kinds of anonymization:
- "Find-and-replace"-style anonymization (e.g. Pierre) when the markup does not define which words should be anonymized.
- Words specifically marked up to be anonymized with a special syntax, e.g. *Machin* (this is how it's done in CRFP/CORALROM).
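As a rough illustration of the second kind, the sketch below anonymizes only the occurrences that carry the markup. It assumes the markers are literal asterisks around a single word, as in the example above; the function name `anonymize_marked` and the `[ANON]` placeholder are hypothetical and do not come from the actual implementation.

```python
import re


def anonymize_marked(text, placeholder="[ANON]"):
    """Replace only words explicitly wrapped in asterisks, e.g. *Machin*."""
    # Assumption: the markup is exactly one word between two literal asterisks.
    return re.sub(r"\*\w+\*", placeholder, text)


print(anonymize_marked("Jésus dit à Pierre, et *Pierre* répond"))
```

With this approach an unmarked "Pierre" is left untouched, so each occurrence can be handled individually.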
cerisara commented
OK, occurrences can be anonymized individually, so the issue is closed. Thanks!