realrolfje/anonimatron

Feature request: Non-consistent anonymization for some fields

Closed this issue · 3 comments

Some big datasets contain free-text description fields that may hold personal data. Because almost every description value is unique, these fields produce a very large synonym.xml file.

In some cases, it is not important for these fields to be anonymized consistently between runs. It would be nice to be able to anonymize such a field without generating a synonym, so that the synonym file does not fill up with data that must be masked but is never used in a test or does not need to be consistent between tables, files or runs.

In one particular case, a synonym file of over 400 MB is read and written, and reading and storing the file takes a lot of time and CPU (base64 encoding and XML parsing).

We have an XMLAnonymizer that parses some configurable fields by extending the config schema, and yes, that data is generally too big to be persisted in the synonym file.
This can most probably be solved by a Synonym that keeps the from field static, like NullSynonym.
I am trying to find time to share this working code with you shortly, as well as some other general Anonymizers.

Please do! Meanwhile, I have changed the configuration so that there is an optional attribute that marks a column as short-lived, or "transient".
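
For readers following along, here is a minimal Java sketch of the idea behind a short-lived column. It is not Anonimatron's actual code: the class `TransientSynonymSketch`, its methods, and the `isTransient` flag are all hypothetical names made up for illustration. It only shows the behaviour requested above, assuming that values in a transient column are still replaced but their synonyms are never stored, so they cannot bloat the synonym file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch, not part of Anonimatron: illustrates skipping
// synonym storage for columns flagged as transient (short-lived).
class TransientSynonymSketch {

    // column name -> (original value -> synonym); this is what would be persisted.
    private final Map<String, Map<String, String>> persisted = new HashMap<>();

    // Returns a replacement for 'original'. Non-transient columns get a
    // stable synonym that is remembered; transient columns get a fresh
    // replacement on every call and nothing is remembered.
    String anonymize(String column, String original, boolean isTransient) {
        if (isTransient) {
            return randomText();
        }
        return persisted
                .computeIfAbsent(column, c -> new HashMap<>())
                .computeIfAbsent(original, o -> randomText());
    }

    // Number of synonyms that would end up in the synonym file.
    int persistedSize() {
        int total = 0;
        for (Map<String, String> perColumn : persisted.values()) {
            total += perColumn.size();
        }
        return total;
    }

    // Stand-in for a real anonymizer; a random UUID is used purely for illustration.
    private static String randomText() {
        return UUID.randomUUID().toString();
    }
}
```

In this sketch, calling `anonymize("NAME", "Alice", false)` twice returns the same synonym and stores one entry, while calling `anonymize("DESCRIPTION", text, true)` masks the text without storing anything. The trade-off is the one described in the request: transient values are no longer consistent between tables, files or runs.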

Released in version 1.10.0