elgentos/masquerade

Delete data and anonymize the remaining records

Tjitse-E opened this issue · 8 comments

The idea is that we delete all of the older customer data (for example, customers created more than 30 days ago), so that the DB dump is a lot smaller and Masquerade runs faster. The remaining data should be anonymized so we can use it anywhere.

Example config:

  customer_grid_flat:
    provider:
      delete: true
      where: "`created_at` < now() - interval 30 day"
    columns:
      name:
        formatter: name
      email:
        formatter: email
        unique: true
        nullColumnBeforeRun: true
      dob:
        formatter: dateTimeThisCentury
        optional: true
      billing_full:
         ....

Currently, Masquerade just executes the delete and then moves on to the next table, leaving the remaining records in the table un-anonymized. Very logical, but it would be nice to have the possibility to delete AND anonymize.

What would be the best place to implement this feature?

Maybe @johnorourke might have an idea about this, since he built the delete part?

@peterjaap The original design for that was "you can either delete or anonymize, not both", but this is a good idea. We have several possible requirements:

  • delete a selection of records
  • anonymize a selection of records
  • both (perhaps with different 'where' statements)
  • none

So for maximum flexibility maybe we need to just allow different 'where' statements for anonymisation and deletion. However, delete: true previously switched off anonymisation!

Perhaps this approach:

  • delete_where to specify the records to be deleted
  • anonymize_where to specify the records to be anonymized
  • where would fill in both of those - which keeps backwards compatibility
  • The system would run the delete first (if delete:true), then the anonymize - exactly as it does now.
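
As a rough sketch of how the proposal above might look in a config file: the delete_where and anonymize_where keys are only the names suggested in this thread, and placing them under provider (next to the existing where) is an assumption, not a confirmed implementation.

```yaml
# Hypothetical config using the keys proposed in this thread.
# Not confirmed to be supported by any released version of Masquerade.
customer_grid_flat:
  provider:
    delete: true
    # Rows matching this clause would be deleted first.
    delete_where: "`created_at` < now() - interval 30 day"
    # Rows matching this clause would then be anonymized.
    anonymize_where: "`created_at` >= now() - interval 30 day"
  columns:
    name:
      formatter: name
    email:
      formatter: email
      unique: true
```

With only where set, both clauses would fall back to it, so existing configs would keep behaving as they do today.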

@IvanChepurnyi I can see your work on the DataProvider system, so it would be good to get your input on this. Should we drop backward compatibility and go for a generic "actions" config, instead of using implicit actions? It's a balance between easy config with "sensible defaults", the learning curve for new users, and reducing unexpected behaviour.

@johnorourke I'm currently using delete_where in our builds (master...Tjitse-E:feature/partial-delete). The only problem is that it is not backwards compatible, but this could be solved (if needed) by keeping where.

Adding both delete_where and anonymize_where seems like a good idea.

@johnorourke I like your approach. If where is used for both delete and anonymize, it won't break existing behavior: the anonymization step will simply match 0 rows, because those rows were already deleted.

There is probably an opportunity to hide this logic behind the TableConfigution class as checks for provider/where become quite complex. I will work on this issue next week.

Watching this, as I'm also interested in this feature. Until then, is it possible to run masquerade twice with two different configs?

I'm thinking I can run the anonymization, then export for a full anonymized backup.
Then come back and run the delete on the same DB, then export a thin backup.

The only problem is that I need two different config file setups for this, correct? I guess I could run two different phars, each with its own config, but that doesn't seem very elegant.

@SAN1TAR1UM the --config parameter (which gives it a directory of config yml files) can be used multiple times, so you can use the same phar but just add an extra set of configs for one of the runs.
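
For the two-run workflow described above, the extra config directory could hold something like the fragment below. This is only a sketch: the file and its location are hypothetical, the nesting mirrors the example at the top of this issue, and how entries from several --config directories are combined is not covered in this thread.

```yaml
# Hypothetical file in an extra config directory that is passed through an
# additional --config option on the second (thin backup) run only.
# Uses the existing delete + where behaviour described in this issue.
customer_grid_flat:
  provider:
    delete: true
    where: "`created_at` < now() - interval 30 day"
```

The first run would use only the regular anonymization configs; the second run would add this directory so the older rows are removed before the thin export.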