Performance optimizations: bulk anonymisation
During testing of bulk anonymisation, there seem to be a few areas where performance can be optimized (although there may be correctness / auditing tradeoffs for some of these).
I'll try to provide some supporting statistics on each of these soon - but as a rough preface, I've been aiming to bring a ~12-hour estimated bulk anonymisation down to less than 3 hours (and ideally reduce it further than that).
Modifications applied so far towards this goal have included:
- Providing `for_bulk=True` as an argument to the `anonymise` method (nb: reduces audit logging)
- Setting the `force=True` argument to the `anonymise` method and flipping the order of the `self.is_anonymised() and not force` conditionals -- so that no DB `exists()` query is made when force mode is enabled (nb: does this risk introducing incorrect/circular anonymisation?) -- see the sketch after this list
- Optimizing the anonymiser `__getattr__` implementation by using dictionary lookups rather than list iterations to retrieve anonymisers (nb: no evidence of improvements here, yet)
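A minimal sketch of the second and third changes, assuming names that mirror django-gdpr-assist's `anonymise()`/`is_anonymised()`; the method bodies are simplified stand-ins, not the library's actual implementation:

```python
# Illustrative sketch only: the names mirror django-gdpr-assist's API,
# but the bodies are simplified stand-ins, not the library's real code.

class AnonymisableRecord:
    def is_anonymised(self):
        # In the real library this costs a DB exists() query against
        # the PrivacyAnonymised table.
        raise NotImplementedError

    def anonymise(self, force=False, for_bulk=False):
        # Checking `force` first lets `and` short-circuit, so in force
        # mode the per-record exists() query is never issued.
        if not force and self.is_anonymised():
            return
        # ... anonymise the fields; for_bulk=True reduces audit logging ...


class Anonymiser:
    def __init__(self, field_anonymisers):
        # Build the name -> anonymiser mapping once, so each attribute
        # lookup is an O(1) dict access rather than an O(n) list scan.
        self._by_name = {a.field_name: a for a in field_anonymisers}

    def __getattr__(self, name):
        # Going via __dict__ avoids recursing back into __getattr__.
        try:
            return self.__dict__["_by_name"][name]
        except KeyError:
            raise AttributeError(name)
```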
Hi @jayaddison-collabora. Thanks for the PRs relating to this, I've merged down #43 and #45 and added a note to #46.
Going to leave this issue open for more investigation. I'd be interested to know what kind of numbers you were looking at so we could do some benchmarking. There are probably some more improvements we could make to the management command, depending on the situation.
Thanks
James
In relation to this, we've added some small performance improvements to the latest release.
Firstly, adding records to the log table is now bulked; previously only the `PrivacyAnonymised` objects were bulked when the bulk argument was used. The anonymised objects themselves are still not saved in bulk, so any signals outside of gdpr-assist are respected; however, we could look at allowing users to control this via a setting so that gdpr-assist as a whole acts in bulk.
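As a hedged illustration of that tradeoff (the `AuditLog` model below is a hypothetical stand-in, not part of gdpr-assist): Django's `bulk_create()` issues a single batched INSERT but does not call `save()` or send `pre_save`/`post_save` signals, whereas per-object creation does:

```python
from django.db import models


class AuditLog(models.Model):  # hypothetical stand-in for the log model
    message = models.TextField()


def write_logs_bulk(messages):
    # One INSERT for the whole batch; save() is not called and
    # pre_save/post_save signals are not sent.
    AuditLog.objects.bulk_create(AuditLog(message=m) for m in messages)


def write_logs_individually(messages):
    # N queries, but post_save fires for each object, so third-party
    # signal handlers are respected.
    for m in messages:
        AuditLog.objects.create(message=m)
```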
Secondly, for the purpose of bulk anonymisation we've also added the option to defer/disable the creation of log-table records via `GDPR_LOG_ON_ANONYMISE` (https://django-gdpr-assist.readthedocs.io/en/latest/installation.html#gdpr-log-on-anonymise-true), to give the user control of when this happens; e.g. the `post_anonymise` signal could be used to defer this to Celery, or it could be batched later by processing the values in `PrivacyAnonymised`.
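A minimal sketch of that deferral pattern, assuming `GDPR_LOG_ON_ANONYMISE = False` in settings; the signal's import path and keyword arguments, and the Celery task, are assumptions to check against the docs:

```python
from celery import shared_task
from django.dispatch import receiver

from gdpr_assist.signals import post_anonymise  # assumed import path


@shared_task
def write_anonymisation_log(model_label, pk):
    # Hypothetical task: record the anonymisation event later, off the
    # hot path of the bulk run (or batch it for later processing).
    ...


@receiver(post_anonymise)
def queue_anonymisation_log(sender, instance, **kwargs):
    # Hand the log write to Celery instead of blocking the bulk loop.
    write_anonymisation_log.delay(instance._meta.label, instance.pk)
```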
Thanks @jamesoutterside - just (belatedly) acknowledging your comments here. I hope to have a bit more of a look at this soon. If I remember correctly, I kept a note of some of the anonymisation throughput/benchmark figures while working on the initial pull requests, so there may be some data near-ready to provide.
Some references here:
Analysis / Benchmarking
- https://gitlab.collabora.com/tools/chronophage/-/merge_requests/815#note_93443
  - adding anonymisation of a string field on a model with ~650k records resulted in a 12h bulk anonymisation duration
  - commentary and explanation of the changes applied to bring the bulk anonymisation duration down to 3h
Deployment
- https://gitlab.collabora.com/tools/chronophage/-/merge_requests/820
  - from: https://github.com/jayaddison-collabora/django-gdpr-assist.git@1baf994e21575074d0d9b03afbc07236d8c88061
  - to: https://github.com/jayaddison-collabora/django-gdpr-assist.git@77838823cd3c4ac221b428a0e3b093a20e848f19
  - results: bulk anonymisation duration reduced from 12h to 3h