Performance optimizations: bulk anonymisation
During testing of bulk anonymisation, there seem to be a few areas where performance can be optimized (although there may be correctness / auditing tradeoffs for some of these).
I'll try to provide some supporting statistics on each of these soon - but as a rough preface, I've been aiming to bring a ~12-hour estimated bulk anonymisation down to less than 3 hours (and ideally reduce it further than that).
Modifications applied so far towards this goal have included:
- Providing `for_bulk=True` as an argument to the `anonymise` method (nb: reduces audit logging)
- Setting the `force=True` argument to the `anonymise` method and flipping the order of the `self.is_anonymised() and not force` conditionals -- so that no DB `exists()` query is made when force mode is enabled (nb: does this risk introducing incorrect/circular anonymisation?) -- see the sketch after this list
- Optimizing the anonymiser `__getattr__` implementation by using dictionary lookups rather than list iterations to retrieve anonymisers (nb: no evidence of improvements here, yet)
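A minimal sketch of the second and third changes, assuming names that mirror django-gdpr-assist's `anonymise()`/`is_anonymised()`; the method bodies are simplified stand-ins, not the library's actual implementation:

```python
# Illustrative sketch only: the names mirror django-gdpr-assist's API,
# but the bodies are simplified stand-ins, not the library's real code.

class AnonymisableRecord:
    def is_anonymised(self):
        # In the real library this costs a DB exists() query against
        # the PrivacyAnonymised table.
        raise NotImplementedError

    def anonymise(self, force=False, for_bulk=False):
        # Checking `force` first lets `and` short-circuit, so in force
        # mode the per-record exists() query is never issued.
        if not force and self.is_anonymised():
            return
        # ... anonymise the fields; for_bulk=True reduces audit logging ...


class Anonymiser:
    def __init__(self, field_anonymisers):
        # Build the name -> anonymiser mapping once, so each attribute
        # lookup is an O(1) dict access rather than an O(n) list scan.
        self._by_name = {a.field_name: a for a in field_anonymisers}

    def __getattr__(self, name):
        # Going via __dict__ avoids recursing back into __getattr__.
        try:
            return self.__dict__["_by_name"][name]
        except KeyError:
            raise AttributeError(name)
```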
Hi @jayaddison-collabora. Thanks for the PRs relating to this, I've merged down #43 and #45 and added a note to #46.
Going to leave this issue open for more investigation. I'd be interested to know what kind of numbers you were looking at so we could do some benchmarking. There are probably some more improvements we could make to the management command, depending on the situation.
Thanks
James
In relation to this, we've added some small performance improvements to the latest release.
Firstly, adding records to the log table is now bulked; previously only the `PrivacyAnonymised` objects were bulked when the bulk argument was used. The anonymised objects themselves are still not saved in bulk, so any signals outside of gdpr-assist are respected; however, we could look at allowing users to control this via a setting so that gdpr-assist as a whole acts in bulk.
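As a hedged illustration of that tradeoff (the `AuditLog` model below is a hypothetical stand-in, not part of gdpr-assist): Django's `bulk_create()` issues a single batched INSERT but does not call `save()` or send `pre_save`/`post_save` signals, whereas per-object creation does:

```python
from django.db import models


class AuditLog(models.Model):  # hypothetical stand-in for the log model
    message = models.TextField()


def write_logs_bulk(messages):
    # One INSERT for the whole batch; save() is not called and
    # pre_save/post_save signals are not sent.
    AuditLog.objects.bulk_create(AuditLog(message=m) for m in messages)


def write_logs_individually(messages):
    # N queries, but post_save fires for each object, so third-party
    # signal handlers are respected.
    for m in messages:
        AuditLog.objects.create(message=m)
```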
Secondly, for the purpose of bulk anonymisation we've also added the option to defer/disable the creation of log-table records via `GDPR_LOG_ON_ANONYMISE` (https://django-gdpr-assist.readthedocs.io/en/latest/installation.html#gdpr-log-on-anonymise-true), to give the user control of when this happens; e.g. the `post_anonymise` signal could be used to defer this to Celery, or it could be batched later by processing the values in `PrivacyAnonymised`.
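A minimal sketch of that deferral pattern, assuming `GDPR_LOG_ON_ANONYMISE = False` in settings; the signal's import path and keyword arguments, and the Celery task, are assumptions to check against the docs:

```python
from celery import shared_task
from django.dispatch import receiver

from gdpr_assist.signals import post_anonymise  # assumed import path


@shared_task
def write_anonymisation_log(model_label, pk):
    # Hypothetical task: record the anonymisation event later, off the
    # hot path of the bulk run (or batch it for later processing).
    ...


@receiver(post_anonymise)
def queue_anonymisation_log(sender, instance, **kwargs):
    # Hand the log write to Celery instead of blocking the bulk loop.
    write_anonymisation_log.delay(instance._meta.label, instance.pk)
```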
Thanks @jamesoutterside - just (belatedly) acknowledging your comments here. I hope to have a bit more of a look at this soon. If I remember correctly, I kept a note of some of the anonymisation throughput/benchmark figures while working on the initial pull requests, so there may be some data near-ready to provide.
Some references here:
Analysis / Benchmarking
- https://gitlab.collabora.com/tools/chronophage/-/merge_requests/815#note_93443
  - adding anonymisation of a string field on a model with ~650k records resulted in a 12h bulk anonymisation duration
  - commentary and explanation of the changes applied to bring the bulk anonymisation duration down to 3h
Deployment
- https://gitlab.collabora.com/tools/chronophage/-/merge_requests/820
  - from: https://github.com/jayaddison-collabora/django-gdpr-assist.git@1baf994e21575074d0d9b03afbc07236d8c88061
  - to: https://github.com/jayaddison-collabora/django-gdpr-assist.git@77838823cd3c4ac221b428a0e3b093a20e848f19
  - results: bulk anonymisation duration reduced from 12h to 3h