Now is a good time to sanitize logs of behavioural data. This is a collection of pointers to software and guidance on how to do that.
Suppose you have a bunch of logs of behavioural data that is not anonymized, and you would like to preserve something in their place which:
1. does not reveal private information about your users
2. is still useful for learning about them
Doing just (1) is trivial: delete the logs. Doing just (2) alone is also trivial: keep everything. What can be done to achieve both? Build a derivative synthetic dataset that is differentially private and preserves, up to some order, the joint marginal probability distribution over the data. We can then delete the original and keep the synthetic dataset to learn from in the future.
> Differential privacy guarantees that nothing that could not happen without access to your data will happen with access to your data. In fact it makes a stronger quantitative guarantee: the chance that any specific thing happens (really, anything at all) with access to your data is at most a multiple X of the chance it would happen without your data. That "multiple" X is part of the guarantee and it determines how much privacy you get: a value of 1.0 would be perfect privacy (which by definition ignores your data), small values like 1.01 are pretty great, whereas values like 10.0 are less amazing but still non-trivial.
>
> -- Differential privacy: An illustrated primer
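Restated in the standard formalism (ε, M, D, D′ are the usual textbook symbols here, not notation taken from the primer):

```latex
% epsilon-differential privacy: for any datasets D, D' differing in one
% user's record, and any set S of possible outcomes of the mechanism M,
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
% The "multiple" X above is e^{epsilon}: epsilon = 0 gives X = 1.0 (perfect
% privacy), epsilon = 0.01 gives X ~ 1.01, epsilon ~ 2.3 gives X ~ 10.
```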
We proceed under the assumption of one record per user (TODO: discuss practical steps to take when a user contributes multiple records).
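As a minimal sketch of what such a derivative dataset can look like under that assumption (everything here is illustrative: the function, its parameters, and the restriction to one-way marginals are my choices, not anything from the tools below):

```python
import numpy as np

def dp_synthetic_marginals(records, column_domains, epsilon, n_synthetic,
                           rng=None):
    """Hypothetical helper: Laplace-noised one-way marginals, then an
    independent synthetic sample drawn from them.

    records        -- list of dicts, one per user (one record per user,
                      so each user shifts each column's histogram by <= 1)
    column_domains -- dict: column name -> list of possible values
    """
    rng = rng or np.random.default_rng()
    # Each user's record touches every column, so split the budget
    # across columns (sequential composition).
    eps_col = epsilon / len(column_domains)
    synth_cols = {}
    for col, domain in column_domains.items():
        counts = np.array([sum(r[col] == v for r in records) for v in domain],
                          dtype=float)
        # A histogram has sensitivity 1, so Laplace(1/eps_col) noise makes
        # this column's marginal eps_col-differentially private.
        noisy = counts + rng.laplace(scale=1.0 / eps_col, size=len(domain))
        probs = np.clip(noisy, 0.0, None) + 1e-9  # floor avoids an all-zero corner case
        probs /= probs.sum()
        synth_cols[col] = rng.choice(len(domain), size=n_synthetic, p=probs)
    return [{col: column_domains[col][idx[i]] for col, idx in synth_cols.items()}
            for i in range(n_synthetic)]
```

Because each column is sampled independently, this only preserves first-order marginals; matching pairwise and higher-order joint marginals is exactly what the schemes below add.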
- RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response https://github.com/google/rappor https://arxiv.org/abs/1407.6981
- Has the client send the data already transformed, so that the server never needs to see the non-anonymized version (the underlying randomized-response trick is sketched at the end of this section)
- Originates at Google; a production-quality implementation is part of Chromium: http://www.chromium.org/developers/design-documents/rappor
- A blog post covering the release: https://security.googleblog.com/2014/10/learning-statistics-with-privacy-aided.html
- The Exponential Mechanism with Multiplicative Weights (MWEM) paper
- A simple and fairly practical scheme for the batch, one-off setting (a sketch also appears at the end of this section)
- DPSynthesizer (source?) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4496798/
- Differential Privacy: An Economic Method for Choosing Epsilon https://arxiv.org/abs/1402.3329
- How Can We Analyze Differentially-Private Synthetic Datasets? http://repository.cmu.edu/cgi/viewcontent.cgi?article=1059&context=jpc
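To make the RAPPOR pointer concrete: the core client-side trick is randomized response, shown here in a stripped-down boolean form (RAPPOR itself reports Bloom-filter bits with both permanent and instantaneous randomization; this sketch is only the underlying idea, with names of my choosing):

```python
import random

def rr_report(truth: bool, p: float = 0.75) -> bool:
    """Client side: send the true bit with probability p, the flipped bit
    otherwise. With p = 0.75 this is ln(p/(1-p)) = ln 3 differentially
    private, and the server never sees the raw value."""
    return truth if random.random() < p else not truth

def rr_estimate(reports, p: float = 0.75) -> float:
    """Server side: unbias the aggregate. If t is the true rate of True,
    E[observed] = t*p + (1-t)*(1-p); solving for t gives the estimator."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)
```

Each individual report is deniable, yet the population-level rate is recovered with error shrinking on the order of 1/sqrt(number of reports).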
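And for the MWEM entry, a compact sketch of the batch scheme under stated assumptions (function and variable names are mine; a real run needs the data domain to be small enough to enumerate as a histogram):

```python
import numpy as np

def mwem(true_hist, queries, epsilon, iters, rng=None):
    """MWEM sketch (Hardt, Ligett, McSherry). true_hist is a histogram over
    a small discrete domain; queries is a list of 0/1 indicator vectors of
    the same length (counting queries). Returns a synthetic histogram whose
    answers approximate the true ones on all the queries."""
    rng = rng or np.random.default_rng()
    n = true_hist.sum()
    synth = np.full(len(true_hist), n / len(true_hist))
    eps_round = epsilon / iters  # half for selecting, half for measuring
    measured = {}
    for _ in range(iters):
        # Exponential mechanism: prefer the query the synthetic data
        # currently answers worst (the error score has sensitivity 1).
        errors = np.array([abs(q @ synth - q @ true_hist) for q in queries])
        weights = np.exp((eps_round / 4) * (errors - errors.max()))
        i = rng.choice(len(queries), p=weights / weights.sum())
        # Laplace mechanism: noisy measurement of the chosen query.
        if i not in measured:
            measured[i] = queries[i] @ true_hist + rng.laplace(scale=2 / eps_round)
        # Multiplicative weights update: nudge the synthetic histogram
        # toward agreeing with the measurement, then renormalize.
        err = measured[i] - queries[i] @ synth
        synth = synth * np.exp(queries[i] * err / (2 * n))
        synth *= n / synth.sum()
    return synth
```

The returned histogram can then be sampled to produce synthetic records, after which the original logs can be deleted as described above.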