Practical-Private-Logs

Now is a good time to sanitize logs of behavioural data. This is a collection of pointers to software and guidance on how to do that.

Suppose you have a bunch of logs of behavioural data that are not anonymized, and you would like to preserve something in their place which:

  1. does not reveal private information about your users
  2. remains useful for learning about them

Doing just (1) is trivial: delete the logs. Doing just (2) alone is also trivial: keep everything. What can be done to achieve both? Generate a derivative synthetic dataset that is differentially private and preserves the joint distribution of the data up to marginals of some order (for order two, say, all pairwise co-occurrence frequencies). We can then delete the original and keep the synthetic dataset to learn from in the future.
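For concreteness, here is a minimal sketch of that idea in Python, under the one-record-per-user assumption stated below: tabulate the full joint histogram over a small, fully enumerated domain, add Laplace noise with scale 1/epsilon, and sample synthetic records from the renormalized result. Everything here (the function `dp_synthetic`, the toy log) is illustrative rather than any particular library's API; real tools fit a collection of low-order marginals instead of the full joint, which becomes infeasible beyond a few columns.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def dp_synthetic(records, domains, n_out, epsilon):
    """records: one tuple of categorical values per user.
    domains: the possible values of each column.
    Returns n_out synthetic records sampled from a noisy joint histogram."""
    # Enumerate every cell of the full domain, including empty ones;
    # noising only the observed cells would leak which cells are occupied.
    cells = list(itertools.product(*domains))
    index = {cell: i for i, cell in enumerate(cells)}
    counts = np.zeros(len(cells))
    for rec in records:
        counts[index[tuple(rec)]] += 1
    # Laplace mechanism: with one record per user, adding or removing a
    # user changes one count by 1, so noise with scale 1/epsilon suffices.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    # Clipping, renormalizing, and sampling are post-processing,
    # so they do not weaken the differential privacy guarantee.
    probs = np.clip(noisy, 0.0, None)
    probs /= probs.sum()
    picks = rng.choice(len(cells), size=n_out, p=probs)
    return [cells[i] for i in picks]

# Toy behavioural log: one (browser, clicked) record per user.
logs = [("firefox", "yes"), ("chrome", "no"), ("chrome", "yes"),
        ("firefox", "no"), ("chrome", "yes"), ("safari", "no")]
domains = [("firefox", "chrome", "safari"), ("yes", "no")]
synthetic = dp_synthetic(logs, domains, n_out=100, epsilon=1.0)
```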

Differential privacy guarantees that nothing that could not happen without access to your data will happen with access to your data. In fact it makes a stronger quantitative guarantee: The chance that any specific thing happens (really, anything at all) with access to your data is at most a multiple X of the chance it would happen without your data. That "multiple" X is part of the guarantee and it determines how much privacy you get: a value of 1.0 would be perfect privacy (which by definition ignores your data), small values like 1.01 are pretty great, whereas values like 10.0 are less amazing but still non-trivial. -- Differential privacy: An illustrated primer
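To make the quoted guarantee precise: in the standard definition the "multiple" X is written e^ε. A randomized mechanism M is ε-differentially private if, for every pair of datasets D and D' differing in one person's record, and every set of outcomes S:

```latex
\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

So X = 1.01 corresponds to ε = ln 1.01 ≈ 0.01, and X = 10 to ε = ln 10 ≈ 2.3; smaller ε means stronger privacy.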

We proceed under the assumption of one record per user; a minimal way to enforce this is sketched below (TODO: discuss practical steps to take when a user contributes multiple records).
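Where the assumption does not already hold, one common preprocessing step is to collapse each user's events to a single record before any private computation. A minimal sketch, assuming events are dicts with a hypothetical user_id field; the keep-the-last-event policy is just one possible choice:

```python
# Minimal sketch: enforce one record per user by keeping only the last
# event seen for each user. The 'user_id' field and the keep-last policy
# are illustrative assumptions, not a fixed recipe.
def one_record_per_user(events):
    latest = {}
    for event in events:
        latest[event["user_id"]] = event  # later events overwrite earlier
    return list(latest.values())
```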

Methods (and implementations)

How To

Guides and experiences

Differential Privacy – A Primer for the Perplexed