diffix/reference

Align on functionality for v0


This decision takes a step back to align on what features we should support in a version 0, and why.

We are eventually targeting multiple levels of protection (lacking names for these versions as of yet):

  • (a) protecting against accidental disclosure
  • (b) very strong, but potentially not safe against a very malicious and resourceful attacker
  • (c) bulletproof, safe to open to the world

In terms of the complexity of the anonymization the versions don't follow an obvious progression.
(a) has the lowest implementation complexity and lowest amount of protections. Most things are allowed and no sophisticated protections are offered. Our system really doesn't care much what the user does.
In terms of overall complexity, (c) is likely next, as I believe it will offer a restricted subset of the functionality of (b) that we deem to be safe.

I believe whatever functionality we build for version (a) is also required in the other two versions. Therefore I think the goal should be to start with version (a) and do as follows:

  • build an early reference implementation of (a)
  • let people play with the reference implementation
  • if not useful then revert and think some more
  • if moderately useful then make a Postgres implementation of (a) that is actually usable in the wild
  • subsequently continue with (b) or (c)

Functionality I think should be part of a first iteration of (a):

  • anonymizing aggregates (let's start with count, sum and avg?)
  • low count filtering
  • support for multiple aids (this keeps being a recurring issue and requirement for people we talk to)
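
To make the first two items concrete, here is a minimal sketch of an anonymizing count with low count filtering. It is an illustration only, not the Diffix algorithm: the function name `anonymized_counts`, the fixed `threshold`, the Gaussian noise, and the single-AID assumption are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def anonymized_counts(rows, group_key, aid_key, threshold=5, noise_sd=1.0, seed=0):
    """Count distinct AIDs per group, suppress small groups, add noise.

    Hypothetical sketch: a fixed threshold and plain Gaussian noise are
    assumptions for illustration, not the actual Diffix mechanisms.
    """
    rng = random.Random(seed)

    # Collect the set of distinct AIDs seen in each group (single AID column).
    aids_per_group = defaultdict(set)
    for row in rows:
        aids_per_group[row[group_key]].add(row[aid_key])

    result = {}
    for group, aids in aids_per_group.items():
        # Low count filtering: drop groups with too few distinct entities.
        if len(aids) < threshold:
            continue
        # Anonymizing aggregate: report a noisy, non-negative integer count.
        noisy = len(aids) + rng.gauss(0, noise_sd)
        result[group] = max(0, round(noisy))
    return result


if __name__ == "__main__":
    rows = [{"city": "Berlin", "uid": i} for i in range(10)]
    rows += [{"city": "Kiel", "uid": 100 + i} for i in range(2)]
    # "Kiel" has only 2 distinct users and is suppressed.
    print(anonymized_counts(rows, "city", "uid"))
```

Supporting multiple AIDs would presumably mean applying the threshold per AID column, which this sketch does not attempt.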

What we initially skip:

  • remaining aggregates
  • window functions

Thoughts? @diffix/developers, @yoid2000, @fjab

fjab commented

Agree with most. Whether we are actually targeting levels a-c for an eventual real tool is as of yet unknown.

Regarding functionality, I always find it interesting to look at the Google Differential Privacy project: https://github.com/google/differential-privacy. They have count, sum, mean, variance, stdev, order statistics (incl. min, max, median...) and "automatic bounds approximation". They probably did a bunch of research on what is needed (and feasible for them), so it's an interesting immediate point to aim for.

We are eventually targeting multiple levels of protection (lacking names for these versions as of yet):

(a) protecting against accidental disclosure
(b) very strong, but potentially not safe against a very malicious and resourceful attacker
(c) bulletproof, safe to open to the world

I disagree that version (b) should be attackable. I think we should be targeting the same level of protection as the Aircloak product, which is GDPR-strength, no known vulnerabilities. The distinction between b and c is the amount of scrutiny each mechanism has received.

I'll set up a list of criteria for how to gauge safety, possibly including criteria like:

  • how many DPAs have approved the features
  • how many years of public scrutiny
  • how many different teams have done vulnerability analysis
  • how many rounds of bounty program
  • etc.

build an early reference implementation of (a)
let people play with the reference implementation
if not useful then revert and think some more

This doesn't make a lot of sense in terms of testing what features people need, because it has all features. If (a) is not usable, it will be because the amount of protection it gives isn't attractive to users. In other words, they'd rather just go with pseudonymization.

If we want to test utility, then we need to start with a relatively bare-bones set of features and add to that. Interestingly, a bare-bones set of features matches level (c) the best.

Actually, I suggest we start with one aggregate, 'count(distinct AID)', and GROUP BY. Just that.

This actually gives genuine (if limited) utility and is among the simplest to implement. It allows for counting users, histograms, and AND logic by virtue of the GROUP BY. If one wants buckets, one can do so with a view. Min, max, average, standard deviation, and median can be estimated from the histogram.
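
As a rough sketch of that last point, the statistics can be approximated from bucketed counts alone. The function name `histogram_stats` and the `(lower, upper, count)` bucket layout are assumptions for this example, and the estimates treat values as spread evenly within each bucket, so they are only as precise as the bucket width allows.

```python
import math


def histogram_stats(buckets):
    """Estimate summary statistics from a histogram.

    `buckets` is a list of (lower, upper, count) tuples sorted by lower
    bound, e.g. the output of a bucketed count(distinct AID) ... GROUP BY.
    Hypothetical helper: values are assumed uniform within each bucket.
    """
    total = sum(count for _, _, count in buckets)
    if total <= 0:
        raise ValueError("histogram is empty")

    non_empty = [b for b in buckets if b[2] > 0]
    est_min = non_empty[0][0]    # lower edge of first non-empty bucket
    est_max = non_empty[-1][1]   # upper edge of last non-empty bucket

    # Mean and standard deviation from bucket midpoints.
    mean = sum((lo + hi) / 2 * c for lo, hi, c in buckets) / total
    variance = sum(c * ((lo + hi) / 2 - mean) ** 2 for lo, hi, c in buckets) / total

    # Median: linear interpolation inside the bucket holding the half mark.
    half = total / 2
    cumulative = 0
    median = est_max
    for lo, hi, c in buckets:
        if c > 0 and cumulative + c >= half:
            median = lo + (half - cumulative) / c * (hi - lo)
            break
        cumulative += c

    return {
        "min": est_min,
        "max": est_max,
        "mean": mean,
        "stddev": math.sqrt(variance),
        "median": median,
    }


if __name__ == "__main__":
    print(histogram_stats([(0, 10, 10), (10, 20, 20), (20, 30, 10)]))
```

With noisy and low-count-filtered buckets the estimates degrade further, particularly min and max, since sparse tail buckets are the ones most likely to be suppressed.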

We still need to agree on what attacks we allow with (a), actually. What I have in mind requires that we add noise proportional to contribution and apply flattening. These require per-aggregate functionality, so (a) is not nearly as simple as a bare-bones implementation of (c).

Regarding functionality, I always find it interesting to look at the Google Differential Privacy project: .... They probably did a bunch of research on what is needed (and feasible for them), so it's an interesting immediate point to aim for.

Ha ha. I doubt they did much research on what is needed. Rather, this list is kind of a no-brainer set of things. But certainly they have done tons and tons of research on what is feasible.

This doesn't make a lot of sense in terms of testing what features people need, because it has all features. If (a) is not usable, it will be because the amount of protection it gives isn't attractive to users. In other words, they'd rather just go with pseudonymization.

Note that I never said anything about testing the utility of features. What we should evaluate is whether it is a point in the design space that makes sense to build and invest further in.

I disagree that the reference implementation needs to support all features. Sure, a real implementation eventually would. Design point (a) has the benefit that we don't care whether a certain function opens up an attack or not, because we are by design not protecting against it. Hence we don't need to implement all functionality in our reference implementation as it doesn't add any value in terms of testing at all.

The transition from reference implementation to real world implementation of (c) seems more complex to me. There are multiple reasons for this:

  • The stakes are much higher because (c) is targeting a very high level of protection and is meant to be entirely safe
  • We need to ensure that there are no ways of circumventing our set of restricted functionality that would break the anonymity
  • If we go by a set of criteria as specified by you where the functionality included in (c) needs to have been through multiple rounds of attack challenges before it can be released, then it seems rather silly to start with (c) as an initial candidate as it means we cannot release anything useful for the foreseeable future.

Even though the ultimate goal of (a) is to allow everything, I don't think we should strive to allow everything from the very beginning. I think we should start with a small subset of aggregates. Whether that is just implementing count(distinct aid) or it is implementing a different subset of aggregates is something I am happy to discuss. Then we can quickly add more functionality as we go.

In summary: I think the progression should be (a), then (b), then (c). All of these versions start out with a reduced scope, and once a piece of functionality has proven itself at one level, it progresses up through the ranks if it is safe. Either way, (a), which is the lowest level of trust, seems like where we need to start.


You seem to specify how one might classify whether something is safe for a version. I think that misses the point.
What we need to agree on is what a certain level should be and who it is for. How we determine whether a given functionality is safe for a particular level or not is an important implementation detail, but an implementation detail all the same.

What I would like us to agree on is:

  • who the given levels are for, and what type of user and usage they target
  • what environment could such a protection level be used in

With those types of criteria in mind my sense has been that we are working towards:

  • (a) a level that offers minimal protection. It can act as a nice safeguard internally in an organization, but should not be deemed safe for giving untrusted parties access to data. This level likely does not make it possible to work with data that the analyst couldn't otherwise already work with, but is a nice safeguard to avoid accidentally seeing the data of individual users, and for producing reports that can safely be shared with others
  • (b) a level that requires significant effort, and malicious intent, to break. Effort to the point where it is unrealistic that someone who is an employee in an organization and has real work to do would ever bother. Even if they did bother they would leave very clear traces of malicious behavior, and could be punished accordingly. This level aims to make it possible for analysts to work on data that otherwise would be out of bounds.
  • (c) a level with protections so strong that one can, without meaningful risk, open a dataset to the world or to a set of anonymous and untrusted third parties. It should offer protections so strong that even with time, resources, and malicious intent, an attacker should not manage to extract meaningful information

We can then spend time discussing how a piece of functionality needs to prove itself before it can be included in a given level. But I don't think that is relevant for this discussion.

(a) a level that offers minimal protection. It can act as a nice safeguard internally in an organization, but should not be deemed safe for giving untrusted parties access to data. This level likely does not make it possible to work with data that the analyst couldn't otherwise already work with, but is a nice safeguard to avoid accidentally seeing the data of individual users, and for producing reports that can safely be shared with others

In my mind, the target here is something where any data revealed to the public by a non-malicious analyst is GDPR anonymous. The fact that this might be very useful as a safeguard internally in an organization is great, but not the target. We need to discuss.

(b) a level that requires significant effort, and malicious intent, to break. Effort to the point where it is unrealistic that someone who is an employee in an organization and has real work to do would ever bother. Even if they did bother they would leave very clear traces of malicious behavior, and could be punished accordingly. This level aims to make it possible for analysts to work on data that otherwise would be out of bounds.

So are you suggesting we step away from the "no known vulnerabilities" goal, and instead have something that has known vulnerabilities that we consider too expensive to exploit? I disagree, but again, let's discuss.