diffix/reference

How do we handle multiple AIDs per row

Closed this issue · 4 comments

This is mostly a note that results from thinking through low effect detection.
It might turn out to be superfluous, but I wanted to have it recorded in case it's not.


In our new design we aim to make the query writing experience more natural. For example we allow
a mix of per user aggregates, and cross user aggregates, in different parts of a query, without forcing
anonymization to take place.

When a cross user aggregate takes place, we need to account for the fact that the resulting rows
pertain to multiple individuals, and what each users contribution is in a given row.

I believe this is also the case when not dealing with aggregates.

While we in Aircloak Insights ensured that any single non-aggregate row belonged to a single identifiable user,
and did so by requiring a JOIN condition on the AID column, we do not enforce this in Open Diffix.

Take the following query as an example of what should be allowed:

SELECT height, amount
FROM (
  SELECT height
  FROM users
) t, (
  SELECT amount
  FROM transactions
) s

The cross join ensures that each row is associated with one or two AIDs.
I.e. the resulting table might be something like the following, even no aggregation has taken place.

Height Amount AID contribution
183 200 [aid1]
181 200 [aid1; aid2]
183 200 [aid2; aid3]
199 111 [aid3]

I wonder if we shouldn't get into the mindset of always thinking of rows to belonging to one or more users whether or not the data is aggregated.

We could easily extend the AID contribution information to contain a per user count (here denoted by a tuple of form aid, number of occurrences:

Height Amount AID contribution
183 200 [aid1, 1]
181 200 [aid1, 1; aid2, 1]
183 200 [aid2, 1; aid3, 1]
199 111 [aid3, 1]

which would map perfectly onto the result of a subquery that does aggregation as well, which could yield a table such as:

Height Amount AID contribution
183 200 [aid1, 3]
181 200 [aid1, 2; aid2, 1]
183 200 [aid2, 1; aid3, 3]
199 111 [aid3, 10]

Please note that in all this I am not thinking about or considering the support for different AID classes. These are all problems we need a solution for, even in the base case of our system only providing support for a single AID type per database!

Hasn't this been already discussed in #15 and it is now specified in the "Multiple AIDs" section?

A table may have one or more AID columns (columns labeled as being AIDs). When there is more than one AID in a query (either because there are multiple AIDs in a table or tables have been joined), by default, Diffix treats them as distinct. In other words, as though they refer to different entities. In so doing, Diffix handles AIDs of different types seamlessly, and also is robust in cases where JOIN operations incorrectly mix AIDs together (i.e. a row for user1 is joined with a row for user2).

Hasn't this been already discussed in #15

Yes it has. Sebastian please have a look at what is written there and let us know if you still think there are issues...

Superseded by design docs