WICG/trust-token-api

Issuer to origin information rate is surprisingly high

martinthomson opened this issue · 3 comments

Each issuer is identified by its origin. A top-level origin can use two issuers, and those two issuers can belong to the same site, which effectively squares the amount of information that a single site acting as issuer can transmit. Given that an issuer can have 6 active keys (VOPRF), or 3 active keys plus 1 bit of private metadata (PMB), this allows the issuer to partition its user population into 36 different cohorts.

As the API is prompt-free and can be iterated over time, and the issuer is able to rotate keys and issue new tokens every 24 hours (on average), this is an overall information transfer rate of ~5.17 bits per day. That is higher than the document claims and enough to uniquely identify all (or most) Web users in less than 6 days.
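The arithmetic behind that rate, as a sketch (the "two issuers, six states each, one cycle per day" model is taken from the paragraphs above):

```python
import math

STATES_PER_ISSUER = 6   # 6 VOPRF keys, or 3 PMB keys x 1 private metadata bit
ISSUERS_PER_ORIGIN = 2  # two issuers, potentially belonging to the same site

cohorts = STATES_PER_ISSUER ** ISSUERS_PER_ORIGIN   # 36 distinguishable cohorts
bits_per_day = math.log2(cohorts)                   # one issuance cycle per day
print(round(bits_per_day, 2))  # -> 5.17
```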

Yes to first paragraph.

During the issuance operation, tokens generated from old keys are deleted (step 10 in section 6.1, https://wicg.github.io/trust-token-api/#issue-request). Bits stored in previously issued tokens are lost, which defeats the ~5.17 bits per day argument.

I don't see how discarding old values helps if there are 6 (or 3) values being passed.

After reviewing the algorithm more closely, I'm now questioning the claims about information transfer. The values 6 and 3 are not mentioned in any of the algorithms in the specification; this issue was based on the claim in the introduction, which says:

A token stores a value from a set of six values

But I can't see how the algorithm supports that conclusion. To be clear, this might be good from a privacy perspective, but this needs to be clearer if the algorithms cannot deliver on this promise.

From what I can see in the algorithms, the user agent only picks the "latest"1 unexpired key from the key commitment, which - assuming that the key consistency mechanism works - would seem to make it impossible for an issuer to have its clients partitioned. It could only choose different keys when new commitments are made (where it can update keys, identifiers, or expiration times). But then the issuer is not really able to choose which key is used and so it can't choose from one of 6 values. Only the private metadata bit carries information now.
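A sketch of that selection, assuming a key commitment is an ordered list of keys with expiry timestamps (the structure and names here are hypothetical, not taken from the spec):

```python
import time

def pick_issuance_key(commitment, now=None):
    """Pick the 'latest' unexpired key. Per footnote 1, 'latest' means
    the last unexpired entry in the commitment's ordered key list."""
    now = time.time() if now is None else now
    unexpired = [k for k in commitment["keys"] if k["expiry"] > now]
    return unexpired[-1] if unexpired else None
```

Because the client makes this choice deterministically from the committed list, a consistent commitment leaves the issuer no per-client freedom in which key is used.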

A weaker consistency scheme2 would allow a server to segment users by providing different expiration dates for the same keys, so that clients could be put into one of the six (or three) groups at the time they are provided with key configuration. Of course, with a weaker consistency scheme the issuer could just use the 32-bit key identifier to group clients, probably into more numerous groups than just 6 (or 3), because the key identifier is carried throughout the entire protocol flow.
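To illustrate the expiry-based segmentation (a self-contained sketch; the key-list structure is hypothetical): the issuer serves the same keys to everyone but tailors the expiry times per client, steering each group's "latest unexpired" pick to a different key.

```python
def latest_unexpired_id(keys, now):
    # The same "latest unexpired key" rule the client applies at issuance.
    live = [k for k in keys if k["expiry"] > now]
    return live[-1]["id"] if live else None

now = 1_000
keys_for_group_a = [{"id": "k1", "expiry": 2_000},
                    {"id": "k2", "expiry": 2_000},
                    {"id": "k3", "expiry": 2_000}]
# Group B gets the same keys, but k3 is reported as already expired,
# so B's clients deterministically settle on k2 instead of k3.
keys_for_group_b = [{"id": "k1", "expiry": 2_000},
                    {"id": "k2", "expiry": 2_000},
                    {"id": "k3", "expiry": 500}]

print(latest_unexpired_id(keys_for_group_a, now))  # -> k3
print(latest_unexpired_id(keys_for_group_b, now))  # -> k2
```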

If old values really are discarded and the claims about the number 6 are simply incorrect, that's much better from a privacy standpoint. Then only the two issuers can each pass a single bit of information, so only 2 bits per day (or 4 with PMB).

Footnotes

  1. The "latest" is defined based on the order in which keys appear in the list.

  2. Consistency is an aspect of the specification that is not clearly enough specified for me to judge how strong the related claims really are.

Yeah, I think this is an inconsistency in how we were thinking about the "6 keys": as separate keys or as sets of keys. I've filed #233 to fix the spec to clarify that the procedure returns the six most recent unexpired keys. That, combined with some implementation-defined limit on how often the issuer can change their keys (currently planned as 60 days in Chrome), should hopefully avoid the additive identification of users in the short term (though it's true that your attack can just be run 60 times more slowly if the issuer rotates its keys at every rotation opportunity).
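The slowdown that rate limit buys, roughly (a sketch reusing the original ~5.17 bits-per-rotation model; the 60-day figure is the planned Chrome limit mentioned above):

```python
import math

bits_per_rotation = math.log2(36)   # ~5.17 bits, from the original attack model
rotation_interval_days = 60         # implementation-defined key-change limit
bits_per_day = bits_per_rotation / rotation_interval_days
print(round(bits_per_day, 3))  # -> 0.086
```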