w3c/trace-context

Can the intelligence-free nature of ids be confirmed or audited by external parties?

danielkhan opened this issue

See https://lists.w3.org/Archives/Public/public-trace-context/2020Feb/0004.html

Mitigations are mentioned that consumers of these fields can use to try to audit the randomness of identifiers, and the spec provides normative requirements, with justification, for why identifiers should be chosen to be globally random.

The spec contains both Privacy Considerations and Security Considerations sections. Reviewing their updated states raised several questions and concerns for me.

Note that these privacy concerns of the traceparent field are theoretical rather than practical.

This sentence seems to be incorrect and unhelpful. The privacy risks documented in the previous paragraphs seem entirely feasible, and particular normative requirements exist to mitigate them; “theoretical rather than practical” implies that they would not arise in any anticipated use, which does not seem to be the case. While it’s not uncommon for a privacy considerations section to consist of arguments for why the described privacy risks are not serious (and to discourage use of known mitigations), it’s not clear that such arguments help implementers or end users.

Similarly, these normative requirements are potentially in conflict and appear to discourage mitigation of identified privacy risks:

Vendors extremely sensitive to personal information exposure MAY implement selective removal of values corresponding to the unknown keys. Vendors SHOULD NOT mutate the tracestate field, as it defeats the purpose of allowing multiple tracing systems to collaborate.

Is removing values from the tracestate field allowed or prohibited?

And it’s not clear how a potential implementer should determine whether it’s “extremely sensitive to personal information”: is that different from being compliant with binding legal requirements in multiple jurisdictions? Maybe a better, less editorialized phrasing would be to say that “Vendors MAY remove values in order to limit disclosure of personal information.”
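For what it’s worth, the selective-removal mitigation itself is simple to implement. A hypothetical sketch (the allow-list and function name are mine, though the example keys come from the spec’s own tracestate examples):

```python
# Hypothetical sketch: drop tracestate list members whose keys this
# vendor does not recognize, before forwarding the header downstream.

KNOWN_KEYS = {"rojo", "congo"}  # illustrative allow-list, not from the spec

def filter_tracestate(tracestate: str) -> str:
    """Keep only list members whose key is on the vendor's allow-list."""
    kept = []
    for member in tracestate.split(","):
        member = member.strip()
        if not member:
            continue  # tolerate empty members between commas
        key = member.split("=", 1)[0]
        if key in KNOWN_KEYS:
            kept.append(member)
    return ",".join(kept)

# "rojo=00f067aa0ba902b7,congo=t61rcWkgMzE,other=secret"
# becomes "rojo=00f067aa0ba902b7,congo=t61rcWkgMzE".
```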

Vendors should ensure that they include only these response headers when responding to systems that participated in the trace.

I think “only” is misplaced and the intended sentence is:

[potential correction:] Vendors should ensure that they include these response headers only when responding to systems that participated in the trace.

Is this a normative SHOULD? Throughout the Privacy Considerations section, I’m often uncertain whether these terms are being used as normative requirements or not.

The Security Considerations section also has many statements that use RFC 2119 terms, but none are in all caps, and it’s not clear whether these are intended as normative requirements.

@mtwo:

Can the intelligence-free nature of these identifiers be confirmed or audited by external parties?
Identifier randomness is up to implementations, with OpenTelemetry likely being the most prolific of these. OpenTelemetry is fully open source and auditable.

I think it was @samuelweiler in particular who had noted the concern about auditability of identifiers, so he might have more specific comments on this issue and this additional information.

It looks like Section 6.1 of the current Rec notes an algorithm that could be used to verify that an identifier is just a random hash of a timestamp, but doesn't recommend it. I don't know whether vendors who would make use of this spec would be interested in such an algorithm, whether auditing a common set of code (OpenTelemetry, as suggested) would be more useful, or whether vendors are in a position to help users by actively identifying when these cross-system trace headers may include additional user info.

mtwo commented

Effectively all of the vendors associated with the spec are adopting and contributing to OpenTelemetry, though some may also support their own implementations (for example, services inside of Google Cloud and Azure, and non-OpenTelemetry agents belonging to Dynatrace and New Relic). It's unlikely that end users will ever generate their own identifiers; rather, the identifiers will be created by the OpenTelemetry SDKs or by Dynatrace's and New Relic's agents.
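For reference, fully random generation along the lines of what these SDKs do is cheap. A minimal sketch (not taken from any particular SDK) that builds a spec-format traceparent from random bytes:

```python
import secrets

def new_traceparent() -> str:
    """Build a version-00 traceparent with fully random identifiers."""
    trace_id = secrets.token_bytes(16)   # 16-byte trace-id
    parent_id = secrets.token_bytes(8)   # 8-byte parent-id
    # All-zero identifiers are invalid per the spec; regenerate if hit.
    while trace_id == bytes(16):
        trace_id = secrets.token_bytes(16)
    while parent_id == bytes(8):
        parent_id = secrets.token_bytes(8)
    flags = b"\x01"  # sampled
    return "00-{}-{}-{}".format(trace_id.hex(), parent_id.hex(), flags.hex())

# e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
```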

I second @mtwo, but I think we should really think about getting rid of paragraphs like this going forward:

Services may also define an algorithm and audit mechanism to validate the randomness of incoming or outgoing random numbers in the traceparent field. Note that this algorithm is services-specific and not a part of this specification. One example might be a temporal algorithm where a reversible hash function is applied to the current clock time. The receiver can validate that the time is within agreed upon boundaries, meaning the random number was generated with the required algorithm and in fact doesn't contain any personally identifiable information.

I think that this does not add anything to the spec as such. @SergeyKanzhelev WDYT?
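For anyone weighing that paragraph: as I read it, the temporal algorithm would look something like the toy sketch below, where the XOR keystream is my stand-in for a “reversible hash function” and the shared key is assumed to be agreed out of band. It is illustrative only; a constant keystream is trivially attackable, and a real design would use a proper reversible cipher.

```python
import hashlib
import os
import time

SHARED_KEY = b"example shared secret"  # assumed out-of-band agreement

def _keystream(n: int) -> bytes:
    # Toy keystream; NOT secure, stands in for a real reversible function.
    return hashlib.sha256(SHARED_KEY).digest()[:n]

def make_trace_id() -> bytes:
    """Embed the current time reversibly in an otherwise random trace-id."""
    ts = int(time.time()).to_bytes(8, "big")
    body = ts + os.urandom(8)  # 8 bytes timestamp + 8 random bytes
    return bytes(a ^ b for a, b in zip(body, _keystream(16)))

def looks_valid(trace_id: bytes, max_skew_s: int = 300) -> bool:
    """Receiver check: recover the timestamp and verify it is recent."""
    body = bytes(a ^ b for a, b in zip(trace_id, _keystream(16)))
    ts = int.from_bytes(body[:8], "big")
    return abs(time.time() - ts) <= max_skew_s
```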

Sorry, I have to disagree with @mtwo here. A spec is a spec; let's not conflate it with implementations, please. There are some implementations now, not all of which are mentioned here, and there will be new implementations in the future. The spec should decide what constraints to provide independently of what any particular implementation may or may not be likely to do.

So I do agree with @danielkhan on removing terminology that is too specific to implementations and carries no weight in terms of a semantic spec.

I also agree with @danielkhan here. If the spec states that IDs are generated randomly, then their intelligence-free nature is intrinsic (assuming a good implementation).

If the spec states that IDs are generated randomly

does it? When I search for "random", I only see it mentioned in passing in privacy.md, not in the main description of the trace-id field.

For the record, I am against requiring full randomness. It's an implementation detail, and different sites may have different reasons for reserving some bits of the ID for specific semantics.

If that's the case, then IDs aren't really intelligence-free, are they? If some bits of the ID convey semantic meaning, that is intelligence. It may not be user-identifiable intelligence, but it is intelligence nonetheless. Maybe we should prescribe that IDs should not contain, or be derived from, any PII?

Yes, that: why is "intelligence-free" a requirement rather than "no PII"? There are plenty of scenarios where trace IDs are generated internally, not on end-user devices. Even PII may not be a concern internally (generally, employers are allowed to monitor their employees). So let's scope the problem appropriately and not use a hammer.

The spec already mentions that PII should not be used to generate trace ids. There are no additional actionable items here from the spec perspective, per the discussion in today's working group meeting. Hence, closing this issue.

I'm unclear on why this issue was closed or why the specific issues raised were considered not actionable.

The WG seems undecided on whether traceparent ids are randomly-generated and intelligence-free or whether they can be generated however the implementer wants but shouldn't include personally-identifiable information (a term not defined in the spec). The spec says that the id consists of randomly-generated numbers, but it's not clear that's actually the intention. If these ids are required to be random, then it would be reasonable to address the question from @samuelweiler about auditing or otherwise confirming that they are random.
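If randomness is required, even an external party could apply a crude statistical check to collected trace-ids. A hypothetical sketch (the tolerance is illustrative; a real audit would use a proper test suite such as NIST SP 800-22):

```python
# Hypothetical external audit: a crude monobit check over collected
# trace-ids. This only flags grossly non-random identifiers.

def monobit_ok(trace_ids: list[str], tolerance: float = 0.01) -> bool:
    """True if the overall fraction of 1 bits is close to 0.5."""
    ones = total = 0
    for tid in trace_ids:
        # Re-pad to the full bit width so leading zero bits count.
        bits = bin(int(tid, 16))[2:].zfill(len(tid) * 4)
        ones += bits.count("1")
        total += len(bits)
    return abs(ones / total - 0.5) <= tolerance
```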

This incorrect and unhelpful statement remains:

Note that these privacy concerns of the traceparent field are theoretical rather than practical.

These inconsistent statements remain:

Vendors extremely sensitive to personal information exposure MAY implement selective removal of values corresponding to the unknown keys. Vendors SHOULD NOT mutate the tracestate field, as it defeats the purpose of allowing multiple tracing systems to collaborate.

May vendors mutate the field or should they not? How should a vendor know that it is "extremely sensitive to personal information"?