open-telemetry/semantic-conventions

Semantic conventions vs GDPR

pellared opened this issue ยท 20 comments

Per https://github.com/open-telemetry/semantic-conventions/blob/9b455310519ec511656f91d1db0e30f5e32acd2a/specification/trace/semantic_conventions/http.md#http-client

url.full is currently Required.

However, the URL can contain sensitive data e.g. personal data (PII) in GDPR terminology (e.g. login, ID).

GDPR adds many rights to the subjects a lot of rights that may be very problematic (e.g. https://www.digitalguardian.com/blog/google-fined-57m-data-protection-watchdog-over-gdpr-violations).

Maybe it should be Recommended similarly to device.id which also has the following notice:

**[1]:** The device identifier MUST only be defined using the values outlined below. This value is not an advertising identifier and MUST NOT be used as such. On iOS (Swift or Objective-C), this value MUST be equal to the [vendor identifier](https://developer.apple.com/documentation/uikit/uidevice/1620059-identifierforvendor). On Android (Java or Kotlin), this value MUST be equal to the Firebase Installation ID or a globally unique UUID which is persisted across sessions in your application. More information can be found [here](https://developer.android.com/training/articles/user-data-ids) on best practices and exact implementation details. Caution should be taken when storing personal data or anything which can identify a user. GDPR and data protection laws may apply, ensure you do your own due diligence.
?

Maybe should add some notice that the URL can contain sensitive/personal data and one may consider to delete it using OTel Collector's attributesprocessor?

Maybe the collection of this attribute should be configurable?

The same concerns apply to db.statement.

PS. I am sorry that the issue has questions than answers.

What I suggest is:

  1. Identify attributes that has probability (which is not very low) to contain PII (Personally Identifiable Information) e.g. url.full, url.path, url.query client.address, client.socket.address, db.statement.
  2. For each of these:
    2.1. Add caution note e.g. The value may store sensitive data (e.g. personal data, personally identifiable information). GDPR and data protection laws may apply, ensure you do your own due diligence.
    2.2. Make sure that it is NOT Required. So that people who do not use collector may not collect them.

We could also consider adding some env var like OTEL_SEMCONV_HARDEN_OPT_IN to make it easier to disable such attributes.

I see you had already contributed in that area in the past: open-telemetry/opentelemetry-specification#1502
Could it be that these considerations got lost in the current rewrite?
CC @lmolkova

@Oberon00 In open-telemetry/opentelemetry-specification#1502 I was mostly concerned about leaking login+password (I find it a lot more critical).

I felt that mentioning GDPR/sensitive data would not bring a lot of value 2 years ago as semantic conventions were in a very early stage. I was afraid that it would cause more confusion and paralysis. Right now I see a lot of contributions in semantic conventions. I think that it is a good moment to start doing something with privacy/GDPR/data protection.

I want to add one point - exception callstacks are commonly considered as privacy data.

Today, I learned that for one of our customer uses client (mobile) instrumentation (Android, iOS). The PII data emitted via HTTP Client instrumentation is very problematic for them. Because the instrumentation works on "end-user" devices therefore it is not easy to get rid of them. Ideally they would prefer to not emit them from the devices at all (via network).

I want to add one point - exception callstacks are commonly considered as privacy data.

I think it usually leaks "internal details" of the instrumented system (most telemetry is causing leakage of some internal details). I do not think it would be easy that they leak "personal data" via exception callstack UNLESS they also contain parameter values. I am not aware of any language/ecosystem which dumps parameter values in exception callstacks.

@Oberon00 In open-telemetry/opentelemetry-specification#1502 I was mostly concerned about leaking login+password (I find it a lot more critical).

One extra point - login/password leak can happen anywhere. I'll give an example from Microsoft Azure - in many places we allow users to put "tags" (very generic, arbitrary strings that can be associated with some entities) which can be used to group or search things. And we know that some users could put sensitive information there (e.g. it could be their emails or even passwords), so we consider all of these privacy data and put lots of efforts on redaction, classification, isolation and access control.

Another typical issue is that people might put something wrong by mistake when they instrument, in a large system we do see developers making mistakes (e.g. putting user email address in an attribute named "ResourceType"). I feel schema cannot solve these problems, a centralized scanning/redaction system can provide consistent/reliable guarantee (which comes with perf cost for sure).

I feel schema cannot solve these problems

For sure it cannot solve it. It is more about Defense in Depth and making the hardening more straightforward and adding protection on more layers.

I really like the idea of env. variable or something similar to do best effort of anonymizing PII data. I think it is crucial to align on the best effort of not sending this data outside of client devices.

We could also consider adding some configuration/feature to the SDK that would allow attribute retraction (e.g. via exporter decorator). This could help user's were retraction using OTel Collector's attributesprocessor is problematic (like here). But without additional hints the user may still have trouble to find telemetry which has "not low" probability to contain PII or other sensitive data.

I had a conversation with @trask, and here is the summary:

Our current focus lies on the HTTP semantic convention as it's planned to be stable and can serve as an example for future semantic conventions.

During our discussion, we identified the following attributes as problematic:

  1. url.full
  2. url.path
  3. url.query

Regarding url.path and url.query, we both agreed that these attributes could be changed from Required to Recommended. We propose adding a note like: The value SHOULD be captured by default. The value may store sensitive data. GDPR and data protection laws may apply, ensure you do your own due diligence. Instrumentations SHOULD offer a way to not capture this attribute.

However, for url.full, we haven't reached a satisfactory solution yet. We discussed two possible options:

A) Apply the same approach as for url.path and url.query. Nevertheless, this would cause the telemetry to currently miss information, such as the URL scheme.

B) Implement an opt-in functionality to scrub/retract the path and query parts from the URL. However, this leads that most of the data is redundant (except for the missing URL scheme).

Personally, I lean towards option A for since I don't consider URL scheme to be critical telemetry. Moreover, we can always add url.scheme to HTTP Client later, for example, as Recommended or even Required. Lastly, adding additional scrub/retraction functionality would be more complex and bug-prone than simply not collecting an attribute.

On the other hand, we are both not sure if such attribute retraction should be at the instrumentation level, we both think that maybe the SDK should offer a more general way to suppress specific attributes. Maybe SpanProcessors (rather impossible) for spans and Views (it only contains "allow-list", no "deny-list") for metrics are the way to go? EDIT: I got a feedback from @MrAlias that doing filtering on higher level would lead the resource consumption overhead that may not be acceptable. Not collecting unwanted attributes on instrumentation level would be more efficient.

I'd like to have a broader OTEL-wide discussion on handling GDPR concerns. Specifically:

  • I'd like to separate the concern of semantic convention instrumentation author recommendations and providing users a solution for o11y in context of GDPR. This means that I think from a semconv perspective we should focus on outling where sensitive data can be written using something like a sensitivity annotation in the YAML.
  • I think we should have SDK/collector features that allow users to customize their GDPR story w/ o11y. There is not a 1-size fits all solution. Just removing the data is the easiest but not necessarily the best. A compliant o11y datastore for GDPR could also work, depending on how/where the data is sent and stored. As such, we should give users flexibility.

I'm going to raise the discussion in the next TC meeting to see if we can get alignment on a direction here and what thoughts exist from others. I suggest this is worth some good brainstorming.

@jsuereth Any update?

I just spotted

In some semantic conventions, the data collected as a span attribute could
contain PII (Personally Identifiable Information). As a general guideline, do
not collect this data by default.

in open-telemetry/opentelemetry.io#3309

trask commented

related, we have had PII data reported a couple of times now in exception.message: open-telemetry/opentelemetry-java-instrumentation#3039

I think, for o11y, it's impossible to NOT collect PII. At this point our approach should be the following:

  • Let's make sure we can apporpriately FLAG things that could be PII with annotations.
  • Otel Users need CHOICE for whether they want PII generated at the instrumentation level to leave that machine. This means SDK features or Collector features to ensure redaction of PII where necessary.
  • We expect users of OpenTelemetry to understand that it is better to provide backends systems compliant with GDPR rather than give up all aspects of o11y that are needed for debugging (e.g. URL). In lieu of this, the CHOICE to redact should be sufficient. Both situations will be provided for via annotations.

The guidance to not generate PII by default is temporary, pending better enable/disable features in the SDK for full user control.

SIG meetings notes.

It appears that the general agreement on how to handle the problem is as follows:

  1. Include a marker for the attributes that might contain personally identifiable information (PII) or sensitive data. This way we clearly can document which attributes may need special care.
  2. Establish the _RETRACTED value, which can be employed by instrumentation libraries to remove attribute values. This way there is still a value for the Required attributes and moreover it is transparent what attributes are retracted.

After some thoughts, I find

are good enough for solving the issue.

It is also inline with my original proposal: #128 (comment)

I wanted to add a paper I found pretty useful for folks evaluating solutions here.

Anonymity Assessment โ€“ A Universal
Tool for Measuring Anonymity of Data
Sets under the GDPR with a Special
Focus on Smart Robotics*

This paper covers the background material, a bit of info in terms of legal qualifications and mathematical principles for evaluating anonymity. It may be a good foundation for starting a taxonomy for semconv.