open-telemetry/semantic-conventions

Http client and server span default collection behavior for `url.full` and `url.query` attributes

vishweshbankwar opened this issue ยท 23 comments

Area(s)

area:url

Is your change request related to a problem? Please describe.

The HTTP client and server spans have url.full and url.query attributes, respectively, tagged as Required. These attributes can contain sensitive information, as mentioned in the spec. Even though the spec states, Sensitive content provided SHOULD be scrubbed when instrumentations can identify it, it isn't possible to identify this in a deterministic way.

https://github.com/open-telemetry/semantic-conventions/blob/main/docs/http/http-spans.md

This is an issue because users can onboard to the instrumentation library without being aware of the data collected by the library, resulting in a leak of sensitive information.

Describe the solution you'd like

The attributes that have the possibility to contain sensitive information are not collected by default. Changing this now will be breaking change, an alternate is to have a consistent opt-out option defined in the specification to opt out of the default behavior.

Describe alternatives you've considered

As a mitigation, users can provide processors to scrub information. However this is likely to be a reactive thing, we should have a safe-by-default stance instead.

Additional context

Related issue: #128

There is nothing officially documented, but I believe this #128 (comment) and #128 (comment) summarize the approach we follow now:

  • it's impossible to prevent PII/sensitive data collection on the instrumentation level without compromising observability usefulness
  • it cannot be an instrumentation author concern to scrub sensitive data and be aware of GDPR and other potentially relevant concerns.
  • we're missing:
    • a way to mark some attributes as ones that's more likely to contain sensitive/pii data
    • central SDK/collector component that'd do scrubbing to comply with GDPR and or other requirements relevant to a certain application

None of this is specific to url.full/url.query and if we stopped collecting those, it'd be more than just a breaking change - it'd render http client tracing useless.

Might be worthy to have dedicated columns for each attribute indicating the potential concerns from privacy/security perspective. For example:

attribute name privacy notes security notes
client.address this field could be a privacy concern especially when the client.address is coming from the end-user / consumer there is no known security concern
url.full the query string might have end-user identifiable information such as email address or user name the query string might contain credentials such as SAS token, the url might contain user/password

This was discussed today in both the semantic convention and maintainer meetings.

The feedback from these meetings was:

  • It seems better to treat the url query string similar to database statements ("SHOULD be collected by default only if there is sanitization that excludes sensitive information")
  • As this would be considered a security fix, it may be allowed as a breaking change

Two potential options stand out if we want to mention a specific sanitization format:

  • Option A:
    • http.query - ?REDACTED
    • http.full - https://example.com/path?REDACTED
  • Option B:
    • http.query - ?abc=REDACTED&xyz=REDACTED
    • http.full - https://example.com/path?abc=REDACTED&xyz=REDACTED

I will bring this discussion to the Specification meeting tomorrow in order to gather more community feedback, but please feel free to also post any concerns here.

As this would be considered a security fix, it may be allowed as a breaking change

I looked through the semver spec and wasn't able to find verbiage that supports this. I may have just missed it, but in the interest of clarity can you please point me to the wording? While I agree with the end result, I think we should be clear on if this is something semver explicitly allows, or something we're explicitly violating semver for. If it is the latter, the communication of the change is even more important. In particular, we should be clear if this is going to be a policy for OTel moving forward.

SemVer is specific to public API surface. If we try to apply it to semantic conventions then presence of the attribute with specific name and type is a public contract. The value of the attribute is not and it's allowed to change - https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#semantic-conventions-stability

In the maintainer meeting today, I (incorrectly) thought semver allowed breaking changes for security fixes, which is where @dyladanโ€™s comment is coming from.

Luckily though as @lmolkova points out, the OTel semantic conventions definition of stability specifically allows us to change attribute values ๐Ÿ˜…:

Things not listed in the above are not expected to remain stable via semantic convention and are allowed (or expected) to change. A few examples:

  • The values of attributes

In the maintainer meeting today, I (incorrectly) thought semver allowed breaking changes for security fixes, which is where @dyladanโ€™s comment is coming from.

Luckily though as @lmolkova points out, the OTel semantic conventions definition of stability specifically allows us to change attribute values ๐Ÿ˜…:

Things not listed in the above are not expected to remain stable via semantic convention and are allowed (or expected) to change. A few examples:

  • The values of attributes

The OTel .NET instrumentation libraries breaking change is also resolved with this.

(I was also incorrectly assuming security/critical bug fixes are allowed by semver without major v bump!)

we're missing:
central SDK/collector component that'd do scrubbing to comply with GDPR and or other requirements relevant to a certain application

To my knowledge this is not true, the redaction processor was created specifically for this compliancy question

My understanding of query parameters is that, while they can be abused to leak sensitive information, that is considered a mistake. Is there anything in the w3c that defines query parameter as a sensitive field?

I would much rather redaction be an opt-in mode rather than the default. An opt-in option give users who want the extra level of security an option while not breaking end users who are in control of their instrumentation.

My understanding of query parameters is that, while they can be abused to leak sensitive information, that is considered a mistake.

Query params are commonly used for authentication.

E.g. it's supported by AWS S3, Azure Blobs, Google Storage

That feels like an edge case rather than the norm. While auth parameters can be used in query param, I have not heard that it is a norm. Those uses cases feel like a reason to make redaction an option, but not the standard.

Any instrumentation for those uses cases could make a decision to opt-in to redaction by default. But I don't think every instrumentation library that records these http semantic conventions should be forced to redact by default via the semantic convention.

I don't know that I'd agree that it's an edge case, but I would suggest that out of the universe of people using OpenTelemetry there are more cases of people who aren't passing in auth tokens (or, perhaps, who desire that functionality and would like those values to be unredacted) vs. people for whom it would be a surprise. I also feel like there are many existing solutions here in the ecosystem, e.g., redaction processor.

Given that changing the default would be an explicit and implicit change in behavior that modifies years of precedent, I would be far more supportive of a new opt-in behavior to redact all values; Preferably, I would suggest that the spec requires all semconv attributes which contain potentially sensitive data to offer customizable redaction by default.

I appreciate that this can be a source of leaking potentially sensitive information. Do we have some data or some other kind of evidence suggesting this is:

  1. A prominent source of PII or other sensitive data being leaked in practice
  2. The source of one or many incidents where people clearly expected this telemetry to have data masked by default

Right now, the justification for masking things because they could opt people into capture PII or sensitive information could extend to so many other things in the spec.

For example, I talked with a customer last year who was concerned about how some of their URLs contain PII and they're captured automatically. They clearly got value out of capturing these values generally, but weren't aware of a way to redact them. The redactionprocessor did what they needed on the tee, but they hadn't adopted the OTel Collector yet. They finally did, and it's not a problem now!

This basically met their expectations: they know there's inherent value in capturing this information, but they just need to know how to tweak controls so that things don't leave their network.

I would propose that this should be the guiding principle about capturing values which may have PII or sensitive data in them by default. And if there are instrumentations that, by default, include sensitive data that they really ought not to, then we should fix those instrumentations.

The cost of leak can be extremely high even though they are rare.

One of the common use-cases is to drop instrumentation into production app (i.e. enable OTel using cloud provider CLI/UX for a running app). For this case, I believe safe defaults are more important than precise telemetry.

In the same way that users can discover redaction processor (which requires them to run collector), they can discover that some details are missing and enable them.

But I think that the distro should be able to change instrumentation behavior. This allows instrumentations to play safe and distro to provide the right configuration story and defaults specific to the vendor.

But I think that the distro should be able to change instrumentation behavior. This allows instrumentations to play safe and distro to provide the right configuration story and defaults specific to the vendor.

Respectfully, I don't feel like a solution that requires everyone to create a distro because the reference implementation does unexpected things (or things that makes the reference implementation less useful) is a good one. That strikes me as a great way to actually blunt the adoption of OpenTelemetry and reduce the amount of contributor-hours we get from vendors, since they'll need to instead use that engineering time to create and maintain their own distribution.

I want to better understand the valid use cases that are being used as justification for this change. Putting a password in a query string would be considered a mistake, and wouldn't, in my opinion, justify the redaction by default - you could easily write a password to a log body and we wouldn't consider redacting those.

What about the authentication methods mentioned above (AWS S3, Azure Blobs, Google Storage) makes it safe to use query string for authentication but not to record the values? If it is because the tokens can expire, doesn't that make them safe to record? I am genuinely ignorant about these authentication methods and they are in conflict with my current mental model of URL security best practices.

I also want to challenge the notion that this is not a breaking change as defined by the Spec stability conventions. The Spec states:

Semantic Conventions defines breaking changes as those that would break the common usage of tooling written against the telemetry it produces. That is, the portions of telemetry where specialized tooling (alerts, dashboards, e.g.) interact are expected to remain stable for that tooling after schema transformations are applied. These also assume no user interventions in the default configuration, e.g. Samplers, Views, etc.

This change as it is being proposed will certainly impact specialized tool and will need user interventions. I see that the spec also states:

Things not listed in the above are not expected to remain stable via semantic convention and are allowed (or expected) to change. A few examples:

  • The values of attributes

and I guess I want to challenge the application of that rule to this situation. My interpretation of that statement is that the semantic convention doesn't guarantee any specific values for attributes that are not specific list, i.e., we don't promise that url.full will always have a specific value, because how could we guarantee that - there are infinite urls. But in this situation we are breaking the value by means of breaking the way we record it and the way the user interacts with it. We would now be modifying the recorded value for url.full or url.query and the way a user interacts with those fields changes because of this proposed change and that feels breaking to me, especially bc they are already marked stable.

Finally, if this change does get merged as is, we MUST evangelize this change broadly. The average user isn't going to care that the spec could be interpreted to allow this change as non-breaking. All they are gonna see is failed queries, missed triggers, and broken dashboards after an instrumentation upgrade. That is a horrible user experience. Adding something to the semantic convention release notes is not sufficient, we must enforce a rule that instrumentation release notes/change logs accurately reflect this change.

Is there a provision for restoring the previous behavior, for all the active users who are depending upon the current standard values?

Is there a provision for restoring the previous behavior, for all the active users who are depending upon the current standard values?

having a consistent way for users to select their desired behavior seems important whether we change the default behavior or not

there's a couple of related comments also on the PR:

Finally, if this change does get merged as is, we MUST evangelize this change broadly.

+100

The average user isn't going to care that the spec could be interpreted to allow this change as non-breaking. All they are gonna see is failed queries, missed triggers, and broken dashboards after an instrumentation upgrade. That is a horrible user experience.

That's exactly what I experienced when we upgraded the AspNetCore instrumentation library from 1.8.0 to 1.8.1 ๐Ÿ™ƒ

Two potential options stand out if we want to mention a specific sanitization format:
* http.full - https://example.com/path?REDACTED

Unfortunately, it's also possible to have https://example.com/sensitive/info/in/path (OData, IDs like emails).