zalando/restful-api-guidelines

Add custom property for data classification.

tkrop opened this issue · 6 comments

tkrop commented

As far as we can see from #613, it makes sense to add a custom attribute for data classification on schema properties and parameters (maybe also on schemas, paths, or the whole API as an explicit default) to support reasoning about the sensitivity and protection of data for logging, tracing, and monitoring. This would allow us to build infrastructure and libraries that automatically filter or protect sensitive data based on API specifications.

Our data classification follows a standard schema consisting of four colors: green for public, yellow for internal/general, orange for confidential, and red for restricted/highly confidential.

Since API specs define a general level of protection on each endpoint by authorization, there is already a general classification of endpoints and responses according to the following schema:

  1. No authentication and authorization maps to public (default green),
  2. Authenticated users map to internal (default yellow), and
  3. Authorized users map to confidential (default orange/red).

So from an endpoint perspective it would be enough to add a critical classification for endpoints exposing critical data requiring additional access restrictions. We can use this endpoint classification to derive an upper bound for the data classification of schema properties and parameters, by saying that data is classified at the same level as the endpoint or below.
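As a rough illustration of this derivation, the following Python sketch maps an endpoint's access level to a default classification and checks that a property does not exceed it. The names `DataClass`, `ENDPOINT_DEFAULTS`, `endpoint_default`, and `within_endpoint_bound` are hypothetical, and mapping authorized users to orange rather than red is an assumption made only for the example.

```python
from enum import IntEnum


class DataClass(IntEnum):
    """Four-color classification, ordered by increasing sensitivity."""
    PUBLIC = 0        # green
    INTERNAL = 1      # yellow
    CONFIDENTIAL = 2  # orange
    RESTRICTED = 3    # red


# Hypothetical access-level keys for the three cases listed above.
ENDPOINT_DEFAULTS = {
    "none": DataClass.PUBLIC,
    "authenticated": DataClass.INTERNAL,
    "authorized": DataClass.CONFIDENTIAL,  # assumption: orange rather than red
}


def endpoint_default(access: str) -> DataClass:
    """Return the default classification implied by the endpoint's access level."""
    return ENDPOINT_DEFAULTS[access]


def within_endpoint_bound(prop: DataClass, access: str) -> bool:
    """True if the property is classified at the endpoint's level or below."""
    return prop <= endpoint_default(access)
```

A linter could use such a check to warn when a property is tagged above what the endpoint's authorization implies.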

However, for logging, tracing, and monitoring it would be better to know exactly which data is above the supposed classification level, e.g. authenticated, to filter (or split) the provided data. To enable this, we would need to assign a data classification directly to each schema property and parameter. The available classification levels for this will depend on the required filtering:

  • If log files are classified as confidential, we only need to flag restricted properties and parameters. A simple solution could use the tag x-sensitive: <true|false>.
  • If log files are supposed to be classified as internal, we need to flag confidential and restricted properties and parameters.

A general tag pattern could look like x-data-class: <public|internal|confidential|restricted>, where internal could be used as default, if no tag is provided.
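To make the idea concrete, here is a minimal Python sketch of how a logging library might use such a tag. It assumes the schema is available as a plain dictionary with flat `properties`, and that fields missing from the schema fall back to the default. The function names and the `***` mask are hypothetical and only illustrate the filtering rule.

```python
LEVELS = ["public", "internal", "confidential", "restricted"]
DEFAULT_CLASS = "internal"  # default proposed above when no tag is provided


def data_class(schema: dict) -> str:
    """Read the x-data-class tag of a property schema, falling back to the default."""
    return schema.get("x-data-class", DEFAULT_CLASS)


def redact_for_log(payload: dict, schema: dict, log_class: str) -> dict:
    """Return a copy of payload with values classified above log_class masked."""
    limit = LEVELS.index(log_class)
    props = schema.get("properties", {})
    result = {}
    for name, value in payload.items():
        prop_schema = props.get(name, {})  # assumption: unknown fields use the default
        if LEVELS.index(data_class(prop_schema)) > limit:
            result[name] = "***"
        else:
            result[name] = value
    return result


# Usage: logs are classified as internal, so the restricted field is masked.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string", "x-data-class": "public"},
        "email": {"type": "string", "x-data-class": "restricted"},
        "comment": {"type": "string"},  # untagged, treated as internal
    },
}
logged = redact_for_log(
    {"id": "42", "email": "jane@example.org", "comment": "hi"}, schema, "internal"
)
```

A stricter variant could mask fields that are missing from the schema instead of treating them as internal.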

Thank you @tkrop, this sounds like a great idea! I am very much in support of this; it could significantly help with the issue at hand without degrading the experience for our developers. Please let Katalin and me know how we can help with this, we'd be very keen to support!

The proposal is a security classification provided as part of the API definition (at the level of API schema elements). The approach is analogous to the path we used to follow for data schemas in the data lake, which has since been discarded for various reasons. It is an option, but it should be discussed in the context of an overall security architecture (not a guideline issue) that we align on for the future.

For the time being, I understood from our guild discussion the following:
Our mandatory requirement to be solved: If PII data is logged, access must be restricted to the owning team (with max 30 days retention time). For logging via Scalyr the requirement is fulfilled. For logging via Lightstep and Skipper it seems not to be fulfilled in situations where URLs contain personal data. We should analyse and align on the solution options here and then discuss whether / how a guideline change is helpful. One solution option could be logging with anonymization or deletion of ids used in URLs -- either automatic or manual as drafted in #618.
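As one possible reading of the automatic anonymization option, the Python sketch below replaces sensitive path parameter values with their parameter name before a URL path is logged. It assumes the path template from the API specification is known at logging time. The function name and the choice of placeholder are hypothetical, and this is not meant to describe what #618 drafts or how Skipper or Lightstep behave.

```python
def anonymize_path(template: str, path: str, sensitive: set) -> str:
    """Replace values of sensitive path parameters with their {name} placeholder."""
    t_parts = template.strip("/").split("/")
    p_parts = path.strip("/").split("/")
    if len(t_parts) != len(p_parts):
        raise ValueError("path does not match template")
    out = []
    for t, p in zip(t_parts, p_parts):
        is_param = t.startswith("{") and t.endswith("}")
        if is_param:
            # Keep the value only if the parameter is not marked as sensitive.
            out.append(t if t[1:-1] in sensitive else p)
        elif t == p:
            out.append(p)
        else:
            raise ValueError("path does not match template")
    return "/" + "/".join(out)


# Usage: the customer id is treated as personal data, the order id is not.
safe = anonymize_path(
    "/customers/{customer-id}/orders/{order-id}",
    "/customers/4711/orders/17",
    {"customer-id"},
)
```

Here `safe` is `/customers/{customer-id}/orders/17`, so the log line still identifies the endpoint without exposing the customer id.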

@msooszalando However, also with respect to the URL logging problem, I propose that we start with a security solution narrative (and not with a guideline issue). We are happy to support here.

tkrop commented

I agree that the proposal must be considered within the context of a wider security/privacy architecture.

However, from a simplistic perspective, APIs are the core of such a proposal, since they cover the main paths of data exposure. If we assume that data stores are completely encapsulated by services, the only other path of data exposure is provided by logging/tracing, which is at least partially covered by the API specification. To completely cover all data exposure, we would need to extend the API specification to logging and tracing of calls, which translates to: no logging/tracing without proper specification.

@msooszalando We have touched on this problem a couple of times during the last two years in our meetings without consequences. If you can provide a narrative as the foundation for a decision, this would be great.

Hi All,

Given what's above, I believe the narrative is indeed the right approach. I will align with Katalin to see what we can do here. In the meantime, would #618 be OK to go ahead with (given some alignment), or must that also be part of such a narrative? That seems like a less intrusive solution to a problem that's more prominent. If we can get that issue fixed, I believe this issue, and its associated workload, can be picked up with less urgency. What do you think?

Thanks,

Mate

@msooszalando Thank you for taking care of a narrative. I propose not to start with the 'overall API security architecture' now, but to focus on the Personal Data Logging problem (as discussed and sketched above). In the context of this narrative, we will clarify whether #618 is the right option to follow.

ePaul commented

Waiting for the narrative before we do any guideline changes here.
Please continue any discussions regarding #618 there.