Proposal: Allow SP to generate secrets at CreateVolume and ControllerPublish time
bswartz opened this issue · 4 comments
In the CSI spec today, all secrets are stored on the CO side, and sent to the SP side at appropriate times. This covers use cases where the secrets are administrator-created (such as login credentials for the storage device) or user-generated (such as per-volume encryption keys) but there's a third class of use cases where the secret is meant to be used by the node to securely connect to the storage device.
For example, iSCSI CHAP secrets fall into the 3rd category. The node needs them to connect to an iSCSI LUN, and they're sensitive information, because if an attacker obtains the CHAP secrets, they could use them maliciously. Typically, neither the user nor the administrator is interested in knowing such secrets -- they just need to be securely arranged between the storage device and the CO node.
Given the existing design of CSI, the options for supporting this third use case are all suboptimal. An SP may:
- Require that the CO provide all such secrets. This creates a burden on administrators or users to generate secrets and store them within the CO, so that they may be passed to the SP using the existing secrets mechanism. SPs can implement hacks to do this work directly on the CO, breaking the design of CSI.
- Generate the secrets on the SP side and provide them back to the CO in the volume or publish context. In this case the secrets are transmitted in plaintext, logged in the sidecars, and also stored in plaintext. This is simply not secure.
- Like the previous option, but encrypt/decrypt the secrets on the SP side. This prevents the plaintext secrets from being observed by attackers, but still results in encrypted data getting logged by sidecars.
It would be better to modify the CSI spec to allow SPs to return both secret and non-secret context information for volumes at CreateVolume and ControllerPublish times. We could mark the additional returned string map as "secret" thus preventing sidecars from logging it, and allowing the CO to store such information securely (as securely as any other secret required for correct SP operation).
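To make the shape of this proposal concrete, here is a minimal sketch in Python. It is not the CSI API: the `CreateVolumeResponse` class, the `secret_volume_context` field name, and the `loggable` helper are all hypothetical, and stand in for a response that carries a second, secret string map which a sidecar masks before logging.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CreateVolumeResponse:
    """Hypothetical response shape; not the real CSI message."""
    volume_id: str
    # Non-secret context, safe to log (this mirrors today's volume context).
    volume_context: Dict[str, str] = field(default_factory=dict)
    # Assumed new field: values here must never be logged in cleartext.
    secret_volume_context: Dict[str, str] = field(default_factory=dict)


def loggable(resp: CreateVolumeResponse) -> dict:
    """Return a view of the response that a sidecar could log.

    Keys of the secret map are kept so the log stays useful for debugging;
    every secret value is replaced by a fixed mask.
    """
    return {
        "volume_id": resp.volume_id,
        "volume_context": dict(resp.volume_context),
        "secret_volume_context": {k: "***" for k in resp.secret_volume_context},
    }


resp = CreateVolumeResponse(
    volume_id="vol-1",
    volume_context={"target_portal": "10.0.0.5:3260"},
    secret_volume_context={"chap_secret": "s3cr3t"},
)
print(loggable(resp))
```

How the CO then stores the secret map is deliberately left out of the sketch; the proposal only asks that it be handled like any other secret the CO already holds.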
As stated in the writeup of this ticket, there's already a mitigation for this case: the plugin encrypts data before passing it back in the volume context, and the CO plays back the same, per-volume context in future calls. To expand upon that case, plugins could, instead, pass back a verifiable, signed token (e.g. JWT, keytab, x509 cert, or something similar that is, perhaps, not time-limited) in the volume context, and the CO replays that to the plugin - at which point the plugin validates the token and exchanges that for a secret that it needed in order to interact securely w/ a backend system. There are pros/cons of each approach. An important bit here is that the CO is permitted to safely log volume context, and so care should be taken to avoid leaking anything too sensitive here.
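A rough sketch of the signed-token variant, using only the Python standard library. Everything named here is an assumption for illustration: `SIGNING_KEY` is a key known only to the plugin, `BACKEND_SECRETS` stands in for whatever backend lookup the plugin would really perform, and the token format is a bare HMAC-SHA256 over a JSON payload rather than a real JWT.

```python
import base64
import hashlib
import hmac
import json

# Assumption: a signing key held only by the plugin, never by the CO.
SIGNING_KEY = b"plugin-private-signing-key"
# Stand-in for a lookup against the storage backend.
BACKEND_SECRETS = {"vol-1": "chap-secret-for-vol-1"}


def issue_token(volume_id: str) -> str:
    """Build a token that is safe to place in (and log from) volume context."""
    payload = json.dumps({"volume_id": volume_id}, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(sig).decode()
    )


def redeem_token(token: str) -> str:
    """Validate a replayed token and exchange it for the real secret."""
    payload_b64, sig_b64 = token.split(".")
    payload = base64.urlsafe_b64decode(payload_b64)
    sig = base64.urlsafe_b64decode(sig_b64)
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(sig, expected):
        raise ValueError("invalid token signature")
    return BACKEND_SECRETS[json.loads(payload)["volume_id"]]


token = issue_token("vol-1")          # returned to the CO in volume context
secret = redeem_token(token)          # done by the plugin when the CO replays it
```

The token proves which volume it was issued for but carries no secret material, so logging it reveals only the volume identity. The sketch has no expiry, which matches the "perhaps not time-limited" caveat above.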
Since there's already a way to mitigate this from a plugin perspective, it does not seem mandatory to solve for this particular use case via CSI spec changes. If any change is needed, it seems that the minimal requirement is that a plugin can describe to the CO "maybe don't log some subset of information that i'm handing to you". Then again, if it's not safe to log .. is it really safe to persist to disk in cleartext? We could craft a "classification" scheme which differentiates among several sensitivities, but that seems a bit over-engineered? Any "must be persisted securely" classification brings us back to the original motivation of this ticket.
ALL THAT SAID...
Let's boil down the scenarios being mentioned here:
A. (today) CO passes secrets to plugin in RPC requests
B. (proposed) plugin passes secrets to CO in RPC results; CO replays those secrets in future RPC requests
To be clear, the intended scope of (A) is not "static" secrets (which are generally configured by the thing that executes the plugin binary and passes this information via envvar or flag), but "dynamic" secrets that can change from call-to-call. Passing dynamic secrets from CO to plugin does not dictate where this secret information is initially configured and/or stored. There's an implicit assumption that either an administrator or user knows enough about the volume provisioning process to wire things up so that the CO has access to the right secret at runtime, particularly at the point in time at which the RPC is executed. There is no implicit assumption or explicit requirement that the CO is involved in the storage of such secret information. A lowest common denominator CO simply acts as a pass-through system for secrets that are persisted elsewhere. COs are never expected to safely store sensitive information for any amount of time, under any circumstance. The spec does say that sensitive information passed through RPCs like this should be treated as such, and so LCD CO implementations are responsible for the safe handling of it, in transit.
Some consequences of (A):
- secret lifetime is limited to the scope of an RPC call. The next RPC call can send different secrets, or not. CSI doesn't know or care, outside of the requirement that requests are idempotent.
- CO is never asked to participate in the lifecycle management of sensitive information.
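As an illustration of (A), a lowest-common-denominator CO can be sketched as a pure pass-through. The names below (`PassThroughCO`, `fake_plugin_create_volume`, the `admin_password` key) are made up for this example; the point is only that the secret is fetched per call and never kept on the CO object.

```python
from typing import Callable, Dict


def fake_plugin_create_volume(name: str, secrets: Dict[str, str]) -> Dict[str, str]:
    """Stand-in for a plugin RPC; uses the secret only during this call."""
    if secrets.get("admin_password") != "hunter2":
        raise PermissionError("bad credentials")
    return {"volume_id": f"vol-{name}"}


class PassThroughCO:
    """Hypothetical CO that relays secrets without persisting them."""

    def __init__(self, fetch_secrets: Callable[[], Dict[str, str]]):
        # Only the fetcher is stored; the secrets live in an external system.
        self._fetch_secrets = fetch_secrets

    def create_volume(self, name: str) -> Dict[str, str]:
        secrets = self._fetch_secrets()  # fetched fresh for every RPC
        return fake_plugin_create_volume(name, secrets)


co = PassThroughCO(lambda: {"admin_password": "hunter2"})
result = co.create_volume("a")
```

Because the fetcher runs on every call, the external store is free to hand out different secrets from one RPC to the next, which is the first consequence listed above.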
It's somewhat orthogonal, but worth noting that the CSI spec plays no role in actually securing data in transit. The spec recommends the use of a UNIX socket, but in practice gRPC calls may also be made over TCP sockets and CSI does not prescribe a solution for securing such communication. It's assumed (probably generally, but at least by myself) that uses of TCP w/ CSI gRPC probably involve a TLS-secured channel, and that such approaches are "good enough" for securing information in transit.
Somewhat more directly related is that CSI APIs are idempotent. While different perspectives exist on the philosophy of idempotent operations, a common ground seems to be that the same API call made repeatedly, with the same request parameters, should not result in a distinctive state change of the target system: once the first call has succeeded, repeated calls have no side effects. This seems possibly important w/ respect to replay attacks, and secret lifetime, if we begin to ask a CO to participate more actively in secret lifecycle management (B).
The scope of (B) seems considerably more involved than that of (A).
- idempotent RPCs seem misaligned with the need to periodically rotate secrets (which is a best practice AFAICT); e.g. if issuing the same CreateVolume RPC could result in secret rotation (changing the response, and backend state) then it's no longer idempotent. if the CO happened to issue two calls in parallel w/ the same request params (not sure we allow this, but it could happen), then the order in which they return could determine which (distinct) secret is observed by the CO: this is not desirable.
- if a secret expires (on the backend) and another CreateVolume call is issued before renewal takes place, the call may fail even though the request parameters are identical to what the CO may have passed previously. the CO has no way to mitigate this other than to repeatedly retry the failed request. if the secret is rotated (to resolve expiry) then... see previous point.
- CO becomes responsible for coordinating persistence of secret information. not all COs can do this (e.g. Mesos can inject secrets into a call path, but has no mechanism by which to store them for plugins that need/want this).
- such an API nudges the spec a bit closer into treating the CO as a datastore (and, a secret datastore at that), something which we have been very very reluctant to do in the past. CO replay of fixed, non-sensitive volume context per-volume was where the line was drawn previously.
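The tension between idempotency and rotation in the first bullet can be shown with two toy plugins. Both classes are hypothetical and only model the returned secret: `StablePlugin` returns the same secret for a repeated request, while `RotatingPlugin` rotates on every call and so gives different answers to identical requests.

```python
import secrets
from typing import Dict


class StablePlugin:
    """Idempotent: a repeated CreateVolume returns the original secret."""

    def __init__(self) -> None:
        self._secrets: Dict[str, str] = {}

    def create_volume(self, name: str) -> Dict[str, str]:
        if name not in self._secrets:
            self._secrets[name] = secrets.token_hex(16)
        return {"volume_id": name, "secret": self._secrets[name]}


class RotatingPlugin:
    """Not idempotent: every call changes backend state and the response."""

    def __init__(self) -> None:
        self._secrets: Dict[str, str] = {}

    def create_volume(self, name: str) -> Dict[str, str]:
        self._secrets[name] = secrets.token_hex(16)  # rotate unconditionally
        return {"volume_id": name, "secret": self._secrets[name]}


stable, rotating = StablePlugin(), RotatingPlugin()
same = stable.create_volume("v") == stable.create_volume("v")
differs = rotating.create_volume("v") != rotating.create_volume("v")
```

With `RotatingPlugin`, a CO that issued two identical calls would have to guess which response reflects the backend's current state, which is the ordering problem described above.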
In summary:
- There are already mitigations available to plugins by which the problem may be resolved (re: logging sensitive information).
- Not all COs can support persistence of sensitive information; augmenting the spec to allow plugins to demand this of COs raises the bar considerably for COs to the point of eventually excluding them from the ecosystem (in direct contradiction of the spirit of the CSI spec).
- Involving COs in the secret lifecycle seemingly intersects w/ request idempotency in interesting ways, and more thought is needed here if we intend to actually pursue this.
- Involving CSI with secret rotation requirements seems generally "out of bounds" w/ respect to the problem space that CSI intends to address.
👎 on this proposal
I'm sympathetic to the idea that COs should not be treated like a general purpose data store, and that storing secrets is a uniquely complex responsibility. The problem is that somebody has to do this work, and pushing the responsibility away from the CO doesn't make it go away.
SPs currently rely somewhat heavily on the volume context / publish context mechanism of controller/node communication, because there are good architectural reasons to avoid node plugins being able to access the storage device for any purpose other than data access. This means that a communication channel between controller and node is required, but in order to ensure scalability we use the CO to provide this channel rather than asking the SPs to invent their own communication channel.
I feel like the mere existence of the volume/publish context string maps is an admission that it's better for the CO to handle this controller-to-node communication than to require SPs to sort that out themselves. Clearly there was an attempt to balance this burden on the COs by limiting its scope (only 4KiB of data) and direction (only controller to node, with the exception of the NodeIDs). The constraints on this communication channel, however (in this case, lack of security), continually generate incentives for SPs to not use the CO-supplied mechanism and to instead invent their own communication mechanisms.
I'm struggling to know if we've drawn the line in the right place, given that SP authors keep finding it insufficient. I wonder if it was a mistake to even attempt to obviate the need for a SP-managed communication channel between nodes and controllers, and if it would have been better to encourage direct communication from the beginning.
> I'm struggling to know if we've drawn the line in the right place, given that SP authors keep finding it insufficient.
This is an interesting point. There was another, related discussion about CSI introducing a general purpose communication-hub/bus API, and we very intentionally decided that was out of scope: plugins that need anything other than simple cookies are on their own to implement (or leverage, via some other infra component) a more complicated communication channel. This mostly seems to resurface every time a plugin author wants their node plugin to communicate back to the controller plugin.
> The constraints on this communication channel, however (in this case, lack of security), continually generate incentives for SPs to not use the CO-supplied mechanism.
I think it probably depends on the use case here, but I could be wrong. It's been a while since I've dug into various OSS CSI implementations to see how this is being leveraged. One of the challenges is that the spec tries to accommodate KISS plugins alongside those that are much more heavyweight, while also supporting multiple plugin deployment architectures. If we cover 80% of use cases, is that good enough? Are we even hitting that mark?
Back to what's being proposed here: I've had a brief chat w/ core Mesos folks. Takeaways:
- there's nothing about the Mesos arch that prevents support for something like "secret_volume_context" from being implemented, someday (perhaps in terms of a Mesos module that provides a "secret_committer" interface). In terms of current implementation, Mesos doesn't support this and would need significant extension (read: developer time).
- the CSI spec doesn't really account for CO capabilities (or "alpha" CO features), only plugin capabilities. so there's not really a great way (in terms of current spec) for a CO to advertise "i can support persistence of sensitive information". now, we could add a CO capability mechanism and proceed along those lines .. but the fact that we haven't needed it yet triggers warning bells in my mind: maybe we're tackling the core problem the wrong way?
Another thought: given solutions like Vault's "transit" engine (or other things that e.g. SOPS can plug into), I'm wondering why asking plugins to do the work of encrypting sensitive information (which only some plugins need to do) for use within an insecure context is overly burdensome. After all, it seems good enough for gitops use cases.
The way we've handled CO "capabilities" in the past is by adding new optional arguments to RPC calls. Callers that support the new capability assert so by setting the argument to the non-default value. This is a signal to SPs that they can leverage additional return values. The spec would have to make clear this requirement -- that the return value is ignored unless an input parameter has a particular value.
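A minimal sketch of that opt-in pattern, with made-up names throughout: `co_supports_secret_context` is a hypothetical request field (not in the spec), its default of `False` models an older CO that never sets it, and the hard-coded CHAP value stands in for a secret the SP would really generate.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CreateVolumeRequest:
    name: str
    # Hypothetical optional argument. The default (False) is what an older
    # CO that knows nothing about the feature would implicitly send.
    co_supports_secret_context: bool = False


def create_volume(req: CreateVolumeRequest) -> Dict[str, object]:
    """Toy SP handler: only returns the extra map when the CO opted in."""
    resp: Dict[str, object] = {
        "volume_id": f"vol-{req.name}",
        "volume_context": {"target_portal": "10.0.0.5:3260"},
    }
    if req.co_supports_secret_context:
        # Placeholder value; a real SP would generate this per volume.
        resp["secret_volume_context"] = {"chap_secret": "generated-by-sp"}
    return resp


old_co = create_volume(CreateVolumeRequest(name="a"))
new_co = create_volume(CreateVolumeRequest(name="a", co_supports_secret_context=True))
```

An SP that cannot work without the secret map would presumably have to fail the call when the flag is unset, or fall back to one of the existing workarounds; the sketch does not pick one.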
I'm with you on the fact that it's very hard to make a one-size-fits-all solution at the CSI spec level, and that it makes more sense to aim for the 80% case. Maybe where we could do some useful work would be to spell out what we think the limits of the architecture are, and give some guidance to the 20% on what they should consider instead when they run into the limits of what the spec allows.
The example you raised 3 comments above, about secret rotation, is an excellent example of where the spec doesn't offer the kinds of tools one really needs, and it would be helpful to spell out how SPs are expected to tackle those kinds of problems. I'd like to have a library of proofs-by-example that these problems are in fact solvable without changing the spec, so that we can point developers there first when they complain about perceived deficiencies in the existing architecture.