chef/chef-vault

Vault does not scale well with 2000+ nodes

rveznaver opened this issue · 12 comments

Hi,

This will be a long post, so brace yourselves :)
We have noticed that several scaling issues arise when we share a secret across more than 2000 nodes.

Firstly, the knife vault refresh takes ages and uses a lot of memory (since the searches return whole node objects). In addition, the search queries strain the SOLR engine on the Chef server to the point where we had to tune it to the max because it was running out of memory. This issue has already been somewhat addressed by #177 and #178. The latter may be improved by setting rows: 0 when we are only interested in the number of results (e.g. numresults = query.search(:node, "name:#{nodename}", filter_result: { name: ['name'] }, rows: 0)[2]).
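For what it's worth, a minimal sketch of that rows: 0 counting idea (assuming the Chef::Search::Query API; the method name is made up) could look like this:

```ruby
require 'chef/search/query'

# Count matching nodes without transferring any node objects: rows: 0 asks
# the server for zero result rows, and the third element of the returned
# [rows, start, total] triple is the total match count.
def matching_node_count(nodename)
  query = Chef::Search::Query.new
  query.search(:node, "name:#{nodename}",
               filter_result: { 'name' => ['name'] },
               rows: 0)[2]
end
```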

Secondly, because of the way vault uses data bags, once a secret is shared, the _keys data bag item contains an encrypted entry for each node. This results in very large data bag items that are transferred whole over the network even though a node requires only a single entry (the symmetric key encrypted with its own public key). For example, a 4K secret produces a 3.3M _keys item when shared across ~7500 nodes (sizes approximated using knife data bag show databag item_keys -Fjson). As one can imagine, several such secrets can saturate the network on the Chef server.
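For anyone wanting to reproduce that measurement from Ruby rather than knife, here is a rough sketch (the data bag and item names are placeholders, not from our setup):

```ruby
require 'chef/data_bag_item'

# Rough size check for a vault's _keys item; 'secrets'/'mysecret_keys' are
# placeholder names. Shows how many per-node entries exist and how many
# bytes every client pulls down just to read one of them.
keys_item = Chef::DataBagItem.load('secrets', 'mysecret_keys')
entries   = keys_item.raw_data.size
bytes     = keys_item.to_json.bytesize
puts format('%d entries, %.1f KiB on the wire', entries, bytes / 1024.0)
```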

These are the solutions I have thought about so far:

  1. Data bag item per node
    We could split the _keys item per node in two ways, either:
    • one key item per node containing all encrypted symmetric keys
      This would enable us to get all keys for a given node with a single query to the Chef Server (bear in mind I'm talking about keys, not secrets; those would remain unchanged and would take a couple more queries). However, to fully optimise this solution, the Chef client would have to implement a caching mechanism whereby it fetches the key item at the beginning of a run and hits the cache unless it cannot decrypt the secret (a known race when a node tries to decrypt before the keys item is refreshed) or it wants to create/update a secret.
    • multiple key items per node containing a single encrypted symmetric key
      This would be simpler to implement; however, it would create a lot of data bag items and generate more queries to the Chef Server. That said, given that the items would be requested directly (i.e. not via wildcard searches), I do not think it would put much strain on the SOLR engine. I am less sure about the sheer number of data bag items (one secret would create 7500 + 1 items in our case). See the sketch after this list.
  2. Get rid of _keys items and add features to the Chef Server
    Instead of creating key items, we could encrypt/decrypt the secret with the Chef Server's private/public key pair and rely on ACLs for authorisation and HTTPS for encryption. A given node would request an encrypted data bag item; the Chef Server would authorise it based on ACLs, decrypt the secret, and respond with the decrypted secret. The response would still be encrypted in transit since the Chef client connects over HTTPS, but the Chef Server would have to decrypt the secret on each request. This would simplify the vault implementation and completely solve the aforementioned slow refresh issue; however, it would require additional work on the Chef Server side and would effectively allow the Chef Server to decrypt all secrets (which is currently not the case). I am not certain whether this could be avoided.
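To make option 1(b) more concrete, here is an illustrative sketch of the write side; the item naming convention and the encrypt_for helper are hypothetical, not chef-vault API:

```ruby
require 'chef/data_bag_item'

# Option 1(b) sketch: one data bag item per (secret, node) pair holding only
# that node's encrypted copy of the shared symmetric key. The encrypt_for
# helper is hypothetical; real code would also have to sanitise node names,
# since data bag item ids cannot contain dots.
def write_per_node_keys(data_bag, secret_name, shared_key, client_public_keys)
  client_public_keys.each do |node_name, public_key|
    item = Chef::DataBagItem.new
    item.data_bag(data_bag)
    item.raw_data = {
      'id'      => "#{secret_name}_key_#{node_name}",
      node_name => encrypt_for(public_key, shared_key), # hypothetical helper
    }
    item.save
  end
end
```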

Questions, comments, and feedback are more than welcome!

Pinging @thommay as we have already discussed this issue at the Chef Community Summit in London

cc @stevendanna since he was part of this conversation in London

So, there is another limitation I've encountered at a customer site.

The default JSON settings on the Chef Server side limit request bodies to around 1.5 MB, so their vaults could not process even delete commands. They had to run knife data bag delete to get rid of old clients.
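If I remember the knob correctly, this is erchef's request size limit and can be raised in chef-server.rb; treat the attribute name and value below as assumptions to verify against your server version:

```ruby
# /etc/opscode/chef-server.rb (assumption: erchef's request size limit is the
# setting being hit; check the exact attribute name for your Chef Server
# version, then run chef-server-ctl reconfigure)
opscode_erchef['max_request_size'] = 3_000_000 # bytes
```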

I think there's no reason not to do 1(b) right now, even though we might want to contemplate doing something else later. My preference would be that for small numbers of clients - say fewer than 250 - we keep the same design as currently, but for any bags larger than that we go to one item per client - foo_key_nodename - and foo_keys to store the metadata. On the client, we'd simply try to fetch foo_key_nodename first and fall back to foo_keys if that was not successful, as sketched below.
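A minimal sketch of that client-side lookup (the vault and item names are placeholders; this is not actual chef-vault code):

```ruby
require 'chef/data_bag_item'

# Try the per-client key item first, then fall back to the shared _keys item.
# 'foo' stands in for the vault name; a missing item raises
# Net::HTTPClientException (Net::HTTPServerException on older Chef clients).
def fetch_key_entry(data_bag, vault, node_name)
  Chef::DataBagItem.load(data_bag, "#{vault}_key_#{node_name}")
rescue Net::HTTPClientException
  Chef::DataBagItem.load(data_bag, "#{vault}_keys")
end
```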

If we turned on request pipelining for data bag items in Chef, falling back would be very cheap; it would presumably also give a fairly respectable speed increase for most knife vault operations.

So with the merge of the above PRs (#246 and #252), is there anything else we need to do here?

Test a couple of secrets with a large number of nodes to see if it scales properly.

btm commented

Did anyone do that testing?

btm commented

I'm closing this as done. If there are remaining edge cases and you end up here, please speak up.

btm commented

I believe this was fixed by sparse mode and released in Chef Vault 3.1, which is included in ChefDK 2.

Any plans to support converting vault items in "default" mode into "sparse" mode?