alan-turing-institute/data-safe-haven

Stack encrypted key does not match project encrypted key - intermittent deployment error

craddm opened this issue · 20 comments

✅ Checklist

  • I have searched open and closed issues for duplicates.
  • This is a problem observed when deploying a Data Safe Haven.
  • I can reproduce this with the latest version.
  • I have read through the documentation.
  • This isn't an open-ended question (open a discussion if it is).

💻 System information

  • Operating System: Debian Bookworm
  • Data Safe Haven version: develop


🚫 Describe the problem

When deploying an SRE, the first deployment attempt sometimes fails with the error `Stack encrypted key does not match project encrypted key`.

On repeating the deploy command, deployment proceeds as normal.

🌳 Log messages

Relevant log messages

[Screenshot: terminal output showing the `Stack encrypted key does not match project encrypted key` error]

♻️ To reproduce

Deploy an SRE

@craddm I think this should have been fixed in cbf0d04. Are you able to reproduce it now?

Yes, just got it again.

When creating a local workspace, it seems to generate a new encrypted key, ignoring the encrypted key that is passed to it. `self.stack_settings` on L128 has the right encrypted key, but once the workspace is created, `self._stack.workspace.stack_settings(self.stack_name).encrypted_key` shows that it has a different key.

```python
self._stack = automation.create_or_select_stack(
    opts=automation.LocalWorkspaceOptions(
        env_vars=self.account.env,
        project_settings=self.project_settings,
        secrets_provider=self.context.pulumi_secrets_provider_url,
        stack_settings={self.stack_name: self.stack_settings},
    ),
    program=self.program,
    project_name=self.project_name,
    stack_name=self.stack_name,
)
self.logger.info(f"Loaded stack [green]{self.stack_name}[/].")
# Ensure encrypted key is stored in the Pulumi configuration
self.update_dsh_pulumi_encrypted_key(self._stack.workspace)
```

```python
def update_dsh_pulumi_encrypted_key(self, workspace: automation.Workspace) -> None:
    """Update encrypted key in the DSHPulumiProject object"""
    stack_key = workspace.stack_settings(stack_name=self.stack_name).encrypted_key
    if not self.pulumi_config.encrypted_key:
        self.pulumi_config.encrypted_key = stack_key
    elif self.pulumi_config.encrypted_key != stack_key:
        msg = "Stack encrypted key does not match project encrypted key"
```

Can you overwrite the `encrypted_key` after the stack is created?

Should be able to. My next question is then: what is this update function meant to be for? In what circumstance should finding mismatched keys raise an error rather than overwrite the key with the correct one?

Logic should be this:

  • if a stack already exists, it should already have an encryption key
  • this key isn't stored in the stack (for obvious reasons) so we store it in the key vault and set it after loading the stack
  • if the stack doesn't already exist, it will generate a new encryption key
  • we need to store this key in the key vault so that it's available in future

From what you wrote above, I'm wondering whether loading the stack is generating a new key and we're not correctly overwriting it with the key we've loaded from the key vault.
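A minimal sketch of that intended flow, with a plain dict standing in for the Azure Key Vault (the real code reads and writes a key vault secret; all names here are illustrative):

```python
# Stand-in for the key vault: maps stack name -> encrypted key.
key_vault: dict[str, str] = {}

def resolve_encrypted_key(stack_name: str, generated_key: str) -> str:
    """Return the encryption key that should be used for this stack."""
    stored = key_vault.get(stack_name)
    if stored is not None:
        # The stack was deployed before: the vault copy is authoritative
        # and must replace whatever key loading the stack generated.
        return stored
    # First deployment: keep the key Pulumi just generated, and persist
    # it so future runs can decrypt this stack's state.
    key_vault[stack_name] = generated_key
    return generated_key
```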

Yes, from what I understand of the documentation, `stack_settings={self.stack_name: self.stack_settings}` is an optional argument to `LocalWorkspace` and can include the encrypted key as one of the settings. That's what we already do, and it just seems to ignore it. So the trick here will be determining whether the stack already exists or not.
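One way to make that determination explicit, sketched against the automation API (an illustration, not the current DSH code): try `select_stack` first and fall back to `create_stack` on `StackNotFoundError`.

```python
from typing import Callable

from pulumi import automation

def load_stack(
    project_name: str,
    stack_name: str,
    program: Callable[[], None],
) -> tuple[automation.Stack, bool]:
    """Return the stack and whether it already existed."""
    try:
        # An existing stack should use the key stored in our config.
        stack = automation.select_stack(
            stack_name=stack_name,
            project_name=project_name,
            program=program,
        )
        return stack, True
    except automation.StackNotFoundError:
        # A new stack gets a freshly generated encryption key, which
        # then needs to be persisted (e.g. to the key vault).
        stack = automation.create_stack(
            stack_name=stack_name,
            project_name=project_name,
            program=program,
        )
        return stack, False
```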

(btw, this creates a new temp directory on every run that never gets deleted, at least on my machine; I had hundreds of vestigial temp directories)
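If that's worth fixing, one approach is to manage the working directory explicitly; a sketch assuming `LocalWorkspaceOptions.work_dir` behaves the same for inline programs (stack and project names are hypothetical):

```python
import tempfile

from pulumi import automation

def pulumi_program() -> None:
    pass  # resource declarations would go here

# Create the workspace inside a directory we own, so it is cleaned up
# when the context manager exits instead of accumulating in /tmp.
with tempfile.TemporaryDirectory() as work_dir:
    stack = automation.create_or_select_stack(
        stack_name="example-sre",          # hypothetical stack name
        project_name="data-safe-haven",    # hypothetical project name
        program=pulumi_program,
        opts=automation.LocalWorkspaceOptions(work_dir=work_dir),
    )
    # ... stack.up() / stack.destroy() would run here ...
```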

OK, so it seems to me like when `create_or_select_stack` creates a new stack, it ignores the `stack_settings` object and generates a new key. When it selects an existing stack, it correctly uses the project key.

I can fix it so that I overwrite the generated key with the project key, but then the `update_dsh_pulumi_encrypted_key` function is superfluous, since the keys will always match: the `LocalWorkspace` will always have the project encrypted key for the stack.

I feel like there is something I'm missing here, but I can't put my finger on it.
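For concreteness, a sketch of that overwrite using the automation API's `save_stack_settings`, following the attribute names in the snippets above (not necessarily the final implementation):

```python
# After create_or_select_stack returns, force the project key onto the
# freshly created local workspace.
settings = self._stack.workspace.stack_settings(self.stack_name)
if self.pulumi_config.encrypted_key:
    # A project key already exists: it is authoritative, so replace
    # whatever key the new workspace generated.
    settings.encrypted_key = self.pulumi_config.encrypted_key
    self._stack.workspace.save_stack_settings(self.stack_name, settings)
else:
    # First deployment: record the freshly generated key instead.
    self.pulumi_config.encrypted_key = settings.encrypted_key
```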

If the key exists in the key vault I think it's always going to be correct to overwrite the generated key with the one from the key vault.

  • if the SRE hasn't been deployed then it doesn't matter which key you use, as there won't be a state file to decrypt
  • if it has been deployed then:
    • either the key from the key vault is correct (in which case you should use it)
    • or your pre-existing local key is correct (how did this not get uploaded to the key vault though?)

> If the key exists in the key vault I think it's always going to be correct to overwrite the generated key with the one from the key vault.
>
>   • if the SRE hasn't been deployed then it doesn't matter which key you use, as there won't be a state file to decrypt
>   • if it has been deployed then:
>     • either the key from the key vault is correct (in which case you should use it)
>     • or your pre-existing local key is correct (how did this not get uploaded to the key vault though?)

I don't think there ever is a pre-existing local key. A new local workspace is created every time we run deploy, teardown, etc. If the stack doesn't exist, it tries to create a new one with a new key. That errors out because the new key doesn't match the project key. When you run deploy again, the stack exists, so this time the new local workspace uses the project key.

Basically I'm just trying to make sure I'm not breaking something here...

I think the only thing that could break is if you have a local workspace that has deployed some resources but crashed before uploading its encryption key to the key vault. In that case, you would be able to load the stack in `create_or_select_stack` but the key wouldn't be stored anywhere else. However, we should be able to solve this by uploading the key as one of the first things that happens after it's generated.
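Something like the following, sketched with the Azure SDK (the secret-name convention and parameters are illustrative, not what DSH actually uses):

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

def store_encrypted_key(vault_url: str, stack_name: str, encrypted_key: str) -> None:
    """Persist the generated key immediately, before any resources are deployed."""
    client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
    # Illustrative secret name; a crash after this point still leaves the
    # key recoverable from the vault.
    client.set_secret(f"pulumi-encrypted-key-{stack_name}", encrypted_key)
```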

I still don't think I understand why this is happening, or how to reproduce this. Has anyone else encountered this?

> OK, so it seems to me like when `create_or_select_stack` creates a new stack, it ignores the `stack_settings` object and generates a new key. When it selects an existing stack, it correctly uses the project key.

So, I don't see how a mismatch is possible. If there isn't a stack to select, a new one is created with a key. If there is one, the key is in our Pulumi config.

OK, some good news: I updated Pulumi to the latest version (from 3.135.0 to 3.138.0) and no longer get this error. I can't find anything in the release notes (or issues) that seems relevant, but this feels like it was possibly a Pulumi bug. As far as I could tell, it was ignoring the `encrypted_key` in the `stack_settings` object passed to `LocalWorkspaceOptions` when creating a stack, but not when selecting one. That now seems to work the way it should.

I do wonder whether we should address the way it creates a new temporary folder every time it loads/creates a stack, and never seems to delete that folder.

> I updated Pulumi to the latest version (from 3.135.0 to 3.138.0) and no longer get this error.

Can/should we update our minimum Pulumi executable requirement?

> I do wonder whether we should address the way it creates a new temporary folder every time it loads/creates a stack, and never seems to delete that folder.

Worth fixing if quick/simple but not high priority.

Yes, let's bump the minimum version 👍

From what you were saying @craddm it did sound like a Pulumi bug. Or at least, something about your system was causing it to create new workspaces backed by local storage instead of using the Azure storage backend.

Sorry, my point was that we don't currently have a way to enforce a minimum pulumi.exe version, just the Python package version.

Oh, we mean the CLI here?

There might be a way to declare it as an external dependency in pyproject.toml. I don't think that would enforce anything though.
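One stopgap would be a runtime check of the CLI itself; a sketch, with the minimum version and error handling as placeholders:

```python
import subprocess
from packaging.version import Version

MIN_PULUMI_CLI = Version("3.138.0")  # placeholder minimum

def check_pulumi_cli_version() -> None:
    # `pulumi version` prints something like "v3.138.0"
    result = subprocess.run(
        ["pulumi", "version"], capture_output=True, text=True, check=True
    )
    installed = Version(result.stdout.strip().lstrip("v"))
    if installed < MIN_PULUMI_CLI:
        raise RuntimeError(
            f"pulumi CLI {installed} found; version >= {MIN_PULUMI_CLI} is required"
        )
```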

> Yes, let's bump the minimum version 👍

> From what you were saying @craddm it did sound like a Pulumi bug. Or at least, something about your system was causing it to create new workspaces backed by local storage instead of using the Azure storage backend.

Every time a stack is loaded or created, it makes a (new) local storage copy of the stack and Pulumi config in a temporary directory. It'll do that on every system (it does it both in my devcontainer and on my Mac), so that in itself is expected behaviour rather than a contributor to this issue.

Yes, I think so, but that is different from which backend is being used.

I think the problem here was that, for some reason, the Azure backend wasn't being used (or at least wasn't loaded correctly).

Closing as we think this was a Pulumi bug.