hashicorp/terraform-provider-tfe

Incorporate mitchellh/multispace resource multispace_run into this provider

brandonc opened this issue · 22 comments

Use-cases

There is a lot of utility in being able to control the lifecycle of a workspace run through this terraform provider.

Bootstrapping

Many resources, like tfe_variable, are dependencies of a successful run, so queue_all_runs=true doesn't always behave as intended: the initial run can be queued before all of those resources exist. For instance, the workspace itself must be created before the workspace variable can be.

With a run resource to control the workspace lifecycle, you could disable queue_all_runs and trigger the initial run after all the workspace dependencies are in place.
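For illustration, a rough sketch of that bootstrapping shape, using a hypothetical run resource along the lines of what's discussed below (the resource name and arguments are placeholders, not an existing API):

resource "tfe_workspace" "app" {
  name         = "app"
  organization = "my-org-name"

  # Let the run resource below decide when the first run happens.
  queue_all_runs = false
}

resource "tfe_variable" "app_region" {
  key          = "region"
  value        = "us-east-1"
  category     = "terraform"
  workspace_id = tfe_workspace.app.id
}

# Hypothetical run resource; naming and arguments follow the proposals below.
resource "tfe_workspace_run" "app_initial" {
  organization = "my-org-name"
  workspace    = tfe_workspace.app.name

  # Queue the initial run only once the variable (and any other
  # workspace dependencies) exist.
  depends_on = [tfe_variable.app_region]
}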

End-of-life

Now that safe delete is in place in the platform, running a destroy plan on the workspace before it is deleted is a convenient way to ensure the safe deletion of workspaces. There is no other workflow to run a destroy plan and destroy the workspace itself in a single step without a lot of glue code.
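As a rough sketch of the shape this could take (using the destroy-block style that comes up later in this thread; purely illustrative, not an existing API):

resource "tfe_workspace_run" "app_lifecycle" {
  organization = "my-org-name"
  workspace    = tfe_workspace.app.name

  destroy {
    # When this resource is destroyed, queue a destroy run on the workspace
    # and wait for it to finish before tfe_workspace.app itself is deleted.
    retry = false
  }
}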

Maintenance

I have suggested the use of the multispace provider enough times to consider incorporating it into the official tfe provider. See #534, #340, and #194 for examples.

Unfortunately, the API has evolved since multispace was released, and there are things like new workspace statuses that the provider does not accommodate. By incorporating this resource, we can keep it up to date and more easily promote it to solve a variety of lifecycle issues.

Back when we were trying to wrap our heads around multispace that first time, I got deeply tripped up by the nonsensicality of managing "a run" as a material resource that can be created/updated/destroyed. But once we got past that, I came around to the utility of representing runs (or run-readiness) as nodes in the graph.

(If anyone else is struggling with that, consider:

  • Don't think of these as a way to manage ongoing runs once the workspace is in a steady state; think of the initial run as being a distinct Thing that's necessary for the combination of the workspace and its permissions to come into a completed state of existence.
  • In other words, "some arbitrary run" definitely isn't a resource, but "the initial run" arguably is, and so is "the final destroy run".
  • Alternate perspective: consider the initial run as serving the same purpose that a Terraform provisioner serves for a virtual machine resource. Though, provisioners wouldn't quite work for this case because of the various dependency orderings.

)

Anyway, I'm in favor. But since we'd be maintaining this indefinitely, I think it's worth taking the opportunity to reconsider some of the design choices from multispace, in the hope of getting something more tightly focused and with fewer edge cases.

Most notably, I want to propose splitting the multispace run into several distinct kinds of run objects, to serve the following distinct purposes:

  • Marking run-readiness in the graph. Or, in other words: "queue_all_runs except it actually works right:"
    • You want to kick off the initial run for the workspace once it's ready.
    • However, the state of "ready" is interdependent with other resources like varset attachments etc., so it can't be represented by the workspace resource itself being fully created and needs to take into account other graph nodes (which might themselves depend on the workspace node).
    • And once the initial run kicks off, it's fine if it then exits the concern of provisioning for that workspace; it's part of normal operation now, once provisioning has push-started it. The provisioning run can end before the new workspace's run ends.
  • Orchestrating sequential runs. The dependency workspace pattern, where you provision k8s and then provision a bunch of crap into k8s.
    • In this case, the workspace provisioning run needs to contain the duration of every run it's orchestrating. I found that to be the most 😬 part of working with multispace, but yeah, sometimes you just honestly need to run some time-consuming stuff in a dependency graph, and terraform does dependency graphs.
    • We probably need this option available, but I think everything else gets easier to work with if you don't have to touch this when you don't actually need sequential dependent runs.
  • Pre-delete destruction.
    • I have a suspicion that we can avoid a buttload of edge cases if we represent this as a totally separate resource, rather than trying to tack it onto the lifecycle of a run-readiness marker.
      • I might be wrong about that.
    • But other than that this basically acts like orchestrated runs, since you need the rest of the destroy plan to pause until it completes.

I guess mostly I have a feeling that a bunch of people are going to want the "kick off the run at the correct time but don't wait for it" path. If that's wrong, then I'm on a bit of a false trail here.

  • tfe_workspace_initial_run
  • tfe_workspace_complete_initial_run
  • tfe_workspace_destroyable_run

idk
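To make that concrete, a sketch of how those three candidate resource types might be wired up (purely illustrative; none of these exist, and the referenced workspaces and variable are assumed to be defined elsewhere):

# Fire-and-forget: queue the first run once the workspace and its
# dependencies exist, without waiting for the apply to finish.
resource "tfe_workspace_initial_run" "app" {
  workspace  = tfe_workspace.app.name
  depends_on = [tfe_variable.app_region]
}

# Blocking variant for the dependency-workspace pattern: the cluster
# run must finish before the apps run is queued.
resource "tfe_workspace_complete_initial_run" "k8s_cluster" {
  workspace = tfe_workspace.k8s_cluster.name
}

resource "tfe_workspace_complete_initial_run" "k8s_apps" {
  workspace  = tfe_workspace.k8s_apps.name
  depends_on = [tfe_workspace_complete_initial_run.k8s_cluster]
}

# Does nothing on create; queues a blocking destroy run when this
# resource itself is destroyed, before the workspace can be deleted.
resource "tfe_workspace_destroyable_run" "app" {
  workspace = tfe_workspace.app.name
}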

Let's do it! I think we'll want to update docs with examples as part of this effort to help users understand how to use it.

I really like the idea of incorporating some of the multispace functionality into the TFE provider :D

As I mentioned on Slack, I have a fork https://registry.terraform.io/providers/lucymhdavies/multispace/latest/docs/resources/run

This adds a few things on top of Mitchell's latest release, namely the wait_for_apply, do_apply, wait_for_destroy, and do_destroy parameters, which are particularly useful for how I use the provider.

These are useful for me, and the docs explain what they do and why. Whether they're generally useful... I don't really know.

From @nfagerlund's comment above...

I guess mostly I have a feeling that a bunch of people are going to want the "kick off the run at the correct time but don't wait for it" path. If that's wrong, then I'm on a bit of a false trail here.

This is indeed how I use it myself, in what I will call a Bootstrap Workspace. I create a workspace, then all the supporting resources (variables, run tasks, notifications, permissions, whatever...), then I kick off a run. In my case... my Bootstrap Workspace doesn't really care if the newly created workspace applies successfully or not, just that a run has been triggered when the workspace was empty. This is kinda the main difference between my version of the Provider and Mitchell's original. This is my wait_for_apply = false parameter.
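Roughly what that looks like with the forked multispace_run (a sketch; the workspace and variable resources are assumed to exist elsewhere in the Bootstrap Workspace's config):

resource "multispace_run" "bootstrap" {
  organization = tfe_organization.example.name
  workspace    = tfe_workspace.app.name

  # Kick off the first run on the new workspace, but don't block the
  # Bootstrap Workspace on the child apply finishing.
  wait_for_apply = false

  # Make sure variables, run tasks, notifications, etc. exist first.
  depends_on = [tfe_variable.app_region]
}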

wait_for_destroy = false in my version exists mostly for completeness. I don't use it myself right now, but it felt like it might be useful at some point.

As for the suggestion of separate resources for the different types of runs, like you suggest in #742 (comment)
I don't personally like this idea, on the basis that you may decide that something is a tfe_workspace_initial_run when you initially create this resource, but later decide it's actually a tfe_workspace_destroyable_run. I know we have moved blocks that can aid in that kind of refactoring... but I don't know if you can use them to convert resource_a into resource_b.

I like the ideas presented here, and would like to first consider shipping an opinionated default and adding options to support alternatives. The worst case scenario is that someone destroys the run resource but doesn't understand that a destroy run is executed. We should make that extremely clear in the config. After all, safe delete means that trying to destroy a workspace that still has resources will fail, and that is a nice backstop for whatever gaps there are in their understanding. I'm still weighing having a separate resource vs something like do_destroy. I think I prefer the latter, as well.

wait_for_apply = false seems helpful but I would prefer the default behavior be "ensure my run succeeded"

Agreed. wait_for_apply = true (and wait_for_destroy = true) is the default in my fork for that reason (I realise my docs don't make that clear) :)

And agreed with the idea of an opinionated default.

The worst case scenario is that someone destroys the run resource but doesn't understand that a destroy run is executed.

That's a good point to consider. If we do have the equivalent of do_destroy = true from my fork, then the docs would need to be VERY clear that this will happen.

The counterpoint is... if do_destroy = false is the default... then a user might expect deleting a workspace to trigger a destroy, and it does not, leaving orphan resources. So it's not an easy decision which works better as a default behaviour 🤔

...you may decide that something is a tfe_workspace_initial_run when you initially create this resource, but you may later decide it's actually a tfe_workspace_destroyable_run.

Ah yeah, good point: pain-of-relocation is worth keeping in mind.

Just to make sure we're discussing the same thing, tho: I'm envisioning that separate destroyable_run I proposed as managing only the destroy run: creating the resource wouldn't do anything (you'd need an initial run resource for that), but it would create a blocking run once you decide to destroy the workspace.

With that out of the way: I definitely prefer your design of a single initial run resource with a wait_for_apply argument! 💖 But I'm still on the fence about whether that resource should also represent a destroy intent.

I'm ๐Ÿ‘ for this, I do have two comments:

  • If the default is to wait for the run to complete, this resource would require the user to intervene at some arbitrary point during the execution of the apply. This to me seems odd and deviates from the regular Terraform UX. We don't have a great way of letting the user know that a run is awaiting confirmation and just from reading apply logs it's unclear what state the run is in. We can also of course allow the user to enable the non-interactive option for executing the run.

  • I really like @nfagerlund's suggestion on being explicit about the intent of the run by using different resources. Bundling the different use cases (do_apply and do_destroy) into one resource will definitely make it easier for users to step on themselves if they aren't cautious. I'd like to prevent users from accidentally queuing a destroy plan and, in my opinion, calling it out in the docs is not a sufficient guard rail.

@sebasslash I do believe that the run is confirmed automatically by multispace_run unless manual_confirm is set to true.

That's right, if the child workspace needs confirmation, then multispace will confirm it and allow TFC to apply it. I'm still leaning towards fire-and-forget as a default behaviour though. i.e. kick off a run, and call it good; let the run complete asynchronously.

As for whether we have a single resource, with do_destroy=true as an option... I've had some thoughts overnight, and I definitely think that it should not be the default. You'd want to explicitly say "yes, I want Terraform to run a destroy on this workspace, if we destroy this resource".

My thinking is... this tfe_workspace_run would most likely be used in combination with a tfe_workspace (and other supporting resources). My concern was that destroying the tfe_workspace without having that workspace do a destroy run first could leave orphaned resources. But given that, by default, destroying a tfe_workspace does a Safe Delete... attempting to destroy the tfe_workspace will fail, keeping those resources intact. This feels like much less of a footgun for users who aren't being careful.

So... yeah, I'm leaning towards do_destroy=false as a default behaviour.

Config option 1: Single resource, separate blocks for apply and destroy. At least one block is required, but both are allowed; omitting a block skips the corresponding apply or destroy behaviour.

resource "tfe_workspace_run" "example_run" {
  organization = tfe_organization.example-multispace.name
  workspace    = tfe_workspace.example.name

  # One or both blocks are required
  apply {
    wait           = false
    manual_confirm = true
    retry          = false
  }

  destroy {
    retry = false
  }

  depends_on = [
    tfe_workspace_run.root
  ]
}

Config option 2: Separate apply/destroy resources:

resource "tfe_workspace_apply" "example_apply" {
  organization = tfe_organization.example-multispace.name
  workspace    = tfe_workspace.example.name

  wait = false

  depends_on = [
    tfe_workspace_run.root
  ]
}

resource "tfe_workspace_destroy" "example_destroy" {
  organization = tfe_organization.example-multispace.name
  workspace    = tfe_workspace.example.name

  manual_confirm = true
  retry = false

  depends_on = [
    tfe_workspace_run.root
  ]
}

Option 3: Conventional, but different defaults:

resource "tfe_workspace_run" {
  organization = tfe_organization.example-multispace.name
  workspace    = tfe_workspace.example.name
  
  do_destroy = true
  manual_confirm = true
  retry = false
}

I like option 1. It logically groups the behaviours together.

The only real downside I see with option 1 is that requiring the user to provide at least one block may be a little unintuitive. But Terraform would pick this up when you try to plan and show the user a nice helpful error message.

As for option 2, with separate resources... I've already mentioned the refactoring issue that could pose, although I don't know how often that's likely to be a problem in the real world.

I do however think having a tfe_workspace_destroy resource is itself gonna be confusing. A user might look at that and think that it means a destroy will be triggered when the tfe_workspace_destroy resource is created. Whereas the purpose of the resource is instead to say "when this resource is destroyed, we will trigger a destroy on this other workspace"

Well said... I agree with those assessments. But I'm not at all sure about any of this being intuitive 🤣 and at least option 1 keeps the related settings close together, which makes documentation simpler.

A few random thoughts, unrelated to one another...


Given the framing that these initial runs are there as a sort of "initial provisioning" of a workspace shortly after it's created, it feels to me like the approval to create the tfe_workspace_apply resource type (or whatever it's called) is sufficient approval to approve the run, without a further approval within the individual runs.

I agree with the observations made above that it's typically problematic for a Terraform provider to do something that requires another synchronous out-of-band action outside of Terraform. We've run into this before with use-cases such as the two-way approvals for cross-VPC peering in AWS, and it created a bunch of friction, e.g. folks not realizing they needed to do that external action and having the operation time out, or knowing they needed to do it but having their session or 2FA expire at just the wrong time so they took too long to get there, etc. I would recommend doing whatever you can to make this operation totally non-interactive.

(I would not make that same call if the resource were representing arbitrary runs that could potentially be modifying already-created infrastructure, but for initial create this seems most aligned with the supposed user intent, and relatively low-risk.)


Any time where it feels unclear what users would expect as a default value and the consequences of that decision are irreversible, I think that's a good prompt to make the argument required and thus not have a default at all.

By this I mean to consider making this do_destroy (or similar) argument required and thus force the user to make an informed choice between do_destroy = true and do_destroy = false, since both of those options have potentially-bothersome consequences. (Deleting stuff unexpectedly vs. leaving objects behind in the remote system unexpectedly.)
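As a sketch of what that explicit choice might look like in configuration (argument and resource names here follow the Option 3 proposal earlier in the thread, so treat them as placeholders):

resource "tfe_workspace_run" "keep_resources" {
  organization = "my-org-name"
  workspace    = tfe_workspace.app.name

  # Required, no default: destroying this resource will NOT queue a
  # destroy run, so the workspace's managed objects are left in place.
  do_destroy = false
}

resource "tfe_workspace_run" "tear_down" {
  organization = "my-org-name"
  workspace    = tfe_workspace.other.name

  # Required, no default: destroying this resource queues a destroy run
  # and waits for it, deleting everything the workspace manages.
  do_destroy = true
}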


In an ideal world I would prefer to see the planning step (CustomizeDiff in SDKv2 terms) for tfe_workspace_apply actually start the run, wait for it to reach the apply confirmation step, and then somehow summarize the content of the inner plan as part of the outer plan, so folks can see at least a high-level overview of what this plan is going to do before approving it. The Create step could then approve the plan and allow it to proceed to the apply phase.

I expect that isn't actually feasible with the current design of Terraform Cloud's run state machine, because:

  • Saying "no" to the outer plan would leave the inner plan stuck at its approval step rather than cancelling it,because Terraform doesn't inform a provider when a plan is discarded.
  • An active plan blocks further progress until it's explicitly applied or cancelled, so running terraform plan more than once without applying wouldn't work: the inner run created by the second plan would get stuck in the workspace's run queue and be unable to proceed.

So I don't think this is really viable for right now but I wonder if 1) you all can see a path to making this integrate better with the outer plan in the future (by changing the Terraform Cloud API to better match the provider's needs), and if so 2) does that suggest any design tradeoffs that would leave room in the initial design to allow for that future expansion later?

Perhaps this is just a bridge too far and we need to accept the initial runs being planned and applied all at once during the Create step with no feedback, but I thought it couldn't hurt to ask the question. 🤷‍♂️

Given the framing that these initial runs are there as a sort of "initial provisioning" of a workspace shortly after it's created, it feels to me like the approval to create the tfe_workspace_apply resource type (or whatever it's called) is sufficient approval to approve the run, without a further approval within the individual runs.

Agreed. While this new resource has the potential to be used for other things... it being designed for an initial run on a newly created workspace makes this seem reasonable.


Any time where it feels unclear what users would expect as a default value and the consequences of that decision are irreversible, I think that's a good prompt to make the argument required and thus not have a default at all.

Yeah, I think I agree on this point.

If we go with an Option 3 style config #742 (comment)
then definitely make both do_apply and do_destroy mandatory, with no default behaviour, so the user explicitly opts in to the behaviour.

If we go with an Option 1 style config #742 (comment)
then this kinda comes naturally, as you need to have one or both of the apply or destroy blocks.


In an ideal world I would prefer to see the planning step (CustomizeDiff in SDKv2 terms) for tfe_workspace_apply actually start the run

Yeah, that'd be super cool. I don't think it'd be possible in all cases (e.g. if the outer plan results in creating a workspace, variables, creds, whatever... then the inner plan can't actually be run). But... it could be doable in some circumstances. Tricky though, and definitely feels like it'll require a lot more changes that make it waaay out of scope for this. But a cool idea to revisit in future.

Something else to consider as part of this feature (i.e. a common use case) is the usage of remote_state_consumer_ids between workspaces. Consider the following scenario:

Assume that you have two workspaces which also contain variables and need to share outputs with each other:

resource "tfe_organization" "test-organization" {
  name  = "my-org-name"
  email = "admin@company.com"
}

resource "tfe_workspace" "test" {
  name         = "my-workspace-name"
  organization = tfe_organization.test-organization.name
  tag_names    = ["test", "app"]
  remote_state_consumer_ids = [tfe_workspace.foo.id]
}

resource "tfe_variable" "test" {
  key          = "my_key_name"
  value        = "my_value_name"
  category     = "terraform"
  workspace_id = tfe_workspace.test.id
  description  = "a useful description"
}

resource "tfe_workspace" "foo" {
  name         = "my-foo-workspace-name"
  organization = tfe_organization.test-organization.name
  tag_names    = ["foo", "app"]
}

resource "tfe_variable" "foo" {
  key          = "my_key_name"
  value        = "my_value_name"
  category     = "terraform"
  workspace_id = tfe_workspace.foo.id
  description  = "a useful description"
}

With this in mind, Terraform would create the organization, then the foo workspace, then the foo variable, followed by the test workspace, and finally the test variable. However, when it comes to actually executing runs on those workspaces, the order differs a bit: the test workspace would need to run first, and then the foo workspace, since foo consumes test's remote state.

Agreed. I don't think we'd need to do anything specific in the provider to make this work though. In this instance, we'd have...

resource "tfe_workspace_run" "test" {
  organization = tfe_organization.test-organization.name
  workspace    = tfe_workspace.test.name
}

resource "tfe_workspace_run" "foo" {
  organization = tfe_organization.test-organization.name
  workspace    = tfe_workspace.foo.name

  depends_on = [
    tfe_workspace_run.test
  ]
}

merged in #786!