cloudposse/docs

Document Atlantis with Geodesic

osterman opened this issue · 1 comments

what

  • Explain this mind-warping concept

Introduction

When you run your infrastructure with geodesic as your base image, you have the benefit of being able to run it anywhere Docker is supported.

For example, you can run it:

  1. On your local workstation
  2. On a remote ECS cluster (e.g. as a service task)
  3. On a remote Kubernetes cluster (e.g. as a Deployment)

Since we have a container, we should be able to apply all the standard release engineering "best practices" to build and deploy Infrastructure as Code.

Release Engineering

Before we continue, it's important to point out the two most common CI/CD pipelines:

  1. Monorepo CI/CD is where you have one repository with multiple apps, each with a different SDLC
  2. Polyrepo CI/CD is where each repository holds a single app with a single SDLC

Usually, the master branch (or some branch or tag like that) represents the state of production. That is, some commit sha should equal what has been deployed. If you follow, we're on the same page.

Both these strategies share one common pattern:

  1. Build: Any time a Pull Request is opened or synchronized, check out the code, build the application, and run the tests
  2. Deploy: Any time a Pull Request is merged to master, deploy

Now, this is oversimplified. There's perhaps a lot more going on than just this, but the gist of it should be something like that.
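
The shared pattern can be sketched as a small event-to-stage mapping. This is only an illustration, not part of any real pipeline: the function name `pipeline_stage` is hypothetical, and the action names are assumed to follow GitHub's `pull_request` webhook (`opened`, `synchronize`, `closed`).

```python
from typing import Optional


def pipeline_stage(action: str, merged: bool = False,
                   base_branch: str = "master") -> Optional[str]:
    """Map a pull request event to a pipeline stage, or None if nothing runs."""
    # Build: the PR was opened or received new commits.
    if action in ("opened", "synchronize"):
        return "build"
    # Deploy: the PR was closed by merging into master.
    if action == "closed" and merged and base_branch == "master":
        return "deploy"
    # Anything else (e.g. closed without merging) triggers nothing.
    return None
```

Note that a PR closed without merging maps to no stage at all, which is why the `merged` flag has to be checked separately from the `closed` action.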

Thought Experiment

Consider this thought experiment:

  1. We open up a Pull Request. All tests pass.
  2. We merge to master (our "production" state), which triggers a deployment.
  3. Deployment fails.

What do we do? The master branch now contains code that was not successfully deployed. Now our production environment has diverged from what is in git. That's no good.

From here we would typically expect a few things to happen.

  1. Our deployment process is so robust that the failed deployment didn't affect production. It was caught early during the "rolling update" (or "blue/green") rollout. We continue running the previous version in production.
  2. Our engineers revert the Pull Request, restoring the pristine nature of the master branch so that it represents production.

Now, we totally agree that the above process is how things should look. But what happens if the technology or software we are using doesn't support that workflow? Do we try to fix the technology? Or do we find "compensating controls" so we can achieve the same outcome?

The problem

When we're deploying infrastructure as code ("IaC"), we're often deploying the backplane itself: the foundation upon which everything else runs. One of the most common tools for deploying IaC is terraform.

Anyone who uses terraform on a regular basis has probably seen the following error:

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

Okay, great. Now what? How do we apply the "CI/CD Best Practices" to terraform when the tool itself doesn't support the key capability we've come to rely on to achieve CD for decades?

Side rant: This is not an easy problem to solve.

This is not the fault of terraform. It's extremely difficult to generalize what the rollback process should look like as it relates to IaC. It's better that a human operator identify the best course of action, rather than the tool making a best guess (e.g. "Oops, let's just destroy the production database and restore it to the previous version" because the security group didn't exist).

So now we have a couple of problems:

  1. We cannot reliably do rollbacks
  2. We may have master inconsistent with what's in production
  3. If we merge to master, then others in the company are going to start developing against a "desired" state that doesn't yet exist and might even be impossible to achieve.
  4. We have to pull the "emergency brake" and stop everything

The Compromise

Now we're going to lay out our solution to this problem. It builds on the fine work of atlantis and bends it to our needs. The fundamental innovation of atlantis is a new kind of CI/CD pipeline.

Let's call this option (3):

"CI/CD Operations by Interactive Pull Requests".

But what does that mean?

The new workflow looks something like this:

  1. Create a new branch.
  2. Make your changes in that branch.
  3. Open up a "Pull Request" when you want to see what should happen (e.g. terraform plan)
  4. Test those changes with a "dry run" automatically (if enabled & user is authorized)
  5. Use GitHub comments to interact with that pull request (e.g. atlantis plan or atlantis apply)
  6. To apply changes, get the PR approved. Then run atlantis apply.
  7. If successful, then merge to master. Else, go back to #2. Repeat until successful.
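
The decision points in steps 3 through 7 can be sketched as a tiny state check. This is a simplified model for illustration only: `PullRequestState` and `next_step` are hypothetical names, and the real state lives in GitHub and atlantis, not in a Python object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PullRequestState:
    plan_succeeded: bool = False            # did the dry run (terraform plan) pass?
    approved: bool = False                  # has the PR been approved?
    apply_succeeded: Optional[bool] = None  # None means apply was not attempted yet


def next_step(pr: PullRequestState) -> str:
    """Return what the operator should do next for this pull request."""
    if not pr.plan_succeeded:
        return "fix changes and plan again"
    if not pr.approved:
        return "get the PR approved"
    if pr.apply_succeeded is None:
        return "comment: atlantis apply"
    if pr.apply_succeeded:
        return "merge to master"
    # Apply failed: master is untouched, so iterate on the branch (step 2).
    return "go back to step 2"
```

The key property is the last branch: a failed apply never reaches master, so master only ever receives changes that were already applied successfully.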

The new assumptions as they relate to a geodesic-based infrastructure repo (e.g. testing.cloudposse.co):

  1. Treat the repo as a monorepo that contains multiple projects (e.g. in /conf), each with its own SDLC.
  2. Treat atlantis as one of the apps in this monorepo. It has its own SDLC.

Here's what this then looks like:

  1. We deploy our geodesic container to some AWS account with an IAM role that allows it to perform operations at our behest. This becomes one of our operating contexts that we can use to deploy infrastructure. Depending on where this container runs and the permissions it has, we gain the capability to affect infrastructure.
  2. This container receives webhook callbacks from the infrastructure repo. When it receives an authorized request, it carries out the action: it checks out the code at the commit sha and runs the command. Each one of these callbacks is a different SDLC workflow. This is the monorepo CI/CD process.
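
To make "authorized request" concrete, here is a deliberately simplified sketch of turning a PR comment into an action. It is not how atlantis itself parses comments or checks permissions; `parse_comment` and the `authorized_users` set are hypothetical, and any flags after the subcommand are ignored.

```python
from typing import Optional, Set

ALLOWED_COMMANDS = {"plan", "apply"}


def parse_comment(body: str, author: str,
                  authorized_users: Set[str]) -> Optional[str]:
    """Return "plan" or "apply" for an authorized atlantis comment, else None."""
    words = body.strip().split()
    # Only comments of the form "atlantis <command> ..." are considered.
    if len(words) < 2 or words[0] != "atlantis":
        return None
    if words[1] not in ALLOWED_COMMANDS:
        return None
    # Unauthorized authors are ignored rather than acted upon.
    if author not in authorized_users:
        return None
    return words[1]
```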

Note, there can be multiple PRs open against the same /conf/$project, so it doesn't make sense to operate in /conf/$project. As such, atlantis checks out the work in a temporary folder and executes from there. IMPORTANT: atlantis does not operate in the /conf folder the way a human operator would. It's more like atlantis is operating in something like the /localhost folder.
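
The reason a shared /conf/$project cannot work is easiest to see with a sketch of per-PR working directories. The layout below (`pr-<number>/<project>` under some root) is an assumption made for illustration and is not claimed to match the directories atlantis actually creates.

```python
from pathlib import Path


def workspace_for(root: Path, pull_number: int, project: str) -> Path:
    """Return (and create) an isolated working directory for one PR and project."""
    # Keying on the PR number keeps two open PRs against the same
    # project from overwriting each other's checkout.
    path = root / f"pr-{pull_number}" / project
    path.mkdir(parents=True, exist_ok=True)
    return path
```

For example, calling `workspace_for(Path(tempfile.mkdtemp()), 12, "conf/vpc")` and again with PR number 13 yields two separate directories for the same project.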

Thought Experiment #2

"I don't agree this is necessary!"

Okay, we hear you. We don't want to do this either. But let's consider the alternative: We build the docker image containing all the infrastructure as code, treating this as a polyrepo CI/CD pipeline.

Now we need to go apply the changes. How do we know what changed? We cannot use git techniques to identify the changes. The only way is to iterate over every project and do a terraform plan and possibly a terraform apply if there were changes.

If we do this, then:

  1. deployments will take forever for large infrastructures because we have to iterate over all projects;
  2. we dramatically expand the blast radius, since we possibly apply changes that were not clearly expressed by our PR (yes, this avoids drift, but the tradeoff is wicked)
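
By contrast, in the monorepo style the affected projects can be read straight from the files a PR touches. A minimal sketch, assuming the projects live under a top-level `conf/` directory in the repo and that the input is a list of paths such as the output of `git diff --name-only`; the function name `changed_projects` is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def changed_projects(changed_files: Iterable[str],
                     conf_root: str = "conf") -> List[str]:
    """Return the sorted project names under conf_root touched by the given files."""
    projects = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        # Expect conf/<project>/<file...>; skip files directly in conf/ or elsewhere.
        if len(parts) >= 3 and parts[0] == conf_root:
            projects.add(parts[1])
    return sorted(projects)
```

Only the projects returned here would need a terraform plan, which keeps both the run time and the blast radius bounded by what the PR actually expresses.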

Proposed Changes

  • When we run geodesic with atlantis we should move away from multi-stage and instead use terraform init -from-module=...

One issue I thought of with step 6 of the Compromise section: what happens if new IaC is approved and applied in a different PR, and it affects the state of the resources or infrastructure in the current PR? This was one of the reasons to only run terraform apply on a single branch (master?). Perhaps there is a way for Atlantis to check whether the current branch is out of date with the master/main branch?
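
Independent of whatever Atlantis itself offers, one way to detect that situation is to ask git whether the tip of master is already contained in the PR branch. The sketch below relies on `git merge-base --is-ancestor`, which exits 0 when the first commit is an ancestor of the second and 1 when it is not. The function names and the `origin/master` default are assumptions, and the `run` parameter exists only so the logic can be exercised without a real repository.

```python
import subprocess
from typing import Callable, List


def _git_exit_code(args: List[str]) -> int:
    """Run git with the given arguments and return its exit code."""
    return subprocess.run(["git", *args], check=False).returncode


def branch_is_up_to_date(base_ref: str = "origin/master",
                         head_ref: str = "HEAD",
                         run: Callable[[List[str]], int] = _git_exit_code) -> bool:
    """Return True if head_ref already contains every commit on base_ref."""
    code = run(["merge-base", "--is-ancestor", base_ref, head_ref])
    if code not in (0, 1):
        # Any other exit code means git itself failed (e.g. unknown ref).
        raise RuntimeError(f"git merge-base failed with exit code {code}")
    return code == 0
```

A check like this could gate the apply step: if it returns False, the branch would first be updated from master and planned again before applying.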