getsentry/freight

Deploy Queue and Auto Deploy

mitsuhiko opened this issue · 13 comments

The end goal is to have a way to track branches and automatically deploy whenever commits therein happen. However, the same problem happens if two commits are scheduled by any means. So to isolate the problem, we want the following:

  • queue a commit by enqueueing a commit sha
    • this can happen based on a hook
    • alternatively manually
  • when multiple commits are queued up, we skip irrelevant commits to the same target

The proposal is to have a deploy stack for each target (production, staging etc.).

  • When a new commit should be deployed it's added to the queue.
  • If it's a named target (like a branch) then the target is resolved immediately into a commit hash
  • it's pushed to the top of the stack for the target
  • Items can be removed from anywhere within the stack to revoke a pending deploy (stack with benefits)

Independently of this there is the deploy logic:

  • There is a system that monitors the stack for each target and always deploys top.
  • Once the deploy is done it looks at the stack and if it's not empty, it deploys the top again.
  • At any point any stack items older than the last deploy are ignored.
  • In an ideal situation this ends up in an empty stack.

Example:

DEPLOY A (stack was empty, deploy starts)
DEPLOY B
DEPLOY C
DEPLOY D
DEPLOY E
STACK state: B C D [E] (E is top, A was already removed immediately)

Once A has finished deploying, E is the top of the stack, as it's newer than the deploy of A. When E is done, D, C and B are removed from the stack as they are too far in the past timestamp-wise. If E had been revoked before being deployed, D would have been deployed, as expected.
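A minimal sketch of the stack behavior described above, using a per-stack sequence number in place of timestamps. The `DeployStack` class and its method names are hypothetical and not part of Freight; they only illustrate the "deploy top, ignore anything older than the last deploy" rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DeployStack:
    """Hypothetical per-target stack (one per production, staging, etc.)."""
    items: List[Tuple[int, str]] = field(default_factory=list)  # (seq, sha)
    next_seq: int = 1           # logical clock standing in for timestamps
    last_deployed_seq: int = 0  # seq of the item most recently taken for deploy

    def push(self, sha: str) -> None:
        self.items.append((self.next_seq, sha))
        self.next_seq += 1

    def revoke(self, sha: str) -> None:
        # Items can be removed from anywhere within the stack.
        self.items = [item for item in self.items if item[1] != sha]

    def pop_next(self) -> Optional[str]:
        # Drop everything queued before the last deploy was picked.
        self.items = [i for i in self.items if i[0] > self.last_deployed_seq]
        if not self.items:
            return None
        seq, sha = self.items.pop()  # top of the stack
        self.last_deployed_seq = seq
        return sha


stack = DeployStack()
stack.push("A")
first = stack.pop_next()      # A deploys immediately
for sha in ["B", "C", "D", "E"]:
    stack.push(sha)
second = stack.pop_next()     # E is top once A is done
third = stack.pop_next()      # B, C, D are older than E and get dropped
```

With this sketch, revoking E before the second `pop_next()` call would make D the next deploy.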

I think the stack can be virtual.

For example:

  • We have a "Revoked SHAs" mapping that says "these should not be deployable"
  • When REF goes to deploy, it sees E is newest but revoked. It instead looks at E^ and continues upwards to see if anything since A is possible.

This would then make the process look like:

  • Enqueue (APP, ENV, REF) [a unique constraint in the queue would exist on this key]
  • When item is popped off queue, after small grace period, REF is resolved to SHA (using the above behavior)
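The "virtual stack" idea can be sketched as a walk over the branch history at resolve time. This is only an illustration: `resolve_deployable` is a hypothetical helper, and the history is assumed to be a newest-first list of SHAs (the order `git rev-list REF` produces).

```python
from typing import Iterable, Optional, Set


def resolve_deployable(
    history: Iterable[str],
    revoked: Set[str],
    last_deployed: Optional[str],
) -> Optional[str]:
    """Return the newest non-revoked SHA that is newer than last_deployed.

    history must be ordered newest first. Returns None when every commit
    since the last deploy has been revoked (or there is nothing new).
    """
    for sha in history:
        if sha == last_deployed:
            return None  # reached the current deploy; nothing usable above it
        if sha not in revoked:
            return sha
    return None


history = ["E", "D", "C", "B", "A"]  # newest first
picked = resolve_deployable(history, revoked={"E"}, last_deployed="A")
```

Here `picked` ends up as D, since E is revoked and D is the next candidate on the way back to A.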

In any case we need the stack for checking revokes because we cannot go by git log. So the list needs to exist somewhere, stack being the easiest.

Why can we not use git log?

Per conversation over Slack:

  • Enqueue a job for REF
  • Queue has uniqueness on REF + APP + ENV
  • When job pops from queue to run it resolves REF
  • Grace period for jobs in queue
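The four points from the Slack conversation could look roughly like the sketch below. It is an in-memory illustration only: `DedupQueue`, its methods, and the injected `resolve` callback are all hypothetical names, and a real implementation would keep this state in the database or Redis.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str, str]  # (app, env, ref)


class DedupQueue:
    """Hypothetical queue with uniqueness on APP + ENV + REF and a grace period."""

    def __init__(self, grace_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._grace = grace_seconds
        self._clock = clock
        self._pending: Dict[Key, float] = {}  # key -> enqueue time, insertion ordered

    def enqueue(self, app: str, env: str, ref: str) -> bool:
        key = (app, env, ref)
        if key in self._pending:
            return False  # an identical job is already waiting
        self._pending[key] = self._clock()
        return True

    def pop_ready(self, resolve: Callable[[str], str]) -> Optional[Key]:
        """Pop the oldest job past its grace period; resolve REF only now."""
        now = self._clock()
        ready = next(
            (k for k, t in self._pending.items() if now - t >= self._grace),
            None,
        )
        if ready is None:
            return None
        del self._pending[ready]
        app, env, ref = ready
        return app, env, resolve(ref)
```

Because the REF is resolved at pop time, any number of pushes during the grace period collapse into a single deploy of whatever the branch points at when the job runs.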

Additionally:

  • Panic button (or similar) to turn off auto deploys
  • [Future] Blacklist SHAs to ensure that a check (i.e. build system) being valid doesn't have to mean the commit is safe

I think for a simple approach (to avoid rewriting too much) we should do this atm:

  • Add a 'queued' state. This ensures we can tell "I'm waiting in the queue" apart from "the worker hasn't started the task yet"
  • Add constraint on task index that checks for queued jobs matching REF + APP + ENV
  • Jobs created via task index become status = queued (instead of status = pending)
  • Have a celerybeat job which checks if there are any open queue slots
    • open signifies "no jobs matching APP + ENV with status in (pending, in_progress)"
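The body of that periodic job might look something like this. The sketch is plain Python with no Celery dependency; the `Task` dataclass and `promote_queued` function are hypothetical stand-ins for the real task model, and the task list is assumed to be ordered oldest first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    app: str
    env: str
    ref: str
    status: str  # "queued", "pending", "in_progress", "finished", ...


def promote_queued(tasks: List[Task]) -> List[Task]:
    """Move at most one queued task per (app, env) into 'pending'.

    A slot is open when no task for that (app, env) is pending or in progress.
    """
    busy = {
        (t.app, t.env) for t in tasks if t.status in ("pending", "in_progress")
    }
    promoted = []
    for task in tasks:
        slot = (task.app, task.env)
        if task.status == "queued" and slot not in busy:
            task.status = "pending"
            busy.add(slot)  # the slot is taken from now on
            promoted.append(task)
    return promoted
```

In a real setup the promoted tasks would then be handed to the worker (for example via `.delay` on the execute task), which is the step this sketch leaves out.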

Alternatively we could immediately mark something as in_progress once the execute_task job has been shipped to the queue (i.e. on .delay). That lets us avoid the extra queued state (and confusion of pending + queued).

The celerybeat solution is pretty shit, but there's no great way to guarantee jobs get run without late acks (which don't work in Celery + Redis).

Changing behavior to resolve the SHA after it enters the queue is surprisingly complicated. It means you can no longer get synchronous feedback of "this build is failing".

We could make the behavior special cased.

i.e. when auto deploy happens it delays ref resolution, but when you create a task by default it resolves immediately.

rshk commented

I'm interested to hear your use cases about this..?

About skipping commits from the queue if a newer one is scheduled to be deployed, I'm not sure I'd want that, especially in case migrations are involved; actually we were considering ways to mark commits as "deploy me", to make sure deployment of a future version will trigger a deploy of that commit as well. An example (history from older to newer):

aaaa Something
bbbb Something
cccc Migration to create model X #DEPLOYME
dddd Code using model X
eeee Something

Let's say currently running version is aaaa and I want to deploy eeee, a deploy of cccc should be scheduled as well.

This should probably be the responsibility of the caller (?), but for sure I wouldn't want freight to skip deployment of cccc just because eeee is already scheduled..

..maybe it would be worth adding that "deployme" attribute to builds, to make sure they're not skipped?

rshk commented

About rollbacks, it might be worth requiring some extra flag to explicitly say "I know I'm about to deploy an older version, this is exactly what I want, go ahead"?

E.g. 99% of the time I really don't want to risk scheduling a deployment of an older version by mistake; this could easily happen due to a "race condition" between two people scheduling deployments at once.
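As a rough sketch of what such a flag could mean in practice: the check below refuses to schedule a commit that is older than the currently deployed one unless the caller opts in. `check_deploy`, `RollbackNotAllowed` and `allow_rollback` are hypothetical names, not existing Freight options, and the history is assumed to be an oldest-first list of SHAs on the branch.

```python
from typing import Sequence


class RollbackNotAllowed(Exception):
    """Raised when an older commit is scheduled without explicit consent."""


def check_deploy(
    history: Sequence[str],
    current: str,
    candidate: str,
    allow_rollback: bool = False,
) -> None:
    """Raise unless candidate is at or after current, or rollback is allowed.

    history is ordered oldest first; both SHAs must be present in it
    (list.index raises ValueError otherwise).
    """
    if history.index(candidate) < history.index(current) and not allow_rollback:
        raise RollbackNotAllowed(
            f"{candidate} is older than the deployed {current}; "
            "pass allow_rollback=True to deploy it anyway"
        )


history = ["aaaa", "bbbb", "cccc", "dddd", "eeee"]
check_deploy(history, current="cccc", candidate="eeee")  # newer: fine
check_deploy(history, current="cccc", candidate="aaaa", allow_rollback=True)
```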

rshk commented

About reverts: as they usually mean something bad happened already, I'm not really sure I'd like to have anything magic going on; maybe the safest option would be to just have some sort of "panic button" that will do something like:

  • pause all the queued builds
  • stop accepting build requests from webhooks until "resolved"

(this might be triggered automatically as well, eg. when a commit marked as revert is found on the branch, but I wouldn't try to be too smart about the behavior..)

rshk commented

[Future] Blacklist SHAs to ensure that a check (i.e. build system) being valid doesn't have to mean the commit is safe

Maybe just creating a "status" on github associated with a commit would be the quickest way? (should work already, w/o any change on freight side).

https://developer.github.com/v3/repos/statuses/#create-a-status

@rshk we don't want to be dependent on GitHub, so while pushing that data upstream would be fine, relying on it wouldn't be.

Regarding auto deploy and reverts, we already implemented reverts. The behavior is "deploy the previous green build". Having a way to freeze things is definitely on the want list, but we didn't build that yet. We also haven't sorted out the "auto deploy should resolve later", but I think the case of saying "I need a random version to go out before another version goes out" is very use case specific. In that case I don't think you should rely on any standard auto deploy, but rather you'd need entirely custom automation logic or manual usage to achieve it.

rshk commented

We also haven't sorted out the "auto deploy should resolve later"

Why do you want this? I mean, personally I would feel much safer to know in advance which exact commit is going to be deployed, rather than generically "tip of branch X"..? Besides, this might mean having to wait for CI on a newer commit rather than deploying a slightly-older green one (although this might be desired if deployment takes a long time..?)

Eg:

  • push aaa to master
  • CI green for aaa
  • schedule deploy 1 for master
  • deploy 1 start with aaa
  • push bbb to master
  • CI green for bbb
  • schedule deploy 2 for master
  • push ccc to master
  • deploy 1 finish
  • try to start deploy 2, but have to wait for CI on ccc (while bbb was deployable)

I think the case of saying "I need a random version to go out before another version goes out" is very use case specific. In that case I don't think you should rely on any standard auto deploy, but rather you'd need entirely custom automation logic or manual usage to achieve it.

+1 about it not being freight's responsibility to decide what to deploy. I'm mostly concerned that "skipping deployments if one for a newer version is scheduled" would prevent doing this entirely.

Btw, I'm curious about how you handle deploying database migrations (and in general ensure that the version being deployed can live just fine alongside the old one.. or do you use any better approach to this?)

Why do you want this? I mean, personally I would feel much safer to know in advance which exact commit is going to be deployed, rather than generically "tip of branch X"..? Besides, this might mean having to wait for CI on a newer commit rather than deploying a slightly-older green one (although this might be desired if deployment takes a long time..?)

It's not about safety, it's about efficient automation. We deploy when things are green in other systems, therefore every commit should be safe. We can't queue up 100 deploys (one for each commit) as it's very easy to commit faster than we can deploy.

Btw, I'm curious about how you handle deploying database migrations (and in general ensure that the version being deployed can live just fine alongside the old one.. or do you use any better approach to this?)

We use a migration framework. The standard pattern for this behavior is "what version am I at? apply all migrations since that version". This happens as part of our deploy process and we keep history for all of time here. Even if we didn't keep all of time, we'd squash to keep at least the last X days (say 30 days).
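The "what version am I at? apply all migrations since that version" pattern can be sketched in a few lines. This is a generic illustration, not the migration framework referred to above: the `migrate` function, the dict standing in for a database, and the `schema_version` key are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Migration = Tuple[int, Callable[[Dict], None]]  # (version, apply function)


def migrate(db: Dict, migrations: List[Migration]) -> List[int]:
    """Apply every migration newer than the version recorded in db.

    Returns the list of versions applied, in order.
    """
    current = db.get("schema_version", 0)
    applied = []
    for version, apply in sorted(migrations, key=lambda m: m[0]):
        if version > current:
            apply(db)
            db["schema_version"] = version  # record progress after each step
            applied.append(version)
    return applied


def create_model_x(db: Dict) -> None:
    db["model_x"] = []


def add_index(db: Dict) -> None:
    db["model_x_index"] = {}


db = {"schema_version": 1}
applied = migrate(db, [(1, lambda d: None), (2, create_model_x), (3, add_index)])
```

Under this pattern, deploying a later commit directly still runs every migration introduced by the commits that were skipped, because the deploy compares against the recorded version rather than against the previous deploy.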