cloudfoundry/cloud_controller_ng

Canary Deployments Proposal

sethboyles opened this issue · 7 comments

CF Canary Deployments

Authors: @sethboyles @pivotalgeorge @joaopapereira
Reviewers: @Gerg @tcdowney @Samze

Draft Proposal

Authors: Seth Boyles George Gelashvili Joao De Almeida Pereira

Reviewers: Greg Cobb

Feature goals

Canary Deployments will allow App Developers to create a new Application Instance with a new version of their application. By monitoring the Canary Instance, App Developers will be able to ensure the reliability of this Canary Instance before promoting the new version to the rest of their application instances.

How to Use It

User Workflow

Basic Usage

Starting a Canary Deployment

App Developers will be able to push an application with Canary Deployments by using the --strategy flag on the CLI, similar to Rolling Deployments:

$ cf push myapp --strategy=canary

(this flag will also be available for other actions like restart and restage)

Once the Canary Deployment has brought up the canary instance, the CLI will exit.

Observing the Canary Instance

App Developers will be able to monitor the canary instance’s status by:

  • filtering app logs by a tag that identifies which logs originate from the Canary Instance

  • routing requests to the canary instance directly

Promoting the New Version

If the App Developer determines that the Canary Instance is reliable and wants to promote the new app version to all instances, they would execute the cf continue-deployment command:

$ cf continue-deployment myapp

Cloud Controller will then promote the rest of the Application Instances to the new app version.

Canceling the Canary Deployment

If the App Developer determines that the Canary Instance is NOT reliable, they would execute the cf cancel-deployment command:

$ cf cancel-deployment myapp

Cloud Controller will then tear down the Canary Instance.


Technical Behavior Overview

  1. Upon creation, a Canary Deployment brings up a single ‘Canary Instance’ of the new process.

  2. The Deployment will pause, awaiting external evaluation.

  3. An App Operator (Admin, Space Developer, or Space Supporter) will indicate that the Canary Instance has a) passed evaluation or b) failed evaluation.

  4. If the Canary Instance has failed evaluation, the App Operator will Cancel the Deployment to roll back to the previous revision.

  5. If the Canary Instance has passed evaluation, the App Operator will Continue the Deployment to promote the new revision.
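
The five steps above can be read as a small state machine. The following is only an illustrative sketch: PAUSED, DEPLOYING, and FINALIZED appear in this proposal, while `CANARY_STARTING`, `CANCELED`, the event names, and the `advance` helper are hypothetical labels chosen for the example and are not part of CAPI.

```python
# Simplified model of the Canary Deployment lifecycle described above.
# CANARY_STARTING, CANCELED and the event names are hypothetical labels.
TRANSITIONS = {
    ("CANARY_STARTING", "canary_up"): "PAUSED",   # steps 1 and 2
    ("PAUSED", "cancel"): "CANCELED",             # step 4: evaluation failed
    ("PAUSED", "continue"): "DEPLOYING",          # step 5: evaluation passed
    ("DEPLOYING", "rollout_done"): "FINALIZED",
}


def advance(state: str, event: str) -> str:
    """Return the next state, or raise if the event is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None


if __name__ == "__main__":
    state = "CANARY_STARTING"
    for event in ("canary_up", "continue", "rollout_done"):
        state = advance(state, event)
        print(event, "->", state)
```

In this sketch a `continue` event is rejected anywhere except in the PAUSED state, which mirrors the idea that promotion only happens after the pause for evaluation.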

CF CLI

Initiating a Canary Deployment

Currently the CF CLI supports passing the --strategy=rolling option to the push, restage, and restart commands to use a Rolling Deployment.
Similarly, adding support for passing canary to the --strategy flag would allow App Operators to deploy with a Canary Deployment.

$ cf push myapp --strategy=canary

Like Rolling Deployments, the CF CLI will poll CAPI for updates on the deployment’s progress. Unlike with Rolling Deployments, where the CLI waits until the Deployment reaches the FINALIZED status before exiting, the CLI will instead wait until the Canary Deployment has reached the PAUSED status. Once the deployment is paused, the CLI will prompt the user to call the continue-deployment command and exit.

$ cf push myapp --strategy=canary
# ...streamed staging output..
Starting canary deployment for app myapp...
Waiting for canary instance to deploy...

Canary instance starting...

name:              myapp
requested state:   started
routes:            myapp.example.com
last uploaded:     Tue 21 May 20:50:44 UTC 2024
stack:             cflinuxfs4
buildpacks:
        name             version   detect output   buildpack name
        ruby_buildpack   1.10.5    ruby            ruby

type:            web
sidecars:
instances:       3/3
memory usage:    1024M
start command:   bundle exec rackup config.ru -p $PORT -o 0.0.0.0
     state     since                  cpu    memory        disk           logging              details
#0   running   2024-05-21T20:50:56Z   0.4%   46M of 1G     129.6M of 1G   0/s of unlimited
#1   running   2024-05-21T20:50:56Z   0.4%   46M of 1G     129.6M of 1G   0/s of unlimited
#2   running   2024-05-21T20:50:56Z   0.4%   46M of 1G     129.6M of 1G   0/s of unlimited

type:            web
sidecars:
instances:       1/1
memory usage:    1024M
start command:   bundle exec rackup config.ru -p $PORT -o 0.0.0.0
     state     since                  cpu    memory        disk           logging              details
#0   running   2024-05-21T20:50:56Z   0.4%   46M of 1G     129.6M of 1G   0/s of unlimited

Canary Deployment PAUSED. 

Please run `cf continue-deployment myapp` to promote the canary deployment, or `cf cancel-deployment myapp` to rollback to the previous version.
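
The polling behavior described above (wait for PAUSED rather than FINALIZED) could be sketched roughly as follows. This is not the CF CLI implementation: `wait_until_paused`, its parameters, and the injected `fetch_deployment` callable are hypothetical, and the only fields assumed are the `status.value` and `status.reason` properties of the Deployment resource.

```python
import time
from typing import Callable


def wait_until_paused(
    fetch_deployment: Callable[[], dict],
    interval: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the deployment is ACTIVE/PAUSED, then return it.

    fetch_deployment is a hypothetical callable that returns the
    deployment resource as a dict (for example, a GET of the deployment).
    """
    for _ in range(max_polls):
        deployment = fetch_deployment()
        status = deployment["status"]
        if status["value"] == "ACTIVE" and status["reason"] == "PAUSED":
            return deployment
        if status["value"] == "FINALIZED":
            # The deployment ended (e.g. was canceled) before it paused.
            raise RuntimeError(f"deployment finalized before pausing: {status['reason']}")
        sleep(interval)
    raise TimeoutError("deployment did not reach PAUSED")
```

Injecting `fetch_deployment` and `sleep` keeps the sketch independent of any HTTP client and makes it easy to exercise with canned responses.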

Continuing a Canary Deployment

Upon executing the cf continue-deployment command, the CLI will call the deployment’s continue action and the Deployment will proceed until completion like a Rolling Deployment.

$ cf continue-deployment myapp
Continuing app myapp in org org / space space as admin...

Waiting for app to deploy...

Instances starting...
Instances starting...
Instances starting...

name:              myapp
requested state:   started
routes:            myapp.example.com
last uploaded:     Tue 21 May 20:50:44 UTC 2024
stack:             cflinuxfs4
buildpacks:
        name             version   detect output   buildpack name
        ruby_buildpack   1.10.5    ruby            ruby

type:            web
sidecars:
instances:       3/3
memory usage:    1024M
start command:   bundle exec rackup config.ru -p $PORT -o 0.0.0.0
     state     since                  cpu    memory        disk           logging              details
#0   running   2024-05-21T20:50:56Z   0.4%   46M of 1G     129.6M of 1G   0/s of unlimited
#1   running   2024-05-21T20:50:56Z   0.4%   46M of 1G     129.6M of 1G   0/s of unlimited
#2   running   2024-05-21T20:50:56Z   0.4%   46M of 1G     129.6M of 1G   0/s of unlimited

Surfacing Canary Deployment Status

App operators will be able to discover if an app is currently PAUSED during a Canary Deployment by calling the cf app command:

$ cf app myapp

name:              myapp
requested state:   started
routes:            myapp.example.com
last uploaded:     Tue 21 May 20:50:44 UTC 2024
stack:             cflinuxfs4
buildpacks:
        name             version   detect output   buildpack name
        ruby_buildpack   1.10.5    ruby            ruby

type:            web
sidecars:
instances:       3/3
memory usage:    1024M
start command:   bundle exec rackup config.ru -p $PORT -o 0.0.0.0
     state     since                  cpu    memory        disk           logging              details
#0   running   2024-05-21T20:50:56Z   0.4%   46M of 1G     129.6M of 1G   0/s of unlimited
#1   running   2024-05-21T20:50:56Z   0.4%   46M of 1G     129.6M of 1G   0/s of unlimited
#2   running   2024-05-21T20:50:56Z   0.4%   46M of 1G     129.6M of 1G   0/s of unlimited

type:            web
sidecars:
instances:       1/1
memory usage:    1024M
start command:   bundle exec rackup config.ru -p $PORT -o 0.0.0.0
     state     since                  cpu    memory        disk           logging              details
#0   running   2024-05-21T20:50:56Z   0.4%   46M of 1G     129.6M of 1G   0/s of unlimited

Canary Deployment PAUSED. 

Please run `cf continue-deployment myapp` to promote the canary deployment, or `cf cancel-deployment myapp` to rollback to the previous version.

Additionally, this might be an opportune time to surface other Deployment information in the cf app output (for example, the status of Rolling Deployments, or of Canary Deployments before and after the PAUSED step).

CAPI

Creating a Canary Deployment

Creating a Canary Deployment will use the strategy property of the Deployment resource.  Instead of the value rolling (which currently is the only valid value), clients will set the value to canary to create a Canary Deployment.

POST https://api.example.org/v3/deployments

Canary Deployment JSON example:

{
  "droplet": {
    "guid": "[droplet-guid]"
  },
  "strategy": "canary",
  "relationships": {
    "app": {
      "data": {
        "guid": "[app-guid]"
      }
    }
  }
}

Upon creation, a Canary Deployment will immediately bring up a single Canary Instance of the app’s new revision or droplet.
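
As a rough illustration of a client issuing this request, the sketch below builds (but does not send) the POST using only the Python standard library. The function name and the `api_url`, `app_guid`, `droplet_guid`, and `token` parameters are placeholders for this example; the body matches the JSON shown above.

```python
import json
import urllib.request


def build_canary_deployment_request(
    api_url: str, app_guid: str, droplet_guid: str, token: str
) -> urllib.request.Request:
    """Build the Create Deployment request with strategy set to canary."""
    body = {
        "droplet": {"guid": droplet_guid},
        "strategy": "canary",
        "relationships": {"app": {"data": {"guid": app_guid}}},
    }
    return urllib.request.Request(
        url=f"{api_url}/v3/deployments",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"bearer {token}",  # placeholder token
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_canary_deployment_request(
        "https://api.example.org", "[app-guid]", "[droplet-guid]", "[token]"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out here so the example runs without a CF API to talk to.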

Monitoring Progress of a Canary Deployment

A new status.reason on the Deployment object, PAUSED, will be introduced to track the Canary Deployment’s state while the Canary Instance is evaluated.

"status": {
    "value": "ACTIVE",
    "reason": "PAUSED",
    "details": {
      "last_successful_healthcheck": "2024-04-25T22:42:10Z",
      // New property to show when Canary Deployment was transitioned to PAUSED
      "last_status_change": "2024-04-25T22:32:10Z"
    }
  },

(see The Deployment Object in the V3 apidocs for more information on the status field)

Once in the PAUSED state, Canary Deployments will remain PAUSED until an App Operator has indicated external evaluation has passed and the deployment is ready to proceed.

Initially there will be no timeout: Canary Deployments can remain PAUSED indefinitely. See Configurable Timeout under Possible Future Enhancements.

Promoting Canary Deployments

Once the App Operator has determined they would like to promote the Canary Deployment, they will call a new continue action endpoint (the existing action to cancel a deployment is a comparable action endpoint):

POST https://api.example.org/v3/deployments/[deployment-guid]/actions/continue

Once the Canary Deployment’s continue action has been called, the Deployment will transition from PAUSED to DEPLOYING.

The Canary Deployment will then proceed similarly to a Rolling Deployment (that is, 1 new instance will be brought up and 1 old instance will be brought down in serial, repeated with no further pausing until the Deployment is complete).
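
To make the serial rollout concrete, here is a minimal simulation of instance counts after the continue action. It assumes the PAUSED state holds one canary plus all old instances, and that the last old instance is removed without starting another new one; both are assumptions of this sketch, and `rollout_steps` is a hypothetical helper, not DeploymentUpdater code.

```python
def rollout_steps(total_instances: int) -> list[tuple[int, int]]:
    """Return (new, old) instance counts from PAUSED until completion."""
    new, old = 1, total_instances  # PAUSED: the canary plus every old instance
    steps = [(new, old)]
    while old > 0:
        if new < total_instances:
            new += 1  # bring up one new instance...
        old -= 1      # ...then bring down one old instance
        steps.append((new, old))
    return steps


if __name__ == "__main__":
    for new, old in rollout_steps(3):
        print(f"new={new} old={old}")
```

For an app with 3 instances, this walks through (1, 3), (2, 2), (3, 1), and finally (3, 0).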

Canceling Canary Deployments

App Operators will be able to use the existing Cancel Deployment API action to roll back a Canary Deployment whose current status is ACTIVE with reason PAUSED.

Supersedence of Canary Deployments

Like Rolling Deployments, Canary Deployments can be superseded by a Deployment created before the Canary Deployment has finished. 

If a Canary Deployment has status.value of ACTIVE, then the Deployment can be superseded, even if the status.reason is PAUSED.

Possible Future Enhancements

While the above proposal is kept as feature-minimal as possible while meeting the needs of a basic Canary Deployment, App Operators may eventually expect more control over their Deployment strategies.  The following are potential ways Canary Deployments and Deployments in general can be enhanced. 

To support various configurable options specific to the deployment strategy, a new options property could be added to the Create Deployment request:

{
  "revision": {
    "guid": "[revision-guid]"
  },
  "strategy": "canary",
  "options": {
    "canary_options": {
      "steps": 3,
      "instances_per_step": 4,
      "step_timeout": 600 // timeout in seconds
    },
    "max_in_flight": 2
  },
  ...
}

Configurable Number of Canary Steps

App Operators may wish to perform multiple evaluations of a Canary Deployment.

"steps": 3 // Default: 1

A Canary Deployment with a steps value of 3 would transition to PAUSED 3 times over the course of the rollout. The App Operator would need to call the continue action 3 times before the canary is fully promoted.

A steps value of null would require the App Operator to evaluate the entire rollout.

Configurable Number of Canary Instances per Step

"instances_per_step": 4,

An instances_per_step property would allow multiple Canary Instances to be brought up before the Deployment is PAUSED for evaluation.

Configurable Step Weights (alternative to Canary Steps/Canary Instances per Step)

A single configurable value, step_weights, could be an alternative to configuring instances_per_step and steps.

"step_weights": [20, 40, 50, 100]

A Canary Deployment with the above step_weights would roll out 20% of instances, then 40% (cumulative), then 50%, then 100%, pausing at each step for evaluation.
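
As a worked example of how cumulative weights might translate into instance counts, the sketch below converts each weight into the number of new instances running at that pause. Rounding up to a whole instance is an assumption of this example (the proposal does not specify rounding), and `instances_at_each_pause` is a hypothetical helper.

```python
def instances_at_each_pause(total_instances: int, step_weights: list[int]) -> list[int]:
    """Cumulative number of new instances running at each pause.

    Each weight is a cumulative percentage; fractional instances are
    rounded up (an assumption made for this sketch).
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    return [(total_instances * weight + 99) // 100 for weight in step_weights]


if __name__ == "__main__":
    print(instances_at_each_pause(10, [20, 40, 50, 100]))
```

With 10 instances and the weights above, the deployment would pause at 2, 4, 5, and 10 new instances.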

Configurable Max-in-Flight

"max_in_flight": 3,

Note: break this out into new document

Orthogonal to Canary Deployments, max_in_flight is also applicable to Rolling Deployments. A Deployment with max_in_flight of 3 would bring up 3 new instances at once, and tear down one old instance as each new instance is brought up.

This, however, is complicated by the Canary Deployment’s PAUSED state: would the teardown of instances wait until after the Deployment’s continue action has been called?

NOTE: Distinction between ‘Instances per Step’ and ‘Max-in-Flight’

instances_per_step and max_in_flight differ in purpose/behavior:

  • max_in_flight: number of instances CC will request Diego to bring up/down at once (a value that could be applied to both Rolling Deployments and Canary Deployments)

  • instances_per_step: number of instances to roll out before pausing for evaluation (a value specific to Canary Deployments)

A Canary Deployment with instances_per_step of 10, but max_in_flight of 1, would create a slow rollout that paused after 10 canary instances were brought up.
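
The interaction between the two values can be sketched by listing the batches that would be requested before a pause. This is only an illustration of the distinction drawn above; `batches_before_pause` is a hypothetical helper, and the batching rule (fill each batch up to max_in_flight) is an assumption.

```python
def batches_before_pause(instances_per_step: int, max_in_flight: int) -> list[int]:
    """Sizes of the instance batches brought up before the next pause."""
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    batches = []
    remaining = instances_per_step
    while remaining > 0:
        batch = min(max_in_flight, remaining)  # never exceed max_in_flight
        batches.append(batch)
        remaining -= batch
    return batches


if __name__ == "__main__":
    print(batches_before_pause(10, 1))  # slow rollout: ten batches of one
    print(batches_before_pause(10, 3))  # faster rollout: 3, 3, 3, then 1
```

Either way, 10 canary instances are up when the deployment pauses; max_in_flight only changes how quickly that point is reached.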

Configurable Timeout

An optional configurable timeout property named step_timeout could be added to the Deployment resource:

{
  ...
  "strategy": "canary",
  "options": {
    "canary_options": {
      "step_timeout": 600 // timeout in seconds
    }
  },
  ...
}

If the timeout is reached without the Canary Deployment having been progressed via the “continue” action endpoint, the deployment would automatically be canceled and rolled back to the previous revision (i.e. the single canary instance will be taken down).
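
One way to picture the timeout check is to compare the time spent in PAUSED against step_timeout, using the last_status_change timestamp proposed earlier for status.details. The helper name `step_timed_out` and the choice of treating "exactly at the limit" as timed out are assumptions of this sketch.

```python
from datetime import datetime, timedelta, timezone


def step_timed_out(last_status_change: str, step_timeout: int, now: datetime) -> bool:
    """Return True once the deployment has been PAUSED for step_timeout seconds.

    last_status_change is a UTC timestamp such as "2024-04-25T22:32:10Z".
    """
    paused_at = datetime.strptime(last_status_change, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return now - paused_at >= timedelta(seconds=step_timeout)


if __name__ == "__main__":
    now = datetime(2024, 4, 25, 22, 42, 10, tzinfo=timezone.utc)
    print(step_timed_out("2024-04-25T22:32:10Z", 600, now))
```

A periodic job could run a check like this for each PAUSED Canary Deployment and trigger the existing cancel behavior when it returns true.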

The name step_timeout is chosen as opposed to timeout to clarify the timeout is not a generic timeout that could apply to the entire deployment lifecycle, or to other deployment types, like rolling.

Support for automatic evaluation of SLOs

Automatic rollback of a Canary Deployment based on app metrics such as HTTP request success rate, response time, or other custom metrics will likely require large changes across CF components to support.

Deployment Specific Routing

Currently App Instance routing does not work with multiple processes. To allow for such features as keeping a subset of user sessions only on the old/new deployment instances, we would need to fix instance-based routing and expand it to support instances from different processes.

Mirroring of Idempotent Requests

Traffic mirroring (i.e. duplicating each incoming request and sending one copy to the new app version and one to the old, as a way of measuring the new version without impacting user experience) would require the ability to route to individual app instances and would also likely require large changes across CF components to support.

Thank you for the useful proposal! I have two questions:

  1. Is there a difference in behaviour between the canary and rolling strategies when the new app version introduces changes to the app environment or new service bindings? What happens to the changed app environment or the new service bindings when cancel-deployment is executed? Please see issue #3531 for more details.
  2. I often get feedback from CF users that it would be great if the rolling strategy offered a CF API option (CLI flag) to trigger service binding re-creation. That is, CF would create a new service binding bound to the updated app instances and delete the old binding once the deployment is through. I haven't evaluated what adding such a feature would mean technically, but do you see this as a future improvement for the update strategies?

Is there a difference in behaviour between the canary and rolling strategies when the new app version introduces changes to the app environment or new service bindings? What happens to the changed app environment or the new service bindings when cancel-deployment is executed? Please see issue #3531 for more details.

The idea behind Canary Deployments is to let users create a new app instance with new code and target it to see if everything is good. Afterward, the deployment continues with a rolling deployment of the rest of the instances. So, I expect the behavior to be the same as with the rolling strategy. Another thing we discussed is that this can cause problems when the new instance receives requests in the middle of a "session": it could be missing information that the new version needs, or its response might omit information that the old version of the app requires. For now, we assume that this feature will not support significant breaking changes. We may consider some Blue/Green deployment strategy for these cases (not accounted for in this proposal). Your example above might also fall into this bucket. Would you agree?

I often get feedback from CF users that it would be great if the rolling strategy offered a CF API option (CLI flag) to trigger service binding re-creation. That is, CF would create a new service binding bound to the updated app instances and delete the old binding once the deployment is through. I haven't evaluated what adding such a feature would mean technically, but do you see this as a future improvement for the update strategies?

My knowledge of CF is not too vast, but I am assuming that by introducing the options key, we can open the API to new features that users can enable when they push their applications.

I like the proposal 👍

Two questions came to my mind:

Does the canary instance participate in app routing? I guess so, similar to the standard rolling update. It might be a nice (future) enhancement to offer an option so that the canary instance does not participate in app routing and can only be reached via instance-specific routing (or a separate canary route) until it has been successfully evaluated (i.e. the deployment gets continued).

How does the canary strategy behave when the canary instance, or later one of the other instances, fails to deploy? Will the deployment get canceled (i.e. switch back to the last droplet in a non-ZDM way, as when canceling a rolling deployment)?

We may consider some Blue/Green deployment strategy for these cases (not accounted for in this proposal). Your example above might also fall into this bucket. Would you agree?

yes, I agree on this.

Does the canary instance participate in app routing? I guess so, similar to the standard rolling update. It might be a nice (future) enhancement to offer an option so that the canary instance does not participate in app routing and can only be reached via instance-specific routing (or a separate canary route) until it has been successfully evaluated (i.e. the deployment gets continued).

@Gerg and I have been chatting a bit about this as well.

This could be possible with a future enhancement to route destinations to also support an optional process.guid property:
https://v3-apidocs.cloudfoundry.org/version/3.167.0/index.html#the-destination-object

We were thinking mostly in terms of supporting a dedicated "canary route" that only routes to the canary and leaving the main route alone, but that original route would then direct traffic to all process instances still. To do what you're suggesting we'd need to support process.guid and update the original route to include the original process guid and not just type. I think that's doable, but might get pretty complicated if there are multiple simultaneous Deployments.

You may also be able to do something by flagging a process as a canary or something, but that solution feels a little overfit to this problem.

Gerg commented

Having canaries only reachable via dedicated validation routes makes sense as a feature. I agree with Tim that it will probably be relatively easy to make a dedicated validation route for canaries, but more difficult to exclude them from the process's normal route.

Using UpdateDesiredLRP we theoretically should be able to isolate a canary instance to a separate route, and update it once the deployment is promoted. I can see it being handy to have (for example) some easy way of defining a custom canary route in the Deployment create request:

{
  "revision": {
    "guid": "[revision-guid]"
  },
  "strategy": "canary",
  "options": {
    "canary": {
      "route": "my_special_canary_route.example.com"
    }
  },
  ...
}

Of course the problem with this is that it's unclear how it would mesh with CCNG internal routing modeling.

Providing another field like process.guid to route destinations is interesting; would that be orchestrated outside of the DeploymentUpdater (i.e. by the CLI)? Or is it better to have the DeploymentUpdater automatically create/delete that destination? (It'd be nice to make an informational annotation on RouteDestinations, but I think we only have metadata on Routes.)

In either case, I feel as though it would be a little confusing to have Route Destinations constantly being updated with new processes, instead of a single RouteDestination that doesn't disappear between deployments and that users can clearly identify as what is being used for their canary routing. Maybe having Canary Deployments use a special process.type of canary or something would be a way of isolating canary instances without altering the RouteDestinations API. That'd certainly make the DeploymentUpdater logic more complicated, though (also, we probably rely on a lot of the special casing web processes have elsewhere in the code). Perhaps a special canary flag on RouteDestinations is enough.

Until we figure this out, canary instances will participate in app routing. It's not ideal, but hopefully we can drive out a solution soon. cc @stephanme