Canary Deployments Proposal
sethboyles opened this issue · 7 comments
CF Canary Deployments
Authors: @sethboyles @pivotalgeorge @joaopapereira
Reviewers: @Gerg @tcdowney @Samze
Draft Proposal
Authors: Seth Boyles, George Gelashvili, Joao De Almeida Pereira
Reviewers: Greg Cobb
Feature goals
Canary Deployments will allow App Developers to create a new Application Instance with a new version of their application. By monitoring the Canary Instance, App Developers will be able to ensure the reliability of this Canary Instance before promoting the new version to the rest of their application instances.
How to Use It
User Workflow
Basic Usage
Starting a Canary Deployment
App Developers will be able to push an application with Canary Deployments by using the --strategy flag on the CLI, similar to Rolling Deployments:
$ cf push myapp --strategy=canary
(This flag will also be available for other actions like restart and restage.)
Once the Canary Deployment has brought up the canary instance, the CLI will exit.
Observing the Canary Instance
App Developers will be able to monitor the canary instance’s status by:
- monitoring app logs by a tag identifying which logs originate from the Canary Instance
- routing requests to the canary instance directly
Promoting the New Version
If the App Developer determines that the Canary Instance is reliable and wants to promote the new app version to all instances, they would execute the cf continue-deployment command:
cf continue-deployment myapp
Cloud Controller will then promote the rest of the Application Instances to the new app version.
Canceling the Canary Deployment
If the App Developer determines that the Canary Instance is NOT reliable, they would execute the cf cancel-deployment command:
cf cancel-deployment myapp
Cloud Controller will then tear down the Canary Instance.
Technical Behavior Overview
- Upon creation, a Canary Deployment brings up a single ‘Canary Instance’ of the new process.
- The Deployment will pause, awaiting external evaluation.
- An App Operator (Admin, Space Developer, or Space Supporter) will indicate that the Canary Instance has a) passed evaluation or b) failed evaluation.
- If the Canary Instance has failed evaluation, the App Operator will Cancel the Deployment to roll back to the previous revision.
- If the Canary Instance has passed evaluation, the App Operator will Continue the Deployment to promote the new revision.
CF CLI
Initiating a Canary Deployment
Currently the CF CLI supports passing the --strategy=rolling option to the push, restage, and restart commands to use a Rolling Deployment.
Similarly, adding support for passing canary to the --strategy flag would allow App Operators to deploy with a Canary Deployment.
$ cf push myapp --strategy=canary
As with Rolling Deployments, the CF CLI will poll CAPI for updates on the deployment’s progress. Unlike with Rolling Deployments, where the CLI waits until the Deployment reaches the FINALIZED status before exiting, the CLI will instead wait until the Canary Deployment has reached the PAUSED status. Once the deployment is paused, the CLI will prompt the user to call the continue-deployment command and exit.
$ cf push myapp --strategy=canary
# ...streamed staging output..
Starting canary deployment for app myapp...
Waiting for canary instance to deploy...
Canary instance starting...
name: myapp
requested state: started
routes: myapp.example.com
last uploaded: Tue 21 May 20:50:44 UTC 2024
stack: cflinuxfs4
buildpacks:
name version detect output buildpack name
ruby_buildpack 1.10.5 ruby ruby
type: web
sidecars:
instances: 3/3
memory usage: 1024M
start command: bundle exec rackup config.ru -p $PORT -o 0.0.0.0
state since cpu memory disk logging details
#0 running 2024-05-21T20:50:56Z 0.4% 46M of 1G 129.6M of 1G 0/s of unlimited
#1 running 2024-05-21T20:50:56Z 0.4% 46M of 1G 129.6M of 1G 0/s of unlimited
#2 running 2024-05-21T20:50:56Z 0.4% 46M of 1G 129.6M of 1G 0/s of unlimited
type: web
sidecars:
instances: 1/1
memory usage: 1024M
start command: bundle exec rackup config.ru -p $PORT -o 0.0.0.0
state since cpu memory disk logging details
#0 running 2024-05-21T20:50:56Z 0.4% 46M of 1G 129.6M of 1G 0/s of unlimited
Canary Deployment PAUSED.
Please run `cf continue-deployment myapp` to promote the canary deployment, or `cf cancel-deployment myapp` to rollback to the previous version.
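The polling behavior described above could be sketched roughly as follows. This is only an illustration in Python, not the CLI's actual implementation (the CF CLI is not written in Python): `wait_for_canary`, `fetch_status`, and `fake_status_source` are hypothetical names, and `fetch_status` stands in for whatever call retrieves the deployment's status object from CAPI.

```python
import time
from typing import Callable, Dict


def wait_for_canary(fetch_status: Callable[[], Dict[str, str]],
                    poll_interval: float = 1.0,
                    max_polls: int = 600) -> str:
    """Poll until the deployment pauses or finalizes; return the reason seen.

    `fetch_status` is a hypothetical stand-in for reading the deployment's
    "status" object ({"value": ..., "reason": ...}) from the API.
    """
    for _ in range(max_polls):
        status = fetch_status()
        # A canary deployment stops being polled once it is PAUSED.
        if status["value"] == "ACTIVE" and status["reason"] == "PAUSED":
            return "PAUSED"
        # A deployment that finalized (for any reason) also ends the wait.
        if status["value"] == "FINALIZED":
            return status["reason"]
        time.sleep(poll_interval)
    raise TimeoutError("deployment did not pause or finalize in time")


def fake_status_source() -> Callable[[], Dict[str, str]]:
    """Stub status source: reports DEPLOYING twice, then PAUSED."""
    responses = iter([
        {"value": "ACTIVE", "reason": "DEPLOYING"},
        {"value": "ACTIVE", "reason": "DEPLOYING"},
        {"value": "ACTIVE", "reason": "PAUSED"},
    ])
    return lambda: next(responses)


result = wait_for_canary(fake_status_source(), poll_interval=0)
print(result)
```

The only difference from a rolling-style wait in this sketch is the extra exit condition on the PAUSED reason; everything else about the loop is the same.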
Continuing a Canary Deployment
Upon executing the cf continue-deployment command, the CLI will call the deployment’s continue action and the Deployment will proceed until completion like a Rolling Deployment.
$ cf continue-deployment myapp
Continuing app myapp in org org / space space as admin...
Waiting for app to deploy...
Instances starting...
Instances starting...
Instances starting...
name: myapp
requested state: started
routes: myapp.example.com
last uploaded: Tue 21 May 20:50:44 UTC 2024
stack: cflinuxfs4
buildpacks:
name version detect output buildpack name
ruby_buildpack 1.10.5 ruby ruby
type: web
sidecars:
instances: 3/3
memory usage: 1024M
start command: bundle exec rackup config.ru -p $PORT -o 0.0.0.0
state since cpu memory disk logging details
#0 running 2024-05-21T20:50:56Z 0.4% 46M of 1G 129.6M of 1G 0/s of unlimited
#1 running 2024-05-21T20:50:56Z 0.4% 46M of 1G 129.6M of 1G 0/s of unlimited
#2 running 2024-05-21T20:50:56Z 0.4% 46M of 1G 129.6M of 1G 0/s of unlimited
Surfacing Canary Deployment Status
App Operators will be able to discover if an app is currently PAUSED during a Canary Deployment by calling the cf app command:
$ cf app myapp
# Streamed staging output
name: myapp
requested state: started
routes: myapp.example.com
last uploaded: Tue 21 May 20:50:44 UTC 2024
stack: cflinuxfs4
buildpacks:
name version detect output buildpack name
ruby_buildpack 1.10.5 ruby ruby
type: web
sidecars:
instances: 3/3
memory usage: 1024M
start command: bundle exec rackup config.ru -p $PORT -o 0.0.0.0
state since cpu memory disk logging details
#0 running 2024-05-21T20:50:56Z 0.4% 46M of 1G 129.6M of 1G 0/s of unlimited
#1 running 2024-05-21T20:50:56Z 0.4% 46M of 1G 129.6M of 1G 0/s of unlimited
#2 running 2024-05-21T20:50:56Z 0.4% 46M of 1G 129.6M of 1G 0/s of unlimited
type: web
sidecars:
instances: 1/1
memory usage: 1024M
start command: bundle exec rackup config.ru -p $PORT -o 0.0.0.0
state since cpu memory disk logging details
#0 running 2024-05-21T20:50:56Z 0.4% 46M of 1G 129.6M of 1G 0/s of unlimited
Canary Deployment PAUSED.
Please run `cf continue-deployment myapp` to promote the canary deployment, or `cf cancel-deployment myapp` to rollback to the previous version.
Additionally, this might be an opportune time to surface other Deployment information in the cf app output (such as the status of Rolling Deployments, or of Canary Deployments before and after the PAUSED step).
CAPI
Creating a Canary Deployment
Creating a Canary Deployment will use the strategy property of the Deployment resource. Instead of the value rolling (which is currently the only valid value), clients will set the value to canary to create a Canary Deployment.
POST https://api.example.org/v3/deployments
Canary Deployment JSON example:
{
  "droplet": {
    "guid": "[droplet-guid]"
  },
  "strategy": "canary",
  "relationships": {
    "app": {
      "data": {
        "guid": "[app-guid]"
      }
    }
  }
}
Upon creation, a Canary Deployment will immediately bring up a single Canary Instance of the app’s new revision or droplet.
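As a rough illustration of how a client might assemble this request body, here is a small Python sketch. The function name `build_deployment_request` and the client-side validation are assumptions made for the example; only the JSON shape and the two strategy values come from the proposal.

```python
import json

# The two strategy values discussed in this proposal.
VALID_STRATEGIES = {"rolling", "canary"}


def build_deployment_request(app_guid: str, droplet_guid: str,
                             strategy: str = "rolling") -> dict:
    """Build a Create Deployment body (hypothetical client-side helper)."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unsupported strategy: {strategy!r}")
    return {
        "droplet": {"guid": droplet_guid},
        "strategy": strategy,
        "relationships": {"app": {"data": {"guid": app_guid}}},
    }


body = build_deployment_request("[app-guid]", "[droplet-guid]", "canary")
print(json.dumps(body, indent=2))
```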
Monitoring Progress of a Canary Deployment
A new status.reason on the Deployment object, PAUSED, will be introduced to track the Canary Deployment’s state as the Canary Instance is evaluated.
"status": {
  "value": "ACTIVE",
  "reason": "PAUSED",
  "details": {
    "last_successful_healthcheck": "2024-04-25T22:42:10Z",
    // New property to show when Canary Deployment was transitioned to PAUSED
    "last_status_change": "2024-04-25T22:32:10Z"
  }
},
(See The Deployment Object in the V3 apidocs for more information on the status field.)
Once in the PAUSED state, Canary Deployments will remain PAUSED until an App Operator has indicated external evaluation has passed and the deployment is ready to proceed.
Initially there will be no timeout: Canary Deployments can remain PAUSED indefinitely. See Configurable Timeout under Possible Future Enhancements.
Promoting Canary Deployments
Once the App Operator has determined they would like to promote the Canary Deployment, they will call a new action endpoint (see the action to cancel a deployment for an existing, comparable action):
POST https://api.example.org/v3/deployments/[deployment-guid]/actions/continue
Once the Canary Deployment’s continue action has been called, the Deployment will transition from PAUSED to DEPLOYING.
The Canary Deployment will then proceed similarly to a Rolling Deployment (that is, 1 new instance will be brought up and 1 old instance will be brought down in serial, repeated with no further pausing until the Deployment is complete).
Canceling Canary Deployments
App Operators will be able to use the existing Cancel Deployment API action to roll back a Canary Deployment with current status ACTIVE and reason PAUSED.
Supersedence of Canary Deployments
Like Rolling Deployments, Canary Deployments can be superseded by a Deployment created before the Canary Deployment has finished.
If a Canary Deployment has a status.value of ACTIVE, then the Deployment can be superseded, even if the status.reason is PAUSED.
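The continue, cancel, and supersede rules above can be summarized as a small state model. The Python sketch below is illustrative only and is not Cloud Controller code: the `Deployment` class and its method names are hypothetical, and the CANCELING and SUPERSEDED reasons are assumed to match the existing Deployment status reasons in the V3 API.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Toy model of a deployment's status.value / status.reason."""
    strategy: str
    value: str = "ACTIVE"
    reason: str = "DEPLOYING"

    def pause_for_evaluation(self) -> None:
        # Called once the canary instance is up (canary strategy only).
        if self.strategy != "canary" or self.value != "ACTIVE":
            raise ValueError("only an active canary deployment can pause")
        self.reason = "PAUSED"

    def continue_deployment(self) -> None:
        # The continue action only makes sense while PAUSED.
        if (self.value, self.reason) != ("ACTIVE", "PAUSED"):
            raise ValueError("only a PAUSED deployment can be continued")
        self.reason = "DEPLOYING"

    def cancel(self) -> None:
        # Cancel applies to any ACTIVE deployment, PAUSED included.
        if self.value != "ACTIVE":
            raise ValueError("only an ACTIVE deployment can be canceled")
        self.reason = "CANCELING"

    def supersede(self) -> None:
        # A newer deployment may supersede any ACTIVE one, PAUSED included.
        if self.value != "ACTIVE":
            raise ValueError("only an ACTIVE deployment can be superseded")
        self.value, self.reason = "FINALIZED", "SUPERSEDED"


canary = Deployment(strategy="canary")
canary.pause_for_evaluation()
canary.continue_deployment()
print(canary.value, canary.reason)
```

The point of the sketch is that PAUSED is a reason within the ACTIVE value, so every rule that keys off ACTIVE (cancel, supersede) continues to apply while the canary is being evaluated.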
Possible Future Enhancements
While the above proposal is kept as feature-minimal as possible while meeting the needs of a basic Canary Deployment, App Operators may eventually expect more control over their Deployment strategies. The following are potential ways Canary Deployments and Deployments in general can be enhanced.
To support various configurable options specific to the deployment strategy, a new options property could be added to the Create Deployment request:
{
  "revision": {
    "guid": "[revision-guid]"
  },
  "strategy": "canary",
  "options": {
    "canary_options": {
      "steps": 3,
      "instances_per_step": 4,
      "step_timeout": 600 // timeout in seconds
    },
    "max_in_flight": 2
  }
  ...
}
Configurable Number of Canary Steps
App Operators may wish to perform multiple evaluations of a Canary Deployment.
"steps": 3 // Default: 1
A Canary Deployment with a steps value of 3 would transition to PAUSED 3 times throughout the entire rollout. The Canary Deployment would require the App Operator to call the continue action 3 times before fully promoting the canary.
A steps value of NULL would require the App Operator to evaluate the entire rollout.
Configurable Number of Canary Instances per Step
"instances_per_step": 4,
An instances_per_step property would allow multiple Canary Instances to be brought up before the Deployment is PAUSED for evaluation.
Configurable Step Weights (alternative to Canary Steps/Canary Instances per Step)
A single configurable value, step_weights, could be an alternative to configuring instances_per_step and steps.
"step_weights": [20, 40, 50, 100]
A Canary Deployment with the above step_weights would roll out 20% of instances, then 40% (total), then 50%, then 100%, pausing at each step for evaluation.
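To make the percentages concrete, the following Python sketch converts cumulative step weights into the number of new-version instances running at each pause. It is only an illustration: the function name `instances_at_each_pause` is hypothetical, and rounding up to a whole instance is an assumption, since the proposal does not specify a rounding rule.

```python
from typing import List


def instances_at_each_pause(total_instances: int,
                            step_weights: List[int]) -> List[int]:
    """Cumulative new-version instance count at each pause.

    Weights are cumulative percentages (e.g. [20, 40, 50, 100]).
    Rounding up is an assumption made for this sketch.
    """
    if total_instances < 1:
        raise ValueError("total_instances must be at least 1")
    if any(w <= 0 or w > 100 for w in step_weights):
        raise ValueError("weights must be in the range 1..100")
    if list(step_weights) != sorted(step_weights):
        raise ValueError("weights must be non-decreasing")
    # Integer ceiling of total_instances * w / 100, avoiding float rounding.
    return [(total_instances * w + 99) // 100 for w in step_weights]


print(instances_at_each_pause(10, [20, 40, 50, 100]))  # [2, 4, 5, 10]
```

With an app scaled to 10 instances, the weights above would pause at 2, 4, 5, and 10 new instances under this rounding assumption.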
Configurable Max-in-Flight
"max_in_flight": 3,
Note: break this out into new document
Orthogonal to Canary Deployments, max_in_flight is also applicable to Rolling Deployments. A Deployment with a max_in_flight of 3 would bring up 3 new instances at once, and tear down one old instance as each new instance is brought up.
This, however, is complicated by the Canary Deployment’s PAUSED state: would the teardown of instances wait until after the Deployment’s continue action has been called?
NOTE: Distinction between ‘Instances per Step’ and ‘Max-in-Flight’
instances_per_step and max_in_flight differ in purpose and behavior:
- max_in_flight: number of instances CC will request Diego to bring up/down at once (a value that could be applied to both Rolling Deployments and Canary Deployments)
- instances_per_step: number of instances to roll out before pausing for evaluation (a value specific to Canary Deployments)
A Canary Deployment with an instances_per_step of 10 but a max_in_flight of 1 would create a slow rollout that pauses after 10 canary instances are brought up.
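The interaction between the two values can be sketched as a batching calculation: within one step, instances are requested in batches of at most max_in_flight until instances_per_step have been brought up. The Python below is a hypothetical illustration (`plan_step_batches` is not an existing function) and ignores the teardown of old instances.

```python
from typing import List


def plan_step_batches(instances_per_step: int, max_in_flight: int) -> List[int]:
    """Batch sizes requested before the deployment pauses for evaluation."""
    if instances_per_step < 1 or max_in_flight < 1:
        raise ValueError("both values must be at least 1")
    batches = []
    remaining = instances_per_step
    while remaining > 0:
        # Never request more than max_in_flight instances at once.
        batch = min(max_in_flight, remaining)
        batches.append(batch)
        remaining -= batch
    return batches


# Slow rollout: ten batches of one instance each, then pause.
print(plan_step_batches(10, 1))
# Faster rollout of the same step: batches of 3, 3, 3 and 1, then pause.
print(plan_step_batches(10, 3))
```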
Configurable Timeout
An optional configurable timeout property named step_timeout could be added to the Deployment resource:
{
  ...
  "strategy": "canary",
  "options": {
    "canary_step_timeout": 600 // timeout in seconds
  }
  ...
}
If the timeout is reached before the Canary Deployment has been progressed via the “continue” action endpoint, the deployment would automatically be canceled and rolled back to the previous revision (i.e. the single canary instance will be taken down).
The name step_timeout is chosen as opposed to timeout to clarify that the timeout is not a generic timeout that could apply to the entire deployment lifecycle, or to other deployment types, like rolling.
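One way the timeout check could work is to compare the proposed last_status_change timestamp against the configured step timeout. The Python sketch below is an assumption-laden illustration, not a proposed implementation: `step_timed_out` and `parse_timestamp` are hypothetical helpers, and treating a missing timeout as "never expires" mirrors the initial no-timeout behavior described earlier.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse a UTC timestamp such as 2024-04-25T22:32:10Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def step_timed_out(last_status_change: str,
                   step_timeout: Optional[int],
                   now: datetime) -> bool:
    """True if a PAUSED deployment has waited at least step_timeout seconds."""
    if step_timeout is None:
        # No timeout configured: the deployment may stay PAUSED indefinitely.
        return False
    paused_at = parse_timestamp(last_status_change)
    return now - paused_at >= timedelta(seconds=step_timeout)


paused_at = "2024-04-25T22:32:10Z"
ten_minutes_later = parse_timestamp("2024-04-25T22:42:10Z")
print(step_timed_out(paused_at, 600, ten_minutes_later))
```

A deployment for which this check returns true would then be canceled and rolled back, as described above.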
Support for automatic evaluation of SLOs
Automatic rollback of a Canary Deployment based on app metrics such as HTTP request success rate, response time, or other custom metrics will likely require large changes across CF components.
Deployment Specific Routing
Currently App Instance routing does not work with multiple processes. To allow for such features as keeping a subset of user sessions only on the old/new deployment instances, we would need to fix instance-based routing and expand it to support instances from different processes.
Mirroring of Idempotent Requests
Traffic mirroring (i.e. duplicating each incoming request, sending one copy to the new app version and one to the old, as a way of measuring the new version without impacting user experience) would require the ability to route to individual app instances and would also likely require large changes across CF components.
Thank you for the useful proposal! I have two questions:
- Is there a difference in the behaviour between the canary compared to rolling strategy when the new app version introduces changes to the app environment or new service bindings? What happens with the changed app environment or the new service bindings when cancel-deployment is executed? Please see issue #3531 for more details.
- Often I get feedback from CF users that it will be great if rolling strategy could add CF API option (CLI flag) which can trigger service binding re-creation. It means that CF creates a new service binding which is bound to the updated app instances and the old binding is deleted when the deployment is through. I didn't evaluate technically what does it mean to add such a feature but do you see this as a future improvement for the update strategies?
Is there a difference in the behaviour between the canary compared to rolling strategy when the new app version introduces changes to the app environment or new service bindings? What happens with the changed app environment or the new service bindings when cancel-deployment is executed. Please see issue #3531 for more details.
The idea for the Canary deployments is to allow users to create a new app instance with new code and target it to see if everything is good. Afterward, it will continue with a rolling deployment of the rest of the instances. So, I expect the behavior to be the same as in the rolling strategy. Another thing that we discussed is that this can potentially cause some problems in a scenario where this new instance receives requests in the middle of a "session" and could be missing some information that is needed on the new version or might not provide the needed information back in the response that will be required for the old version of the app. For now, we assume no significant breaking changes will be supported by this feature. We may consider some Blue/Green deployment strategy for these cases (not accounted for in this proposal). Your example above might also fall into this bucket. Would you agree?
Often I get feedback from CF users that it will be great if rolling strategy could add CF API option (CLI flag) which can trigger service binding re-creation. It means that CF creates a new service binding which is bound to the updated app instances and the old binding is deleted when the deployment is through. I didn't evaluate technically what does it mean to add such a feature but do you see this as a future improvement for the update strategies?
My knowledge in CF is not too vast, but I am assuming that by introducing the options key, we can open the API to add new features that can be enabled by the users when they are pushing their applications.
I like the proposal 👍
Two questions came to my mind:
Does the canary instance participate in app routing? I guess so similar to the standard rolling update. Might be a nice (future) enhancement to offer an option so that the canary instance does not participate in app routing but can only be reached via instance specific routing (or use a separate canary route) until it was successfully evaluated (i.e. the deployment gets continued).
How does the canary strategy behave when the canary instance, or later one of the other instances, fails on deploying? Will the deployment get canceled (i.e. switch back to the last droplet in a non-ZDM way, as for canceling a rolling deployment)?
We may consider some Blue/Green deployment strategy for these cases (not accounted for in this proposal). Your example above might also fall into this bucket. Would you agree?
yes, I agree on this.
Does the canary instance participate in app routing? I guess so similar to the standard rolling update. Might be a nice (future) enhancement to offer an option so that the canary instance does not participate in app routing but can only be reached via instance specific routing (or use a separate canary route) until it was successfully evaluated (i.e. the deployment gets continued).
@Gerg and I have been chatting a bit about this as well.
This could be possible with a future enhancement to route destinations to also support an optional process.guid property:
https://v3-apidocs.cloudfoundry.org/version/3.167.0/index.html#the-destination-object
We were thinking mostly in terms of supporting a dedicated "canary route" that only routes to the canary and leaving the main route alone, but that original route would then direct traffic to all process instances still. To do what you're suggesting we'd need to support process.guid and update the original route to include the original process guid and not just type. I think that's doable, but might get pretty complicated if there are multiple simultaneous Deployments.
You may also be able to do something by flagging a process as a canary or something, but that solution feels a little overfit to this problem.
Having canaries only reachable via dedicated validation routes makes sense as a feature. I agree with Tim that it will probably be relatively easy to make a dedicated validation route for canaries, but more difficult to exclude them from the process's normal route.
Using UpdateDesiredLRP we theoretically should be able to isolate a canary instance to a separate route, and update it once the deployment is promoted. I can see it being handy to have (for example) some easy way of defining a custom canary route in the Deployment create request:
{
  "revision": {
    "guid": "[revision-guid]"
  },
  "strategy": "canary",
  "options": {
    "canary": {
      "route": "my_special_canary_route.example.com"
    },
    ...
  }
}
Of course the problem with this is that it's unclear how it would mesh with CCNG internal routing modeling.
Providing another field like process.guid to route destinations is interesting; would that be orchestrated outside of the DeploymentUpdater (i.e. by the CLI)? Or is it better to have the DeploymentUpdater automatically create/delete that destination? (It'd be nice to make an informational annotation on RouteDestinations, but I think we only have metadata on Routes.)
In either case, I feel as though it would be a little confusing to have Route Destinations constantly being updated with new processes, instead of a single RouteDestination that doesn't disappear between deployments and users can clearly identify as what is being used for their canary routing. Maybe having Canary Deployments use a special process.type of canary or something would be a way of isolating canary instances without altering the RouteDestinations API. That'd certainly make the DeploymentUpdater logic more complicated, though (also, we probably rely on a lot of the special casing web processes have elsewhere in the code). Perhaps a special canary flag on RouteDestinations is enough.
Until we figure this out, canary instances will participate in app routing. It's not ideal, but hopefully we can drive out a solution soon. cc @stephanme