Supporting multiple canary deployments
haoel opened this issue · 12 comments
We need to support multiple canary releases, each configured with its own traffic coloring rules, running at the same time. This could cause some unexpected behaviors, so this issue is not just a new enhancement request; it also tries to define the proper rules.
For example, suppose there are two services, A and B, which both depend on Z. If the canary releases A' and B' share the canary instance Z', then Z' will receive traffic from both A' and B', which might not be expected.
The following figure shows possible multiple canary deployments. The first one might cause unexpected issues: Z' might receive more traffic than expected. The second and third are fine, because the different canary traffic flows are totally separated.
In addition, we may have problems when some users are in multiple canary releases.
- On the one hand, a user may match canary traffic rule X but be excluded from canary traffic rule Y. If X and Y share canary service instances, this can cause the system to fail to schedule the traffic.
- On the other hand, if a service has multiple canary instances published and a user satisfies all of the conditions at the same time, which canary instance should actually receive this traffic?
Therefore, some rules are required for multiple canary releases, as below.
- For a canary release (which may cover one or more services), there is only one traffic rule per deployment.
- Canary releases shouldn't share instances of canary services. (P.S. we could allow this in some special cases, but we would need to remind the user that some instances are shared between deployments.)
- Traffic rules for multiple canary releases may match the same users; for such users, we need to set all of the traffic coloring tags on their requests.
- To avoid affecting performance, the number of simultaneous canary releases needs to be limited, e.g. to 5.
- If a service has multiple canary instances at the same time and a user's request has been colored for more than one of them, the traffic is scheduled according to the priority of the traffic rules.
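The last rule above (priority-based scheduling when a request is colored for several canary instances of one service) could be sketched as follows. This is only a Python illustration; the rule structure, tag names, and instance names are all invented:

```python
# Hypothetical sketch of priority-based canary scheduling.
# A smaller priority number means a higher priority; if no rule
# matches the request's coloring tags, traffic goes to the stable
# instances. All names here are illustrative.

def pick_canary_instance(request_tags, rules):
    """rules: list of dicts with 'tag', 'priority', and 'instance' keys."""
    matched = [r for r in rules if r["tag"] in request_tags]
    if not matched:
        return "stable"
    # The highest-priority (smallest number) rule wins.
    return min(matched, key=lambda r: r["priority"])["instance"]

rules = [
    {"tag": "feature-one", "priority": 2, "instance": "Z-canary-1"},
    {"tag": "feature-two", "priority": 1, "instance": "Z-canary-2"},
]
```

For example, a request colored with both tags would be scheduled to `Z-canary-2`, because its rule has the higher priority.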
Based on best practices for canary deploys:
As far as I know, you should compare the canary against a baseline, not against production. But in the figure, the baseline instance traffic is not clearly indicated.
images from Automated Canary Analysis at Netflix with Kayenta
Agree with liuhu, unless you don't need metrics analysis.
In general, experiment launches should be independent of each other: the rollout of canary A should have no knowledge of the rollout of canary Z. This gives you great scalability, because it doesn't matter how many canaries you launch together. It also helps metrics analysis, because each launch can be analyzed independently, with a much smaller scope for investigation.
The rollout stage of A should only determine how much traffic goes to A versus A', and the rollout stage of Z should only determine how much traffic goes to Z versus Z', in total.
Say A' is ramped to 20% and Z' is ramped to 10%; that should mean:
- 80% of the traffic uses A, of which 80%*90% uses Z and 80%*10% uses Z';
- 20% of the traffic uses A', of which 20%*90% uses Z and 20%*10% uses Z'.
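The independent-rollout arithmetic above can be checked with a small sketch (the ramp values are the ones from the example; the assumption is that the two rollouts make their routing decisions independently):

```python
# Verify the traffic split when A' is ramped to 20% and Z' to 10%,
# assuming the two canary rollouts route independently of each other.

ramp_a = 0.20  # fraction of traffic sent to A'
ramp_z = 0.10  # fraction of traffic sent to Z'

split = {
    ("A",  "Z"):  (1 - ramp_a) * (1 - ramp_z),  # 80% * 90% = 72%
    ("A",  "Z'"): (1 - ramp_a) * ramp_z,        # 80% * 10% = 8%
    ("A'", "Z"):  ramp_a * (1 - ramp_z),        # 20% * 90% = 18%
    ("A'", "Z'"): ramp_a * ramp_z,              # 20% * 10% = 2%
}
```

Notice that Z' still receives 10% of the total traffic (8% + 2%) no matter how far A' is ramped, which is what makes the two launches analyzable independently.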
Thanks to @liuhu & @ranmocy for the comments here.
It's no problem for A and Z to do their canary releases individually, which means the A and A' traffic will go to Z instead of Z', and Z' has its own traffic that has nothing to do with A or A'.
However, the real world is sometimes a bit more complicated. Usually, a feature release involves changes to many services at the same time; together they serve the feature. So we have to do the canary release for all of those services.
Multiple canary releases at the same time bring significant complexity in these aspects:
- some of the services could be mixed together;
- some users could be covered by different traffic rules, and those rules might conflict (one rule says you are in, another says you are out; if the two rules share a canary service, we don't know how to schedule the traffic).
So, we'd like to simplify these complicated scenarios and make them clearer, so that everyone can easily understand them.
I used to have an idea about canary deployment; it might be naive. :)
I call it Benzene-Canary.
I drew a diagram quickly; I hope I make myself clear.
We can deploy as many canary instances as we wish for each microservice.
We do traffic dispatch at each microservice's virtual gateway, which could be nginx, Gloo, Traefik, etc.
We can set up rules or logic on the virtual gateway as we wish.
I used to have an idea about canary deployment, might be naive. :)
I call it Benzene-Canary
...
@redsnapper2006
I think a mesh sidecar can achieve your design above. :-)
It would work with Pods A and A' directly, without introducing a gateway per service.
I used to have an idea about canary deployment, might be naive. :)
I call it Benzene-Canary
...
This is a kind of simplification. It's good because it separates the services into different domains and uses a dedicated gateway for each dedicated canary release.
And as we know, a canary release needs to color and schedule the traffic based on the user's tags (such as cookie, token, IP, geo, uid, etc.), so we need to make sure we carry the user's tags through the whole RPC chain. Hence, we have to use a Service Mesh or a JavaAgent to guarantee this.
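Carrying the user's tags along the whole RPC chain could look roughly like the sketch below. A mesh sidecar or JavaAgent would do this transparently; this is only a hand-written illustration, and the header name, user attributes, and tag names are all invented:

```python
# Hypothetical sketch of coloring-tag propagation across an RPC chain.
# "X-Canary-Tags" is an invented header name; a real deployment would
# have the mesh sidecar or JavaAgent forward such headers automatically.

COLOR_HEADER = "X-Canary-Tags"

def color_request(headers, user):
    """At the entry gateway: color the request based on user attributes."""
    tags = []
    if user.get("os") == "android":
        tags.append("feature-one")   # e.g. a feature targeting Android users
    if user.get("town") == "smalltown":
        tags.append("feature-two")   # e.g. a feature targeting one small town
    if tags:
        headers[COLOR_HEADER] = ",".join(tags)
    return headers

def propagate(inbound_headers, outbound_headers):
    """On every downstream RPC: copy the coloring header through unchanged."""
    if COLOR_HEADER in inbound_headers:
        outbound_headers[COLOR_HEADER] = inbound_headers[COLOR_HEADER]
    return outbound_headers
```

The key point is that every hop forwards the header untouched, so a request colored at the edge stays colored all the way down the chain.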
This is a kind of simplification... we have to use the Service Mesh or JavaAgent to guarantee this.
Yes, sure.
Your use cases could be more complicated.
Since you are talking about coloring and user tags: HAProxy can handle user tags at the TCP level, and it can also work as a gateway.
One more thing to mention: my idea can use a different component for each virtual gateway, depending on your requirements, instead of a single mechanism such as a Service Mesh or a JavaAgent.
Anyway, my idea is naive and just an idea; it could be a long way from landing.
It's no problem that A and Z do the canary release individually, which means the A and A' traffic will go to Z instead of Z', and Z' has its own traffic which has nothing to do with A or A'
I'm confused here. If Z' has its own traffic that has nothing to do with A or A', does it come from the end users directly? If so, do you also have a separable traffic stream of the same size going from the end users directly to Z, so that you can compare the metrics between the two?
However, the real world sometimes is a bit complicated... we'd like to simplify these complicated scenarios and make it more clear.
That's exactly why binary releases and feature releases should be decoupled. A binary release should be as close to a no-op as possible, and a feature release (guarded by a centrally controlled feature flag used by all services) can be turned on or off individually after all the required binaries are deployed.
For rule conflicts during feature releases, one potential solution is to configure all feature flags in one place. Another solution, which keeps all feature releases independent for better scalability, is to ask the rule author to provide a merging strategy, and to fail some tests or verifications if it's missing.
One step back: I'm really focused on scalability. If your target users are more likely small- to medium-size services, you may not need to provide such scalability, which introduces complexity into your framework. Then again, the complexity is on the framework itself, not on the framework's users.
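The merging-strategy suggestion could look roughly like this sketch; the strategy names and the function are invented purely for illustration:

```python
# Hypothetical sketch: when several independent rules can decide the
# same feature flag for a user, the rule author must declare how the
# decisions merge. A missing strategy is rejected up front, mirroring
# the "fail some tests or verifications if it's missing" suggestion.

def resolve(flag_decisions, strategy):
    """flag_decisions: booleans produced by independent rules."""
    if strategy == "all":   # enable only if every rule opts the user in
        return all(flag_decisions)
    if strategy == "any":   # enable if any single rule opts the user in
        return any(flag_decisions)
    raise ValueError("rule set must declare a merging strategy")
```

With an explicit strategy, two conflicting rules ("you are in" vs. "you are out") no longer leave the scheduler undefined; the declared merge decides.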
Let me make some clarifications here. Suppose we have two features that need a canary release.
- Feature One: a feature release that needs to change both A and Z. For example, A is an order service and Z is an email service; the new feature adds new information to the order, and the email format needs to be adjusted for this new information. We want to canary this feature for specific users, say Android users. So we need to make sure the canary users can place orders with the new information and that their notification emails also use the new format.
- Feature Two: during the Feature One release, another team also wants to change the Z service (the email service). They are not going to change the email format, but they want to change the email provider (a non-functional change). So they want another canary release of the email service (which may or may not include Feature One), but the user selection is different: it targets only users located in a small town, whether or not they use Android.
Feature One and Feature Two are driven by different groups, and from an engineering perspective, it should be no problem to support canary releases for both features at the same time.
Hope this explanation makes sense.
I think your "canary release" is a combination of my "binary release" and "feature release". And what I'm trying to describe is a way to decouple these two things.
What I mean is that the production code should look like this:
In A:

    if (featureController.isFeatureOneEnabled()) {
        // New code path for feature one in A
    } else {
        // Old code path in A
    }

In Z:

    if (featureController.isFeatureOneEnabled()) {
        // New code path for feature one in Z
    } else {
        // Old code path in Z
    }

    ...

    if (featureController.isFeatureTwoEnabled()) {
        // New code path for feature two
    } else {
        // Old code path
    }
In this way, when you roll out a new binary, you don't need to care about any feature at all. And once the binary is deployed, the feature team can decide to roll out feature one or two at any time, separately or jointly. All the releases are decoupled: the binary release of A, the binary release of Z, the feature release of One, and the feature release of Two.
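A minimal sketch of the central flag store such a featureController could read from (in Python for brevity, though the snippet above is Java-like; the class and the plain dict standing in for a central config service are both invented):

```python
# Minimal sketch of the featureController assumed in the pseudocode
# above. A real implementation would read flags from a central config
# service shared by all services (and could key decisions on user
# attributes); here a plain dict stands in for that store.

class FeatureController:
    def __init__(self, central_flags):
        self._flags = central_flags   # one store shared by all services

    def is_feature_one_enabled(self):
        return self._flags.get("feature_one", False)

    def is_feature_two_enabled(self):
        return self._flags.get("feature_two", False)

flags = {"feature_one": True}         # Feature One rolled out, Two not yet
feature_controller = FeatureController(flags)
```

Turning Feature Two on later is then just a flag flip in the central store, with no new binary deploy for A or Z.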
I think your "canary release" is a combination of my "binary release" and "feature release". And what I'm trying to describe is a way to decouple these two things.
...
Honestly, I don't like code written that way.
You are right that all releases are decoupled. But we still need a central component to manage the feature switch for each release, right? It looks like your code is all-in-one; I don't think it is CI/CD friendly.
From the CI/CD side, I prefer this way:
Branch Feature One -> Build -> Deploy -> Feature One Pod
Branch Feature Two -> Build -> Deploy -> Feature Two Pod
Branch Main -> Build -> Deploy -> Main Regular Pod
Then the service mesh schedules traffic to the above instances by colors or other rules.
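That last scheduling step could be sketched roughly as below; the pod names, the header, and the first-match-wins rule are all invented for illustration:

```python
# Hypothetical sketch of the mesh's dispatch step: route a request to
# the Feature One pod, the Feature Two pod, or the main regular pod
# based on its coloring header. First matching tag wins; all names
# here are illustrative.

ROUTES = {
    "feature-one": "feature-one-pod",
    "feature-two": "feature-two-pod",
}

def schedule(headers):
    color = headers.get("X-Canary-Tags", "")
    for tag, pod in ROUTES.items():
        if tag in color.split(","):
            return pod
    return "main-regular-pod"   # uncolored traffic stays on main
```

Each per-feature pod comes from its own branch build, so rolling a feature back is just removing its route and tearing down its pod.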