Unleash/unleash

Gradual rollout strategy not exposing to all callers when set to 100% exposure

jsecor-zus opened this issue · 6 comments

Describe the bug

When using the gradual rollout strategy with Rollout set to 100%, we noticed that some traffic still receives the control behavior and shows up as "not exposed" in the Unleash UI. This affects only roughly 0.007% of requests (23 of 332,833 in the last 48 hours).

This appears to happen when the number of requests exceeds ~2,500 in a given hour. It doesn't happen every time at that volume, but once the volume exceeds 10k per hour, it appears to happen every time, at least in our limited sample.

We are overdue to upgrade our version of Unleash, but I didn't see a previous mention of this bug in the issues or releases. I am planning to upgrade and attempt to reproduce on the latest version.

Steps to reproduce the bug

  1. Create a feature release toggle
  2. Add a "Gradual rollout" strategy
  3. Set the rollout percentage to 100
  4. Send several thousand requests to the feature toggle
  5. Observe actual percentage exposed

Expected behavior

I would expect every request sent to the feature toggle in a high-volume batch to be exposed to the rolled-out feature.

Logs, error output, etc.

See the CSV provided in Additional context.

Screenshots

exposure-chart

Additional context

metric-log.csv

Unleash version

4.19.0 (Open Source)

Subscription type

Open source

Hosting type

Self-hosted

SDK information (language and version)

Golang 1.21.1

Hey, @jsecor-zus! Thanks for reporting this 🙏🏼 I'll look into this to see if I can find out what's happening. But also: with this many requests, is it possible that some of the flag checks happen before the client has been properly initialized?

Depending on your setup, it may be that some requests are hitting servers that have just spun up, meaning they haven't actually received a response from Unleash yet. In that case the flag will fall back to being false unless it is bootstrapped in some way. Could that be happening here? Or are you bootstrapping the SDK with data? Or maybe you're not deploying new instances, so it should always be the same instance (and thus also always ready)?

Little update here 🙋🏼 I've run a setup with the Go SDK locally and I think that the anomalies you're seeing are because the SDK hasn't received its configuration yet.

I initialized Unleash and immediately started checking the flag (did it in a loop with 30,000 iterations). At about 1.2 million checks, I had 1,049 falses. At 1.4 million this number was the same. This makes me think that the first 1,049 checks were false, but the rest were true.

To follow this up and to check it, I created a new feature with the same gradual rollout 100% configuration. After calling Initialize, I made entering the loop conditional on the flag being enabled. Without waiting, the program would exit immediately because the SDK hadn't gotten its config. By adding a 2 second wait, the SDK got the config it needed and entered the loop. Once in, I ran 14 million checks, all of which were true.
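For anyone wanting to reproduce, here's a minimal sketch of this kind of test (the URL, token, app name, and flag name are placeholders, and the loop sizes are illustrative; this isn't my exact script):

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/Unleash/unleash-client-go/v3"
)

func main() {
	// Placeholder URL, token, and flag name for a local Unleash instance.
	err := unleash.Initialize(
		unleash.WithAppName("rollout-repro"),
		unleash.WithUrl("http://localhost:4242/api/"),
		unleash.WithCustomHeaders(http.Header{"Authorization": {"<client-api-token>"}}),
	)
	if err != nil {
		panic(err)
	}

	// Checking immediately after Initialize: early iterations can run
	// before the SDK has fetched its configuration, and those fall back
	// to false even though the flag is rolled out to 100%.
	falses := 0
	for i := 0; i < 1_000_000; i++ {
		if !unleash.IsEnabled("gradual-rollout-100") {
			falses++
		}
	}
	fmt.Println("false evaluations without waiting:", falses)

	// After a short wait (or a call to unleash.WaitForReady()), the
	// configuration has arrived, and every evaluation is true.
	time.Sleep(2 * time.Second)
	falses = 0
	for i := 0; i < 1_000_000; i++ {
		if !unleash.IsEnabled("gradual-rollout-100") {
			falses++
		}
	}
	fmt.Println("false evaluations after waiting:", falses)
}
```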

Is it possible that what you're seeing is the same kind of scenario?

Oh, interesting. Thanks for investigating @thomasheartman! Yes, it is completely possible. We deploy somewhat frequently and this code runs in an AWS Lambda function, so there is a cold start period to initialize new instances. I just compared the spikes in the chart I included previously with our deployment log and auto-scaling actions, and these false events do roughly correspond with Lambda scale-outs. So I think your theory is definitely plausible. I'll do some more testing on my side to see.

If the issue is initializing the connection from the client, would you still expect the false responses to be reported in the Unleash UI? If the client hadn't hit the actual feature toggle logic and short-circuited to false, it would seem more likely that the false value wouldn't even be known by the Unleash server. But that's just what I would expect, so let me know if I'm thinking about the problem incorrectly.

One other question. If it is a problem in initialization, I should be able to reproduce this with any feature strategy, right? Even a simple on/off toggle should exhibit this same behavior? I just want to understand the potential impact, since other teams here follow the same initialization pattern as us and most frequently use simple toggles.

Thanks again for looking into this!

Actually, I just checked, and we are still seeing a small number of these even after switching from the gradual rollout to a standard toggle. So that further backs up your initialization theory.

(screenshot: exposure metrics after switching to a standard toggle)

Happy to help! And yeah, I think what you're seeing is consistent with this.

If the issue is initializing the connection from the client, would you still expect the false responses to be reported in the Unleash UI? If the client hadn't hit the actual feature toggle logic and short-circuited to false, it would seem more likely that the false value wouldn't even be known by the Unleash server.

Once the client has been initialized, it'll start counting metrics. This means that even before you receive the configuration, it'll create a metrics bucket and start storing any flag checks that happen. The metrics won't be sent until the configured interval, but it'll start collecting them as soon as it can. So yes, even if the checks are before the flag configuration, they'll be counted and sent to the Unleash instance.
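If you want to watch this happen locally, here's a small sketch, assuming the Go SDK's DebugListener and the WithMetricsInterval option (treat the exact option names as assumptions to verify against your SDK version; URL, token, and flag name are placeholders):

```go
package main

import (
	"net/http"
	"time"

	"github.com/Unleash/unleash-client-go/v3"
)

func main() {
	// DebugListener logs each count and each metrics send to stdout.
	_ = unleash.Initialize(
		unleash.WithAppName("metrics-demo"),
		unleash.WithUrl("http://localhost:4242/api/"),
		unleash.WithCustomHeaders(http.Header{"Authorization": {"<client-api-token>"}}),
		unleash.WithMetricsInterval(10*time.Second), // flush the bucket every 10s
		unleash.WithListener(&unleash.DebugListener{}),
	)

	// These checks may run before the first configuration fetch completes;
	// they evaluate to false but are still recorded in the metrics bucket.
	for i := 0; i < 100; i++ {
		unleash.IsEnabled("my-feature") // placeholder flag name
	}

	// Keep the process alive long enough for one metrics flush to be sent.
	time.Sleep(15 * time.Second)
}
```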

One other question. If it is a problem in initialization, I should be able to reproduce this with any feature strategy, right? Even a simple on/off toggle should exhibit this same behavior? I just want to understand the potential impact, since other teams here follow the same initialization pattern as us and most frequently use simple toggles.

As you found, yes, you should be able to reproduce this with any kind of strategy. The default in Unleash is to evaluate a flag as false if we don't have a configuration for it.


Depending on your scenario, this may or may not be an issue. There are a couple of ways you could work around it:

  1. Wait for the initialization and configuration sync before continuing the program. I'm not super familiar with Go, but at least many of our other SDKs (such as Java, .NET, Rust) allow you to block on the initialization until it's ready (see the sketch after this list, which also shows option 3). Depending on your use case, that might be acceptable, but it might also not be, because it'll make startup slower (even if only by a little).

  2. Bootstrap the SDK from a file. If you know that for certain flags you want to use a specific strategy, you can store that to a file, distribute it with the application, and read it from there on startup. Aside from Rust, all our server-side SDKs support this.

  3. Use a default fallback value for certain flags. All server-side SDKs support providing a default fallback value if the SDK doesn't know the flag you're trying to check. If you set it to true for this flag, it would default to true, even if the SDK hasn't received its configuration yet.

  4. Just let it be. If this feature is already rolled out to 100% of your users, then it's likely that you'll want to remove the feature flag pretty soon anyway, so it might not be worth investing time into the other solutions. However, if you're using this more as a kill switch (to disable a feature in case something goes wrong), then you might want to consider flipping it, so that the feature is active if the flag is false, and hidden when the kill switch is engaged.
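To make options 1 and 3 concrete, here's a sketch using the Go SDK (app name, URL, token, and flag name are placeholders; WaitForReady and WithFallback are the Go SDK calls I'd reach for, but double-check them against your SDK version):

```go
package main

import (
	"net/http"

	"github.com/Unleash/unleash-client-go/v3"
)

func main() {
	// Placeholder app name, URL, and token.
	_ = unleash.Initialize(
		unleash.WithAppName("my-service"),
		unleash.WithUrl("https://unleash.example.com/api/"),
		unleash.WithCustomHeaders(http.Header{"Authorization": {"<client-api-token>"}}),
	)

	// Option 1: block until the SDK has fetched its configuration. This
	// removes the false window at the cost of a slightly slower startup.
	unleash.WaitForReady()

	// Option 3: give this flag a fallback of true, so any check that
	// happens before the configuration arrives still evaluates to true.
	if unleash.IsEnabled("my-feature", unleash.WithFallback(true)) {
		// feature path goes here
	}
}
```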

So it all depends on what suits you best in your scenario, really. I'd be happy to help if you've got more questions or suggestions 😄

Great, thanks @thomasheartman, that's very helpful information.

I think #1 would be easiest, and I think the WaitForReady function should do the trick. Unfortunately, these initializations occur when our service is under heavy load, so additional startup delay isn't ideal.

I might pursue #2 if I have time, but I'd probably want to set up some automation to call the export API periodically and dump the result in S3. Then bootstrap from S3, which is outlined in the docs.
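For reference, a sketch of what that bootstrap step might look like, assuming the Go SDK's BootstrapStorage supports this in our version (the file path and names are placeholders; in practice the reader would wrap the S3 object rather than a local file):

```go
package main

import (
	"net/http"
	"os"

	"github.com/Unleash/unleash-client-go/v3"
)

func main() {
	// features.json: output of the export API, pulled from S3 (or bundled
	// with the deployment) before Initialize runs. Placeholder path.
	f, err := os.Open("features.json")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	_ = unleash.Initialize(
		unleash.WithAppName("my-lambda"),
		unleash.WithUrl("https://unleash.example.com/api/"),
		unleash.WithCustomHeaders(http.Header{"Authorization": {"<client-api-token>"}}),
		// Seed the SDK with the exported state so early checks don't fall
		// back to false while the first fetch is still in flight.
		unleash.WithStorage(&unleash.BootstrapStorage{Reader: f}),
	)
}
```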

Like you mention, we prefer not to leave these toggles lying around once they've been enabled 100%, so timely cleanup of toggles will hopefully reduce the issue.

Thanks again for your help!