flipt-io/flipt-client-sdks

[Bug]: Boolean threshold percentage is not respected in local evaluation mode

SamuelCabralCruz opened this issue · 13 comments

Bug Description

Boolean threshold flag does not respect the provided percentage in local evaluation mode.

Version Info

/app $ /flipt -v

    _________       __
   / ____/ (_)___  / /_
  / /_  / / / __ \/ __/
 / __/ / / / /_/ / /_
/_/   /_/_/ .___/\__/
         /_/

Version: v1.38.2
Commit: 586a30ca4cf3613bf19646288806e42d7235bb27
Build Date: 2024-03-15T16:33:00Z
Go Version: go1.21.8
OS/Arch: linux/amd64
node -v
v20.10.0

Search

  • I searched for other open and closed issues before opening this

Steps to Reproduce

flipt configuration

version: "1.2"
namespace: e2e
flags:
- key: prd0002-threshold-evaluation
  name: threshold-evaluation
  type: BOOLEAN_FLAG_TYPE
  enabled: false
  rollouts:
  - threshold:
      percentage: 50
      value: true

test

import { afterAll, beforeAll, beforeEach, describe, expect, it } from 'vitest';
import { FliptEvaluationClient } from '@flipt-io/flipt-client';
import * as uuid from 'uuid';

describe('threshold', () => {
  const flagKey = 'prd0002-threshold-evaluation';
  describe(flagKey, () => {
    const n = 10000;
    const p = 0.025;
    const expected: Record<string, number> = { 'true': 0.5, 'false': 0.5 };
    const context = {};
    let client: FliptEvaluationClient;
    let observed: Record<string, number> = {};

    beforeAll(() => {
      client = new FliptEvaluationClient('e2e', { url: 'http://localhost:8080' });
    });

    afterAll(() => {
      client.close();
    });

    beforeEach(async () => {
      for (let i = 0; i < n; i++) {
        const response = client.evaluateBoolean(flagKey, uuid.v4(), context);
        const value = response.result!.enabled.toString();
        observed[value] = (observed[value] ?? 0) + 1 / n;
      }
    });

    it('should match distribution', () => {
      expect(Object.keys(observed)).toEqual(expect.arrayContaining(Object.keys(expected)));
      Object.keys(expected).forEach((k) => {
        expect(Math.abs(observed[k] - expected[k])).toBeLessThan(p);
      });
    });
  });
});

Expected Behavior

I would expect a 50/50 distribution over a large number of evaluations.

Additional Context

No response

Hey @SamuelCabralCruz, could you perhaps share what tolerance you're seeing (i.e. the observed p)?
I see you've got 0.025 as a constant there for the max delta; just curious what the actual observed delta is.
Wondering if it is wildly out, or if that tolerance is just too rigid to be expected for N == 10000.

One observation from taking a walk through both the Go and Rust implementations of our evaluators: there is an off-by-one discrepancy between the two. In Go, we increment the searched-for bucket by 1 before doing a binary search for the target, whereas in Rust we do not.

In Go:
https://github.com/flipt-io/flipt/blob/3f8e7bd62714cd7643a12ad9992d8eab56654d5a/internal/server/evaluation/legacy_evaluator.go#L197-L200

In Rust:

let index = match buckets.binary_search(&(bucket as i32)) {
    Ok(idx) => idx,
    Err(idx) => idx,
};

I wonder how this accumulates for large N in terms of the final distribution.

I'm also wondering if this is to account for slight differences between Go's sort.SearchInts and Rust's binary_search.
Looking at the two, Go will return either the location of the found int, or the insertion location if it is not found.
Rust, on the other hand, returns a Result which disambiguates the two situations. In both cases, however, the enclosed result points to either the match or the highest value lower than the search target (meaning the index before insertion).

I was wrong there: Rust returns the higher offset and matches Go.

TL;DR I do think there are some subtle, yet potentially cumulative differences here that we should rectify.

I'm going to illustrate a boundary scenario that I think would return different results between these two.

Given a distribution [50, 50], both implementations translate this into the range [0, 1000] like so: [500, 1000]. We then take a CRC of the entityID + flagKey and modulo that into the bucket range.
Imagine the result of this hash lands right on the boundary, in this case 500.

In Go, we would increment 500 to 501 before searching an array with two entries [500, 1000].
In this situation, Go returns 1.

Whereas in Rust, we would ultimately do vec![500, 1000].binary_search(&500), which results in Ok(0).
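
That boundary case is easy to check directly. A minimal, self-contained sketch (my own, not the evaluators' code) reproducing both lookups in Rust, with the Go behavior simulated by searching for the incremented bucket:

fn main() {
    let buckets = vec![500, 1000];

    // Rust evaluator: search for the raw bucket value 500.
    // 500 is present at index 0, so binary_search yields Ok(0).
    let rust_index = buckets.binary_search(&500).unwrap_or_else(|idx| idx);

    // Go evaluator: the bucket is incremented to 501 before searching.
    // 501 is absent; its insertion point is index 1, so we get Err(1) -> 1.
    let go_index = buckets.binary_search(&501).unwrap_or_else(|idx| idx);

    println!("rust: {rust_index}, go: {go_index}"); // prints "rust: 0, go: 1"
}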

So we do have an off-by-one discrepancy here between the two, which might be what is throwing your test out.
But knowing that delta would be great.


I reproduced this and got a delta of around 20 😬

Re: the off-by-one discrepancy, I think you did find a bug, but in the variant_evaluation flow. The boolean_evaluation doesn't use the same bucket search method.

One thing I want to do is test the logic/algorithm for a large N in both the Go and Rust implementations, to see if our assumptions are wrong.

Here it is in Go: https://github.com/flipt-io/flipt/blob/3f8e7bd62714cd7643a12ad9992d8eab56654d5a/internal/server/evaluation/evaluation.go#L183-L188

We could write a test that runs only this code in a loop to see if it gives a 50/50 distribution in Go, and then do the same for the Rust code:

let normalized_value = (crc32fast::hash(
    format!(
        "{}{}",
        evaluation_request.entity_id, evaluation_request.flag_key
    )
    .as_bytes(),
) as i32
    % 100) as f32;
if normalized_value < threshold.percentage {

If it's correct in Go and wrong in Rust, then there's a problem with our translation. I'm a bit wary of the casting that we do and the order of operations here:

) as i32
    % 100) as f32;

If it's wrong in Go too, then the overall algorithm is incorrect.

I ran a seemingly equivalent experiment in Go 1000 times over; the p's averaged out to around 0.004:

package main

import (
	"fmt"
	"hash/crc32"
	"math"

	"github.com/google/uuid"
)

var n = 10000.0

func main() {
	var ps []float64
	for j := 0; j < 1000; j++ {
		var counts [2]float64
		for i := 0.0; i < n; i++ {
			// Bucket the crc32 of (random UUID entity ID + flag key "foo") into
			// [0, 1000); Go's int is 64-bit on amd64/arm64, so the cast can't overflow.
			x := (int(crc32.ChecksumIEEE([]byte(uuid.New().String()+"foo"))) % 1000) < 500
			idx := 0
			if x {
				idx = 1
			}
			counts[idx] = counts[idx] + (1.0 / n)
		}

		ps = append(ps, math.Abs(counts[0]-0.5))
	}

	var avg float64
	for _, f := range ps {
		avg += f
	}

	fmt.Println("Average P is", avg/float64(len(ps)))
}

go run main.go
Average P is 0.004024000000000027

When I alter the range boundary by 1 to simulate the off-by-one we see in Rust, the result is almost identical. Definitely not 20.

This is not a perfect reproduction of our bucketing in Go either, but it at least suggests that the 0.025 chosen by @SamuelCabralCruz is a respectable error margin that we should be within in our Rust implementation. Our Go one could still be wrong too.
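
As a rough sanity check on that margin: with a true 50/50 split and n = 10000 draws, the standard error of the observed proportion is sqrt(0.5 * 0.5 / 10000) = 0.005, so the expected absolute deviation is roughly 0.005 * sqrt(2/pi) ≈ 0.004. That lines up with the measured average above, and makes 0.025 a comfortably loose bound.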

Equivalent Rust produces similar results:

fn main() {
    let n: i64 = 10000;
    let mut counts = [0f64; 2];
    for _ in 0..n {
        let uuid = uuid::Uuid::new_v4();
        let uuid_string = uuid.to_string();
        let entity = format!("foo{}", uuid_string);

        let hash = crc32fast::hash(entity.as_bytes()) % 100;
        counts[(hash < 50) as usize] += 1.0 / n as f64;
    }

    println!("{} {}", (counts[0] - 0.5).abs(), (counts[1] - 0.5).abs());
}

➜ cargo run src/main.rs
Compiling disthash v0.1.0 (/Users/georgemac/github/georgemac/disthash)
Finished dev [unoptimized + debuginfo] target(s) in 0.31s
Running target/debug/disthash src/main.rs
0.002999999999960923 0.0030000000000384164

So if you are seeing p ~ 20 then something seems quite wrong in our Rust evaluator.

@markphelps you were pointing at the smoking gun. When I change the code above like so:

fn main() {
    let n: i64 = 10000;
    let mut counts = [0f64; 2];
    for _ in 0..n {
        let uuid = uuid::Uuid::new_v4();
        let uuid_string = uuid.to_string();
        let entity = format!("foo{}", uuid_string);
        
-       let hash = crc32fast::hash(entity.as_bytes()) % 100;
+       let hash = crc32fast::hash(entity.as_bytes()) as i32 % 100;
        counts[(hash < 50) as usize] += 1.0 / n as f64;
    }

    println!("{} {}", (counts[0] - 0.5).abs(), (counts[1] - 0.5).abs());
}

➜ cargo run src/main.rs
Compiling disthash v0.1.0 (/Users/georgemac/github/georgemac/disthash)
Finished dev [unoptimized + debuginfo] target(s) in 0.34s
Running target/debug/disthash src/main.rs
0.25340000000001084 0.25339999999993335

Yep, the u32 is (often, in fact) larger than max i32 and wraps around to a negative value when cast. And since Rust's % takes the sign of the dividend, those negative hashes produce remainders in -99..=0, which always pass the < threshold check and skew the distribution toward true.
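
A minimal standalone sketch of that wraparound (the constant is arbitrary, just any u32 above i32::MAX):

fn main() {
    // Roughly half of all u32 hash values exceed i32::MAX
    // and become negative when cast with `as i32`.
    let hash: u32 = 3_000_000_000;
    let wrapped = hash as i32; // -1_294_967_296

    // Rust's % keeps the dividend's sign, so the bucket is negative too,
    // and a negative bucket is below every positive threshold percentage.
    println!("{}", wrapped % 100); // prints -96
}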

GH autoclosed this, but it should be fixed in the next client SDK release 👍
Thanks for raising it, @SamuelCabralCruz!

@SamuelCabralCruz I just released v0.3.3 (Java) and v0.4.3 (Node and the other client SDKs) with the fix! Thanks for reporting!

and thanks @GeorgeMac for the fix!

@markphelps @GeorgeMac Thanks a lot for the quick support 💯 ❤️ Really appreciated.