[Feature] Add optional hashing functions for different experimental settings
victor-mariano-leite opened this issue · 6 comments
The current implementation of Flagr seems to use a CRC32 mapping to generate the entity hash. As far as I know, CRC32 is commonly used for error detection rather than for A/B-test hashing: as the number of randomization units scales, the likelihood of collisions increases faster than with MD5, for example, potentially generating sample ratio mismatches.
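To make the collision argument concrete, here is a small sketch comparing collision counts of CRC32 and MD5 over the same synthetic ids (the id format and count are illustrative, not taken from Flagr):

```python
import hashlib
import zlib

# 200k synthetic ids is enough for CRC32's 32-bit space to start
# producing birthday collisions, while MD5's 128-bit space stays
# collision-free in practice.
n = 200_000
ids = [f"user-{i}" for i in range(n)]

crc_values = {zlib.crc32(s.encode()) for s in ids}
md5_values = {hashlib.md5(s.encode()).hexdigest() for s in ids}

crc_collisions = n - len(crc_values)
md5_collisions = n - len(md5_values)
print(f"CRC32 collisions: {crc_collisions}, MD5 collisions: {md5_collisions}")
```

Note that a handful of collisions among hundreds of thousands of units only merges a few entities into shared buckets; whether that is enough to cause a detectable sample ratio mismatch is exactly the question raised here.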
To validate this scenario, I've gathered a sample experiment in my company.
First, we wanted to create unique tests, isolating entities between variants inside a flag to avoid confounding effects across experiments. However, as far as I know, there is no isolation between flags, which is a problem we haven't been able to solve effectively yet. It seems to us that the current architecture is more appropriate for multivariate experiments than for A/B testing.
To do this, we've been creating one flag per feature with N variants from the start, to avoid re-allocation of entities when we add a new variant later. Otherwise, a user in Control could move to Treatment: we validated this behavior by creating an A/Control variant, introducing a B variant later, and observing that some users from A went to B. So when we are A/B testing one feature, the flag has 20 variants with only 2 turned on, Control and Treatment, and possibly an Out-Of-Test variant.
We store all of our Flagr data in our Data Lake, so I gathered the sample ratio of a particular experiment (Control/Treatment) and ran a Chi-Square test; it seems there is a mismatch in the sample sizes that is not random.
I suppose this is because of CRC32, but I'm not sure. Is there any way to validate this more consistently in Flagr?
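For reference, a sample-ratio-mismatch check like the one described can be reproduced with a plain Pearson Chi-Square statistic; the counts below are made up for illustration:

```python
def srm_chi_square(observed_control: int, observed_treatment: int,
                   expected_ratio: float = 0.5) -> float:
    """Pearson chi-square statistic for a 50/50 (or other) expected split."""
    total = observed_control + observed_treatment
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    return ((observed_control - expected_control) ** 2 / expected_control
            + (observed_treatment - expected_treatment) ** 2 / expected_treatment)

# Critical value for 1 degree of freedom at alpha = 0.05
CHI2_CRIT_1DF_05 = 3.841

# Hypothetical counts: a 50/50 experiment that collected 10,200 vs 9,800 users
stat = srm_chi_square(10_200, 9_800)
print(stat, stat > CHI2_CRIT_1DF_05)  # 8.0 True -> likely a sample ratio mismatch
```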
I've seen MD5 and the Jenkins hash function used to assign units to their variants, since they are more collision resistant.
Anyway, for flexibility and more general use cases, it would be interesting if we could choose the randomization algorithm, right-sizing it for one's use case.
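A pluggable MD5-based assignment could look like this sketch (`md5_bucket` is a hypothetical name, not a Flagr API):

```python
import hashlib

def md5_bucket(entity_id: str, num_buckets: int = 1000) -> int:
    """Map an entity to a bucket using MD5 instead of CRC32."""
    digest = hashlib.md5(entity_id.encode()).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer
    return int.from_bytes(digest[:8], "big") % num_buckets

# Deterministic: the same entity always lands in the same bucket
assert md5_bucket("user-42") == md5_bucket("user-42")
```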
The closest research I can find is https://michiel.buddingh.eu/distribution-of-hash-values#summary, which has a summary section saying CRC32 is a good choice:
The value space of CRC32 is smaller than that of MD5 or other cryptographic hash functions, but collisions don't affect its distribution, as demonstrated by the experiments in the article. In fact, the maximum number of buckets Flagr supports is 1000, which is significantly smaller than CRC32's range. In an ideal world, when the input `entity_id` has enough entropy, the actual distribution will approximate the distribution you set in the segment. CRC32 is also much faster than cryptographic hash functions, with native CPU instructions.
That said, I agree there's a flexibility need to run different hash functions, and there's room to provide more experimentation results on various input.
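The uniformity claim is easy to spot-check. This sketch approximates the bucketing described above (CRC32 mod 1000); it is not Flagr's exact code path:

```python
import zlib

NUM_BUCKETS = 1000  # Flagr's maximum bucket count

def bucket(entity_id: str) -> int:
    # Stand-in for the described scheme: CRC32 of the id, mod 1000
    return zlib.crc32(entity_id.encode()) % NUM_BUCKETS

# With high-entropy ids, bucket counts come out close to uniform
n = 100_000
counts = [0] * NUM_BUCKETS
for i in range(n):
    counts[bucket(f"user-{i}")] += 1

expected = n / NUM_BUCKETS
max_dev = max(abs(c - expected) for c in counts)
print(f"expected per bucket: {expected:.0f}, max deviation: {max_dev:.0f}")
```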
@victor-mariano-leite I added a test in the openflagr repo to verify my hypothesis #35
I would also check the distribution of `entity_id` in the input data. Because the hashing of `entity_id` is deterministic, an extreme example: if you pass 60% `0` and 40% `1` as the `entity_id` and expect a 50%/50% split, that's not how it works.
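That deterministic behavior is easy to demonstrate. In this sketch (`bucket` is a stand-in, not Flagr's actual function), all the `0` ids land in one bucket and all the `1` ids in another, so the observed split mirrors the input distribution rather than the configured rollout:

```python
import zlib

NUM_BUCKETS = 1000

def bucket(entity_id: str) -> int:
    # Hypothetical stand-in for CRC32-based bucketing
    return zlib.crc32(entity_id.encode()) % NUM_BUCKETS

# Simulate 60% of traffic arriving with entity_id "0", 40% with "1"
traffic = ["0"] * 60 + ["1"] * 40
assignments = [bucket(e) for e in traffic]

# Each distinct id maps to exactly one bucket every time, so the traffic
# concentrates on at most two buckets instead of spreading 50%/50%
print(len(set(assignments)))
```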
Nice! And an interesting point: is it bad practice to use sequential ids (such as auto-generated SQL ids) as the `entity_id`?
That is our case, and I was wondering: if older users are more likely to be assigned to an experiment (since the trigger that assigns a user to a variant bucket is used more often by retained users), maybe the split is biased there as well.
Hi @victor-mariano-leite, @zhouzhuojie, I am very interested in this issue because I am troubleshooting something similar right now. Did you end up finding correlations in treatment assignment under this hash function that would bias your experiments?