abtest-mlops

A/B testing with Machine Learning


A/B Hypothesis Testing

A/B testing, also known as split testing, is a randomized experimentation process in which two or more versions of a variable (a web page, a page element, etc.) are shown to different segments of website visitors at the same time, to determine which version has the greatest impact and drives the business metrics of interest.

Sequential testing

A common issue with classical A/B tests, especially when you want to be able to detect small differences, is that the required sample size can be prohibitively large. In many cases it can take several weeks, months, or even years to collect enough data to conclude a test.

  • The lower the error rates we require, the larger the sample size we need.
  • The smaller the difference we want to detect, the larger the sample size required.
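
To make these two points concrete, the sketch below computes the fixed sample size per group at a 5% significance level and 80% power. It is a minimal sketch, assuming the statsmodels package is available; the baseline and lift numbers are illustrative, not from this project. Note how shrinking the detectable difference inflates the required sample size.

    # Minimal sketch: fixed sample size per group for a two-proportion test.
    # Assumes statsmodels is installed; baseline and lifts are hypothetical.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    baseline = 0.10                      # assumed control conversion rate
    for lift in (0.05, 0.02, 0.01):      # absolute differences to detect
        effect = proportion_effectsize(baseline + lift, baseline)
        n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                         power=0.8, alternative="two-sided")
        print(f"lift={lift:.2f} -> {n:.0f} samples per group")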

Sequential sampling works in a very non-traditional way; instead of a fixed sample size, you collect one observation (or a few) at a time and then test your hypothesis. At each step you do one of the following:

  • Reject the null hypothesis (H0) in favor of the alternative hypothesis (H1) and stop,
  • Accept the null hypothesis (H0) and stop, or
  • Reach neither conclusion with the current observations and continue sampling.
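
This loop can be sketched in a few lines; next_observation() and decide() are hypothetical placeholders standing in for a concrete data source and test statistic.

    # Minimal sketch of the sequential loop. next_observation() yields one
    # data point at a time; decide() maps everything seen so far to
    # "reject_h0", "accept_h0", or "continue".
    def run_sequential_test(next_observation, decide):
        observations = []
        while True:
            observations.append(next_observation())
            outcome = decide(observations)
            if outcome != "continue":
                return outcome, len(observations)  # decision + samples used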

Advantages of sequential testing over classical A/B testing

  • Optimizes the number of observations needed (sample size)
  • Reduces the likelihood of error
  • Offers a chance to finish experiments earlier without increasing the possibility of false results

N.B.: Unlike classical fixed-sample-size tests, where significance is only checked after all samples have been collected, a sequential test continuously checks for significance at every new sample and stops as soon as a significant result is detected, while still guaranteeing the same type I and type II error rates as the fixed-sample-size test.

Common sequential testing algorithms

The Evan Miller sequential procedure for a one-sided test works as follows:

  • Choose a sample size (N) in advance; Evan Miller's sequential A/B testing write-up describes how to pick N.
  • Assign subjects randomly to the treatment and control, with 50% probability each.
  • Track the number of incoming successes from the treatment group. Call this number (T).
  • Track the number of incoming successes from the control group. Call this number (C).
  • If (T-C) reaches (2\sqrt{N}), stop the test. Declare the treatment to be the winner.
  • If (T+C) reaches (N), stop the test. Declare no winner.

The two-sided test is essentially the same, but with an alternate ending:

  • If (T-C) reaches (2.25\sqrt{N}), stop the test. Declare the treatment to be the winner.
  • If (C-T) reaches (2.25\sqrt{N}), stop the test. Declare the control to be the winner.
  • If (T+C) reaches (N), stop the test. Declare no winner.
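
A minimal sketch of both variants, assuming successes arrive one at a time as "treatment"/"control" labels (the stream source is hypothetical; the 2 and 2.25 multipliers are the ones given in the rules above):

    import math

    def evan_miller_test(successes, n, two_sided=False):
        # successes: iterable of "treatment"/"control", one per success
        # n: the sample size N chosen in advance
        bound = (2.25 if two_sided else 2.0) * math.sqrt(n)
        t = c = 0
        for group in successes:
            if group == "treatment":
                t += 1
            else:
                c += 1
            if t - c >= bound:
                return "treatment wins"
            if two_sided and c - t >= bound:
                return "control wins"
            if t + c >= n:
                return "no winner"
        return "inconclusive (stream ended early)"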

N.B.: This test completely ignores the number of failures in each group, which makes it significantly easier to implement in low-conversion settings. However:

  • If we hit the threshold without having reached statistical proof, we cannot continue the experiment.
  • Even if preliminary evidence says that one version is terrible, we will keep losing conversions until we hit an arbitrarily pre-agreed threshold.

Sequential probability ratio testing (SPRT)

SPRT is based on the likelihood ratio statistic.
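
Concretely, after n observations the test tracks the cumulative log-likelihood ratio and compares it against two thresholds derived from the desired error rates (this is the standard Wald formulation; \alpha and \beta are the type I and type II error rates):

    \Lambda_n = \sum_{i=1}^{n} \log \frac{f(x_i \mid H_1)}{f(x_i \mid H_0)},
    \qquad A \approx \log \frac{1-\beta}{\alpha},
    \qquad B \approx \log \frac{\beta}{1-\alpha}

The test accepts H1 when \Lambda_n \geq A, accepts H0 when \Lambda_n \leq B, and continues sampling otherwise, which is exactly the stopping rule spelled out below.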

Variants of SPRT

  • Wald's SPRT draws one pair of observations (one from each group) at each stage and ignores tied pairs, which reduces the problem to a test of a single proportion.
  • Conditional SPRT, in contrast, accounts for tied observations between the two proportions, so it can determine statistical significance from two distributional streams of data.

We focus on conditional SPRT for this challenge.

General steps of conditional SPRT

  1. Calculate the critical upper and lower decision boundaries
  2. Perform a cumulative sum of the observations
  3. Calculate the test statistic (likelihood ratio) for each observation
  4. Calculate the upper and lower limits for the exposed group
  5. Apply the stopping rule below

Stopping Rule

  1. If the log probability ratio is greater than or equal to the upper critical limit, reject the null hypothesis in favor of the alternative (i.e. accept H1 and conclude that version two is better than version one) and terminate the test.
  2. If the log probability ratio is less than or equal to the lower critical limit, accept the null hypothesis (i.e. conclude that there is no difference between the two groups) and terminate the test.
  3. If neither critical limit is reached, collect another observation and continue the test.
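
A minimal sketch of this stopping rule for a single stream of Bernoulli outcomes. This is a simplified Wald-style SPRT rather than the full conditional variant used in the notebooks, and p0, p1, alpha, and beta are illustrative assumptions:

    import math

    def sprt_stream(stream, p0=0.10, p1=0.12, alpha=0.05, beta=0.05):
        # stream: iterable of 1 (conversion) / 0 (no conversion) outcomes
        # p0: rate under H0; p1: rate claimed under H1 (both hypothetical)
        upper = math.log((1 - beta) / alpha)   # upper critical limit
        lower = math.log(beta / (1 - alpha))   # lower critical limit
        llr = 0.0                              # cumulative log probability ratio
        for n, x in enumerate(stream, start=1):
            # add this observation's log-likelihood ratio contribution
            llr += math.log((p1 if x else 1 - p1) / (p0 if x else 1 - p0))
            if llr >= upper:
                return "accept H1: version two is better", n
            if llr <= lower:
                return "accept H0: no difference", n
        return "no decision yet: continue sampling", None

With the default error rates above, the two boundaries work out to roughly +2.94 and -2.94 on the log scale.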