banditml/offline-policy-evaluation

Confidence intervals on OPE estimates

econti opened this issue · 6 comments

Can do this using bootstrapping.

Hi, thanks for the library! I'm really enjoying working with it.

I'm playing with some designs for the bootstrap intervals at the moment and I've come across a problem that I could do with some guidance on. Since the bootstrap samples are drawn completely at random with replacement, you can end up with samples that only contain one of the actions (in the example below, it will typically be a case where all the logged actions chosen were "allowed"). This is fine for evaluating IPS on that sample, but the ML-based methods run into issues with it, since they only have examples of one of the possible states of the target variable to train on.

The options I can think of are:

  1. Only include bootstrap confidence intervals for IPS (this seems like the most viable option to me right now).
  2. Allow it for all methods, but include heavy warnings for the ML-based methods in the documentation (this seems like a poor user experience and would introduce a bunch of gotchas and random failures in library usage, so it feels like a no-go).
  3. Use some other sampling scheme which guarantees that all target classes in the dataframe are represented, for example by rejecting any sample that does not include at least one instance of each target state. You lose the empirical distribution if you do this, which means the confidence intervals would have to be derived differently. I'm not sure this is theoretically possible (it seems like you're losing information on certain possible distributions of actions, which means a CI could not be representative), but I wanted to put the option out there in case anyone has ideas for a sampling scheme that guarantees all target classes are represented and still has a theoretically sound way to get a CI around the reward.
  4. For the ML methods, train the model once on the full dataframe and only bootstrap the samples used to estimate the expected reward with that model (I'm not sure how sound this is, but it would ensure examples of all target actions are available for model training; see the sketch after this list).
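
To make option 4 concrete, here's a rough sketch of what I mean. The fit_reward_model / policy_value / bootstrap_option_4 names and the Ridge model are placeholders of my own, not the library's API; the real direct method reward model would be plugged in instead.

from typing import Callable, List

import pandas as pd
from sklearn.linear_model import Ridge


def featurize(context: dict, action: str) -> pd.DataFrame:
    # Toy featurization matching the sample dataframe in the script below:
    # one context feature plus a binary indicator for the action.
    return pd.DataFrame({
        "p_fraud": [context["p_fraud"]],
        "action_allowed": [int(action == "allowed")],
    })


def fit_reward_model(df: pd.DataFrame) -> Ridge:
    # Train the reward model once on the full dataframe so that every logged
    # action is represented in the training data.
    X = pd.concat(
        [featurize(c, a) for c, a in zip(df["context"], df["action"])],
        ignore_index=True,
    )
    model = Ridge()
    model.fit(X, df["reward"].values)
    return model


def policy_value(sample: pd.DataFrame, policy: Callable, model: Ridge) -> float:
    # Expected reward of the target policy on one bootstrap sample, scored
    # with the shared reward model.
    total = 0.0
    for context in sample["context"]:
        for action, prob in policy(context).items():
            total += prob * float(model.predict(featurize(context, action))[0])
    return total / sample.shape[0]


def bootstrap_option_4(
    df: pd.DataFrame, policy: Callable, num_samples: int = 50
) -> List[float]:
    model = fit_reward_model(df)  # fitted once, outside the bootstrap loop
    num_rows = df.shape[0]
    return [
        policy_value(df.sample(num_rows, replace=True), policy, model)
        for _ in range(num_samples)
    ]

The list of per-sample estimates could then be fed into the same confidence_interval metric as in the script below.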

I have included a sample script below that should trigger the issue (it's unseeded, so it may take a couple of runs).

from typing import Callable, Iterable                                           
                                                                                
from ope.methods.direct_method import evaluate                                  
from pandas import DataFrame                                                    
import pandas as pd                                                             
from scipy.stats import norm                                                    
                                                                                
                                                                                
def confidence_interval(confidence_level: float) -> Callable:
    # norm.interval returns the symmetric z-scores bounding the requested
    # mass of a standard normal, e.g. roughly (-1.96, 1.96) for 0.95.
    lower_norm, upper_norm = norm.interval(confidence_level)

    def get_confidence_interval(series: pd.Series):
        # Normal-approximation interval around the mean of the bootstrap
        # estimates, using their standard deviation as the spread.
        mean = series.mean()
        standard_dev = series.std()
        return {
            "name": f"confidence_interval_{confidence_level}",
            "value": [mean + lower_norm * standard_dev,
                      mean + upper_norm * standard_dev],
        }

    return get_confidence_interval
                                                                                
                                                                                
def bootstrap_samples(dataframe: DataFrame, num_samples: int) -> Iterable[DataFrame]:
    # Standard bootstrap: each sample is the same size as the original data,
    # drawn with replacement.
    num_rows = dataframe.shape[0]
    for _ in range(num_samples):
        yield dataframe.sample(num_rows, replace=True)
                                                                                
                                                                                
def bootstrap_metrics(
    data: DataFrame,
    policy: Callable,
    evaluator: Callable = evaluate,
    num_samples: int = 50,
):
    metrics = [confidence_interval(0.95)]

    # Run the evaluator once per bootstrap sample; the ML-based methods can
    # fail here if a sample happens to contain only one action.
    outcomes = [
        evaluator(sample, policy) for sample in bootstrap_samples(data, num_samples)
    ]

    outcomes_df = pd.DataFrame(outcomes)
    outputs = {}

    # Summarise each metric column across the bootstrap samples.
    for col_name, values in outcomes_df.items():
        column_metrics = {}

        for metric_function in metrics:
            metric = metric_function(values)
            column_metrics[metric["name"]] = metric["value"]

        outputs[col_name] = column_metrics

    return outputs
                                                                                
                                                                                
def action_probabilities(context):
    # Target policy to evaluate: mostly block high-fraud contexts, mostly
    # allow the rest.
    epsilon = 0.10
    if context["p_fraud"] > 0.10:
        return {"allowed": epsilon, "blocked": 1 - epsilon}
    return {"allowed": 1 - epsilon, "blocked": epsilon}
                                                                                
                                                                                
df = pd.DataFrame([
    {"context": {"p_fraud": 0.08}, "action": "blocked", "action_prob": 0.90, "reward": 0},
    {"context": {"p_fraud": 0.03}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.02}, "action": "allowed", "action_prob": 0.90, "reward": 10},
    {"context": {"p_fraud": 0.01}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.09}, "action": "allowed", "action_prob": 0.10, "reward": -20},
    {"context": {"p_fraud": 0.40}, "action": "allowed", "action_prob": 0.10, "reward": -10},
])
                                                                                
print(bootstrap_metrics(df, action_probabilities)) 

Thanks so much for such a thorough thought process and for putting so much time thinking through those 4 solutions, it's really appreciated.

You're totally right, this is an issue for the ML methods when the dataset is small (it probably isn't an issue when the dataset is large, since it's then unlikely that a bootstrap sample will be missing any action entirely).

What do you think about doing this as another option:

Have the bootstrapping parameters be configurable. We could have a proportion parameter, defaulting to 1.0, that tells us what fraction of the data should be used in each bootstrap sample, and a num_samples parameter (like you have) that tells us how many bootstrap samples to draw, with a default of 1. This way, in the default case, no bootstrapping is performed.

If a user chooses a low proportion and the sampling leaves out an action that the reward model needs at evaluation time, we can just throw a descriptive error telling them to increase the proportion parameter to avoid the issue.
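
Roughly what I'm picturing, as a sketch (the proportion and num_samples names are placeholders, and the check and error message are just illustrative):

from typing import Callable, Dict, List

import pandas as pd


def bootstrap_evaluate(
    data: pd.DataFrame,
    policy: Callable,
    evaluator: Callable,
    proportion: float = 1.0,
    num_samples: int = 1,
) -> List[Dict]:
    # Defaults (proportion=1.0, num_samples=1) mean we just run the evaluator
    # once on the full dataframe, i.e. no bootstrapping at all.
    if proportion >= 1.0 and num_samples == 1:
        return [evaluator(data, policy)]

    sample_size = max(1, int(proportion * data.shape[0]))
    results = []
    for _ in range(num_samples):
        sample = data.sample(sample_size, replace=True)
        if set(sample["action"]) != set(data["action"]):
            # Fail loudly with a descriptive message instead of letting the
            # reward model training blow up on a missing action.
            raise ValueError(
                "A bootstrap sample is missing one of the logged actions; "
                "increase the proportion parameter (or use more data)."
            )
        results.append(evaluator(sample, policy))
    return results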

I also like your suggestion in option 4; I think that would work well too. Perhaps the variance of the bootstrap would be a bit underestimated, but I think it would still be pretty good.

What do you think?

Thanks, that's an interesting idea. I'm not 100% sure I understand it though. The definition of the bootstrap in my head is drawing each sample from the data with replacement. Under that definition, the case where the proportion parameter is equal to one and the number of bootstrap samples taken is also equal to one isn't actually equivalent to the non-bootstrap case. For example, if our sample set is {A, B}, then {A, A} and {B, B} are potential output samples.
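
A quick way to see it with that tiny {A, B} example:

from collections import Counter

import pandas as pd

data = pd.Series(["A", "B"])

# Drawing len(data) rows with replacement is not the same as keeping the
# original data: {A, A} and {B, B} each come up about a quarter of the time.
counts = Counter(
    tuple(sorted(data.sample(len(data), replace=True)))
    for _ in range(10_000)
)
print(counts)  # roughly Counter({('A', 'B'): 5000, ('A', 'A'): 2500, ('B', 'B'): 2500})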

I could be misunderstanding here though; let me know if I've misinterpreted it.

There is one other potential issue with the proportion solution (that I can think of): you lose a bit of simplicity in calculating the variance of the reward. Since the samples are smaller, the variance of the reward estimate for each of those samples will be higher, so there would need to be some sort of correction for that.
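
For what it's worth, I'd expect the correction to look roughly like the usual m-out-of-n bootstrap rescaling; sketching the idea (the names here are mine, not anything in the library):

import numpy as np


def rescale_m_out_of_n(bootstrap_estimates, n: int, m: int) -> np.ndarray:
    # Each bootstrap sample has only m rows, so the spread of these estimates
    # reflects a sample size of m rather than n. Shrinking the deviations
    # around the centre by sqrt(m / n) approximates the spread we'd see with
    # full-size (m == n) samples before computing the interval.
    estimates = np.asarray(bootstrap_estimates, dtype=float)
    centre = estimates.mean()
    return centre + np.sqrt(m / n) * (estimates - centre)

For example, with proportion = 0.5 the raw intervals would be about sqrt(2) ≈ 1.4x too wide without something like this.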

I agree with you on option 4. It might underestimate the variance slightly, but as long as users are warned, it should still be a useful estimate of how wide the CI around the estimate is. Right now this one feels like the most solid solution to me too.

Cool, method 4 sounds great to me too.

Let me know if you spot anything in the PR (the first thing that comes to mind is that the confidence interval function returning a function is a little goofy, but that was an aesthetic choice on my end, since I liked how confidence_interval(0.95) looked versus ConfidenceInterval(0.95)). I'm very prepared to change that, or anything else, based on feedback.

Thanks for putting out a PR! I'll review this early next week.