facebookresearch/ReAgent

calculation and meaning of CPE - DM/DR

Jiawen-Yan opened this issue · 6 comments

Hi, I have two questions about CPE estimators.

  1. When deriving those estimators, the logged reward is used in the EvaluationDataPage file. How are the logged rewards calculated? Are they computed by discounting the future expected return by the gamma set in the config file?

  2. How should we interpret the normalized estimators? For example, if we get an estimate of 2, does it mean the new policy is 2 times better than the logged one?

Thanks a lot.

  1. Logged rewards are just the one-step rewards that come from the data. You probably mean logged values? Logged values are calculated in EvaluationDataPage.compute_values: logged value = sum of discounted logged rewards, and the discount factor is indeed the gamma set in the config file (see the sketch after this list).

  2. Yes
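
A minimal sketch of both points above, assuming a flat batch of one-step rewards grouped by episode id. The function names and arguments here are illustrative, not ReAgent's actual API; the real computation lives in EvaluationDataPage.compute_values.

```python
import torch

def compute_logged_values(rewards, mdp_ids, gamma):
    """Illustrative: discounted sum of one-step logged rewards per episode.

    rewards: 1-D float tensor of one-step rewards, ordered by time within each episode.
    mdp_ids: list of episode ids aligned with `rewards`.
    gamma:   discount factor from the config file.
    """
    values = torch.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each step's value is r_t + gamma * value_{t+1}.
    for i in range(len(rewards) - 1, -1, -1):
        if i == len(rewards) - 1 or mdp_ids[i] != mdp_ids[i + 1]:
            running = 0.0  # episode boundary: restart the discounted sum
        running = rewards[i].item() + gamma * running
        values[i] = running
    return values

# Normalized CPE score: estimated value of the target policy divided by the
# logged policy's value. A score of 2.0 means the estimator thinks the target
# policy collects about twice the reward of the logged policy.
def normalized_score(estimated_target_value, logged_value):
    return estimated_target_value / logged_value
```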

For the first question: if the logged rewards are one-step rewards, then the normalizer is also one-step. So at line 250 of doubly_robust_estimator.py, why are we multiplying a discounted direct_method_score by a one-step normalizer?
(screenshot of doubly_robust_estimator.py around line 250)

For the second question, I have two additional, rather strange observations:

  1. If I increase the number of training epochs, the variance of the CPE estimates suddenly increases after some point, as the screenshot below shows. What may cause this to happen?
    (screenshot of the CPE estimate curves)

  2. The reward loss can have very large variance and fluctuates a lot around some value. Is that normal? In our setting, the reward is a monetary value; 80% of the rewards are 0 and 20% are not (they can be as large as 10000 and as small as 10, but are mostly around 100). Log-transforming the reward does not help much here.
    (screenshot of the reward loss curve)

direct_method_score is also computed from one-step information. Line 239 shows that direct_method_score uses model_rewards, which are the one-step rewards predicted by the direct-method model.
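
For concreteness, here is a hedged sketch of the one-step estimators being discussed; the variable names are illustrative and not the exact code in doubly_robust_estimator.py. The direct method averages the reward model's prediction under the target policy, IPS reweights the logged one-step reward, doubly robust combines the two, and the normalizer is the average logged one-step reward, so numerator and denominator are on the same one-step scale.

```python
import torch

def one_step_estimators(logged_rewards, importance_weights,
                        model_rewards, model_rewards_for_logged_action):
    """Illustrative one-step DM / IPS / DR, normalized by the logged reward.

    logged_rewards:                  one-step rewards r_i observed in the data.
    importance_weights:              pi_target(a_i|s_i) / pi_logged(a_i|s_i).
    model_rewards:                   expected model reward under the target policy at s_i.
    model_rewards_for_logged_action: model reward prediction for the logged action a_i.
    """
    direct_method = model_rewards.mean()
    ips = (importance_weights * logged_rewards).mean()
    doubly_robust = (
        model_rewards
        + importance_weights * (logged_rewards - model_rewards_for_logged_action)
    ).mean()

    normalizer = logged_rewards.mean()  # also a one-step quantity
    return {
        "DM": direct_method / normalizer,
        "IPS": ips / normalizer,
        "DR": doubly_robust / normalizer,
    }
```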

Question 2 is quite context-dependent; there could be tons of reasons. Point 1 could be due to too large a learning rate, which causes the policy to change a lot across mini-batches. For point 2, the reward loss actually looks quite stable around 0.86?

So, if I'm understanding this correctly, the direct method, IPS and doubly robust estimates are all one-step values, as indicated at line 290 of doubly_robust_estimator.py, while the sequential doubly robust and weighted sequential doubly robust estimators are multi-step discounted values, as indicated in sequential_doubly_robust_estimator.py and weighted_sequential_doubly_robust_estimator.py. Is this correct? And what about the MAGIC estimator?

Thanks a lot.

One additional question: if the loss has converged, as the graph above shows, why might the CPE values suddenly fluctuate a lot?

So, if I'm understanding this correctly, the direct method, IPS and doubly robust estimates are all one-step values, as indicated at line 290 of doubly_robust_estimator.py, while the sequential doubly robust and weighted sequential doubly robust estimators are multi-step discounted values, as indicated in sequential_doubly_robust_estimator.py and weighted_sequential_doubly_robust_estimator.py. Is this correct? And what about the MAGIC estimator?
Yes
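
For reference, a hedged sketch of the multi-step recursion behind the sequential doubly robust estimator, following Jiang & Li (2016); the field names are illustrative, not the exact structures used in sequential_doubly_robust_estimator.py. (MAGIC, from Thomas & Brunskill 2016, also works on multi-step returns, blending the model-based estimate with partial importance-sampled returns.)

```python
def sequential_dr(episode, gamma):
    """Illustrative backward recursion for sequential doubly robust.

    episode: list of steps (in time order), each a dict with
      r      - logged one-step reward
      rho    - per-step importance weight pi_target(a|s) / pi_logged(a|s)
      v_hat  - model state value estimate V_hat(s)
      q_hat  - model action value estimate Q_hat(s, a) for the logged action
    Returns the DR estimate of the episode's discounted return.
    """
    dr = 0.0
    # DR_t = V_hat(s_t) + rho_t * (r_t + gamma * DR_{t+1} - Q_hat(s_t, a_t))
    for step in reversed(episode):
        dr = step["v_hat"] + step["rho"] * (step["r"] + gamma * dr - step["q_hat"])
    return dr
```

The final estimate is the average of this per-episode value over the evaluation dataset.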

One additional question: if the loss has converged, as the graph above shows, why might the CPE values suddenly fluctuate a lot?
There could be a lot of reasons. The first that comes to my mind is that as training goes on, the loss converges while the policy may still change a lot. This can happen if multiple actions' q-values are close to each other.
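
A tiny illustration of that point (toy numbers, not from ReAgent): when two actions have nearly identical q-values, a negligible parameter update, invisible in the loss curve, can flip the greedy action, and CPE evaluates that flipped policy.

```python
import torch

q_before = torch.tensor([1.000, 0.999])               # two near-tied actions
q_after = q_before + torch.tensor([-0.001, 0.001])    # tiny update, loss barely moves

print(q_before.argmax().item())  # 0
print(q_after.argmax().item())   # 1 -> greedy policy flipped, so CPE estimates can jump
```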