Optimal Baseline Corrections for Off-Policy Contextual Bandits
==============================
Source code for the experiments in the RecSys 2024 full paper "Optimal Baseline Corrections for Off-Policy Contextual Bandits".
The codebase for reproducing the results builds on top of the Open Bandit Pipeline (https://github.com/st-tech/zr-obp).
We include a Jupyter notebook (Paper Plots.ipynb), together with the intermediate result files from different runs of the off-policy learning and evaluation experiments, to reproduce Figures 1-4 from the paper.
We also include additional results for Section 5.4, reporting the empirical variance of the different estimators on the off-policy evaluation task (analogous to Figure 4 in the paper).
These results are in the file "appendix/Evaluation_variance.pdf" in the current folder.
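For background, below is a minimal sketch of inverse propensity scoring (IPS) with a baseline correction, the family of estimators whose variance Section 5.4 compares. The function and its inputs are illustrative and not part of this codebase; the actual estimators come from the Open Bandit Pipeline.

    import numpy as np

    # Minimal sketch (not the repository's implementation): IPS with a scalar
    # baseline subtracted from the reward as a control variate. For any constant
    # baseline c, mean(w * (r - c)) + c is an unbiased estimate of the target
    # policy's value, since the importance weights w have expectation 1 under
    # the logging policy.
    def ips_with_baseline(rewards, target_probs, logging_probs, baseline=0.0):
        w = target_probs / logging_probs      # importance weights
        return np.mean(w * (rewards - baseline)) + baseline

A well-chosen baseline leaves the estimate unbiased while reducing its variance; finding the optimal such correction is the subject of the paper.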
- Create a new conda environment, activate it, and install the project dependencies via the requirements.txt file:
conda create --name recsys_2024
conda activate recsys_2024
pip3 install -r requirements.txt
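To verify the setup, a quick sanity check (assuming requirements.txt installs the Open Bandit Pipeline as the obp package; if the import fails, the environment is not set up correctly):
python -c 'import obp; print(obp.__version__)'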
- In the script:
./examples/obd/evaluate_off_policy_estimators_synt_slurm.py
modify the "base_path" and "log_path" variables to hold the absolute path to the parent folder of the code.
- For off-policy evaluation with an inverse temperature (beta) of 1 for the softmax behavior policy and a logged data size of 1,000,000, run the following command (a sketch of how beta enters the policy follows below):
python examples/obd/evaluate_off_policy_estimators_synt_slurm.py --iteration 1 --N 1000000 --beta 1
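To illustrate the role of beta, here is a minimal sketch of a softmax policy with an inverse temperature; the function and its inputs are illustrative, not the experiment code (which uses the Open Bandit Pipeline's synthetic setup):

    import numpy as np

    # Illustrative softmax behavior policy: pi(a|x) is proportional to
    # exp(beta * score(x, a)). Larger beta concentrates probability on
    # high-scoring actions; beta = 0 yields a uniform policy.
    def softmax_policy(scores, beta=1.0):
        logits = beta * scores
        logits -= np.max(logits)              # shift for numerical stability
        probs = np.exp(logits)
        return probs / probs.sum()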
- For OPL in the mini-batch setup, BanditNet needs to be tuned first. To tune BanditNet with a batch size of 1024, a learning rate of 0.01, and a logged data size of 250,000, run the following script (a sweep sketch follows below):
python examples/opl/tune_banditnet.py --n_rounds 250000 --optimizer adam --batch_size 1024 --learning_rate 0.01
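Tuning over several configurations can be scripted; a sketch below (the hyperparameter values are illustrative, only the CLI flags come from this README):

    import itertools
    import subprocess

    # Illustrative sweep over batch sizes and learning rates for BanditNet tuning.
    for bs, lr in itertools.product([256, 1024], [0.001, 0.01]):
        subprocess.run(["python", "examples/opl/tune_banditnet.py",
                        "--n_rounds", "250000", "--optimizer", "adam",
                        "--batch_size", str(bs), "--learning_rate", str(lr)],
                       check=True)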
- To run all OPL models in the mini-batch setup, launch the following script (a multi-seed sketch follows below):
python examples/opl/evaluate_off_policy_learners.py --n_rounds 250000 --optimizer sgd --batch_size 1024 --learning_rate 0.1 --random_state 12345
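To average results over repetitions, the same script can be launched with several random states; a sketch (the seed values are illustrative):

    import subprocess

    # Illustrative loop over random seeds; the remaining flags match the command above.
    for seed in [12345, 23456, 34567]:
        subprocess.run(["python", "examples/opl/evaluate_off_policy_learners.py",
                        "--n_rounds", "250000", "--optimizer", "sgd",
                        "--batch_size", "1024", "--learning_rate", "0.1",
                        "--random_state", str(seed)],
                       check=True)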
- For OPL in the full-batch setup, run the following command (a conceptual sketch of full-batch vs. mini-batch updates follows below):
python examples/opl/evaluate_off_policy_learners_fullbatch.py --n_rounds 250000 --optimizer adam --batch_size 128 --learning_rate 0.1 --random_state 12345 --epoch 500
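For intuition on the difference between the two setups: the mini-batch setup takes many gradient steps per epoch, one per batch, while the full-batch setup takes a single step per epoch over all logged rounds. A conceptual sketch (illustrative only; not the repository's training loop):

    import numpy as np

    # Illustrative only: one full-batch update vs. one epoch of mini-batch updates.
    def full_batch_step(theta, grad_fn, data, lr=0.1):
        return theta - lr * grad_fn(theta, data)          # gradient over all data

    def mini_batch_epoch(theta, grad_fn, data, batch_size=128, lr=0.1, seed=0):
        idx = np.random.default_rng(seed).permutation(len(data))
        for start in range(0, len(data), batch_size):
            batch = data[idx[start:start + batch_size]]
            theta = theta - lr * grad_fn(theta, batch)    # gradient per mini-batch
        return theta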
To run the experiments on a SLURM cluster:
- For OPE, update the paths in the ope.job file and launch the script.
- For OPL, first run 1) tune_banditnet.job, followed by 2) opl.job (mini-batch) and opl_fullbatch.job (full-batch). A sketch of chaining these submissions follows below.
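The ordering between the tuning and training jobs can also be enforced at submission time; a sketch using standard SLURM flags (the .job file names come from this README, everything else is illustrative):

    import subprocess

    # Submit the tuning job first and chain the OPL jobs on its successful completion.
    tune = subprocess.run(["sbatch", "--parsable", "tune_banditnet.job"],
                          capture_output=True, text=True, check=True)
    job_id = tune.stdout.strip()
    subprocess.run(["sbatch", f"--dependency=afterok:{job_id}", "opl.job"], check=True)           # mini-batch
    subprocess.run(["sbatch", f"--dependency=afterok:{job_id}", "opl_fullbatch.job"], check=True) # full-batch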
If you use our code in your research, please remember to cite our work:
@inproceedings{gupta-2024-optimal,
  author    = {Gupta, Shashank and Jeunen, Olivier and Oosterhuis, Harrie and de Rijke, Maarten},
  booktitle = {RecSys 2024: 18th ACM Conference on Recommender Systems},
  month     = {October},
  publisher = {ACM},
  title     = {Optimal Baseline Corrections for Off-Policy Contextual Bandits},
  year      = {2024}
}