This is an implementation of the preference-based active learning algorithm for contextual bandits described in *Contextual Bandits and Imitation Learning via Preference-Based Active Queries*. The paper considers contextual bandit and imitation learning problems in which the learner never directly observes the reward of the executed action. Under the assumption that the learner has access to a function class that can represent the expert's preference model under an appropriate link function, the paper proposes an algorithm that leverages an online regression oracle over this function class both to choose actions and to decide when to query the expert.
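For intuition, here is a minimal sketch in Python of the select-then-query loop this describes. It is not the repository's `algo.py`: the names `OnlineRegressionOracle`, `pair_features`, `run_round`, and the query threshold `gamma` are simplified, illustrative placeholders for the online regression oracle and the query rule from the paper.

```python
import numpy as np

class OnlineRegressionOracle:
    """Toy online least-squares regressor over (context, action-pair) features.
    Stand-in for the online regression oracle assumed by the paper."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def predict(self, feat):
        return float(self.w @ feat)

    def update(self, feat, label):
        # single SGD step on the squared loss between prediction and preference label
        self.w -= self.lr * (self.predict(feat) - label) * feat

def pair_features(x, a, b, n_actions):
    # illustrative featurization of "action a vs. action b" in context x
    diff = np.zeros(n_actions)
    diff[a] += 1.0
    diff[b] -= 1.0
    return np.concatenate([x, diff])

def run_round(x, oracle, n_actions, expert_prefers, gamma=0.1):
    # score every action against a fixed reference action with the current regressor
    scores = [oracle.predict(pair_features(x, a, 0, n_actions)) for a in range(n_actions)]
    order = np.argsort(scores)
    best, runner_up = order[-1], order[-2]
    # query the expert only when the estimated preference gap between the
    # top two actions is small, i.e., the learner is still uncertain
    if abs(scores[best] - scores[runner_up]) < gamma:
        label = 1.0 if expert_prefers(x, best, runner_up) else 0.0
        oracle.update(pair_features(x, best, runner_up, n_actions), label)
    return best

# toy usage: 4-dim contexts, 3 actions, a synthetic "expert" preference
rng = np.random.default_rng(0)
oracle = OnlineRegressionOracle(dim=4 + 3)
expert = lambda x, a, b: abs(a - np.argmax(x[:3])) <= abs(b - np.argmax(x[:3]))
for _ in range(200):
    run_round(rng.normal(size=4), oracle, n_actions=3, expert_prefers=expert)
```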
```
git clone https://github.com/Cornell-RL/active_CB.git
cd active_CB
pip install numpy pandas torch ucimlrepo
python algo.py
```
This will run the preference learning algorithm on the Iris dataset. To run the reward learning algorithm or to use a different dataset, use the command-line options below:
```
usage: algo.py [-h] [--dataset DATASET] [--query QUERY] [--model MODEL]

optional arguments:
  -h, --help         show this help message and exit
  --dataset DATASET  Name of the dataset (iris/car/knowledge)
  --query QUERY      Query type (active/passive)
  --model MODEL      Model type (reward/preference)
```
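For example, to run the reward learning algorithm with passive queries on the Car Evaluation dataset:

```
python algo.py --dataset car --query passive --model reward
```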
Running 1000 training iterations on the Iris dataset takes roughly three hours including evaluation. Multi-class classification datasets with a larger number of classes are expected to need more episodes to converge and therefore require a longer runtime.
Here are the results on the Iris, Car Evaluation, and User Knowledge Modeling datasets. The hyperparameters required by the algorithm are set in the training loop based on the dataset.
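As a purely hypothetical illustration of what such per-dataset configuration might look like (the names and values below are placeholders, not the settings actually used in `algo.py`):

```python
# Hypothetical per-dataset hyperparameters -- illustrative placeholders only,
# not the values hard-coded in the repository's training loop.
HYPERPARAMS = {
    "iris":      {"episodes": 1000, "lr": 1e-2, "query_threshold": 0.05},
    "car":       {"episodes": 5000, "lr": 1e-2, "query_threshold": 0.02},
    "knowledge": {"episodes": 3000, "lr": 1e-2, "query_threshold": 0.05},
}
```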
```bibtex
@misc{sekhari2023contextual,
  title={Contextual Bandits and Imitation Learning via Preference-Based Active Queries},
  author={Ayush Sekhari and Karthik Sridharan and Wen Sun and Runzhe Wu},
  year={2023},
  eprint={2307.12926},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```