Code for Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning.
Authors: Shenao Zhang¹, Yaqing Wang², Yinxiao Liu², Tianqi Liu², Peter Grabowski³, Eugene Ie³, Zhaoran Wang¹, Yunxuan Li³.
¹Northwestern University, ²Google DeepMind, ³Google.
We introduce a principled RL framework for stitching together plausible strategies, analogous to linearized best-of-N reasoning, but with explicit step-level guidance on when and how LLMs should reflectively explore.

pip install -e .

bash train_barl.sh

@article{zhang2025beyond,
title={Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning},
author={Zhang, Shenao and Wang, Yaqing and Liu, Yinxiao and Liu, Tianqi and Grabowski, Peter and Ie, Eugene and Wang, Zhaoran and Li, Yunxuan},
journal={arXiv preprint arXiv:2505.20561},
year={2025}
}

This repository is built upon the OpenRLHF framework. We thank the authors for their great work.