GROOVE is the official implementation of the following publications:
- *Discovering General Reinforcement Learning Algorithms with Adversarial Environment Design*, NeurIPS 2023 [ArXiv | NeurIPS | Twitter]
  - Learned Policy Gradient (LPG)
  - Prioritized Level Replay (PLR)
  - General RL Algorithms Obtained Via Environment Design (GROOVE)
  - Grid-world environment from the LPG paper
- *Discovering Temporally-Aware Reinforcement Learning Algorithms*, ICLR 2024 [ArXiv]
  - Temporally-Aware LPG (TA-LPG)
  - Evolution Strategies (ES) with antithetic task sampling
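For intuition, the antithetic sampling used in ES meta-training can be illustrated with a minimal NumPy sketch (the `es_gradient` helper is hypothetical, not the repo's `evosax`-based implementation): each perturbation is evaluated at both `theta + sigma * eps` and `theta - sigma * eps`, and in antithetic *task* sampling both members of a pair are evaluated on the same sampled tasks, so shared noise cancels in the difference.

```python
import numpy as np

def es_gradient(f, theta, sigma=0.1, n_pairs=128, seed=0):
    """Antithetic ES estimate of the gradient of E[f(theta + sigma * eps)].

    Hypothetical sketch: each noise vector eps is evaluated at both
    theta + sigma * eps and theta - sigma * eps, so noise shared by the
    pair (e.g. the same sampled tasks) cancels in the difference,
    reducing estimator variance.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_pairs, theta.size))
    f_pos = np.array([f(theta + sigma * e) for e in eps])
    f_neg = np.array([f(theta - sigma * e) for e in eps])
    # Score-function estimate: weight each perturbation by the fitness gap.
    return ((f_pos - f_neg)[:, None] * eps).sum(axis=0) / (2 * sigma * n_pairs)
```

On a simple quadratic `f(x) = ||x||^2`, the estimate converges to the true gradient `2x` as the number of antithetic pairs grows.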
All scripts are JIT-compiled end-to-end and make extensive use of JAX-based parallelization, enabling meta-training in under 3 hours on a single GPU!
Setup | Running experiments | Citation
All requirements are found in `setup/`: `requirements-base.txt` contains the majority of packages, `requirements-cpu.txt` the CPU-specific packages, and `requirements-gpu.txt` the GPU-specific packages.
Some key packages include:
- RL Environments: `gymnax`
- Neural Networks: `flax`
- Optimization: `optax`, `evosax`
- Logging: `wandb`
- Install requirements:
  `pip install $(cat setup/requirements-base.txt setup/requirements-cpu.txt)`
- Build the docker image:
  `cd setup/docker && ./build_gpu.sh && cd ../..`
- (To enable WandB logging) Add your account key to `setup/wandb_key`:
  `echo [KEY] > setup/wandb_key`
Meta-training is executed with `python3 train.py`, with all arguments found in `experiments/parse_args.py`.
- `--log --wandb_entity [entity] --wandb_project [project]` enables logging to WandB.
- `--num_agents [agents]` sets the meta-training batch size.
- `--num_mini_batches [mini_batches]` computes each update in sequential mini-batches, in order to execute large batches with little memory. RECOMMENDED: set this to the smallest value that fits in memory.
- `--debug` disables JIT compilation.
To execute CPU or GPU docker containers, run the relevant script (passing the GPU index as the first argument to the GPU script):
`./run_gpu.sh [GPU id] python3 train.py [args]`
- LPG: `python3 train.py --num_agents 512 --num_mini_batches 16 --log --wandb_entity [entity] --wandb_project [project]`
- GROOVE: LPG with `--score_function alg_regret`
- TA-LPG: LPG with `--num_mini_batches 8 --use_es --lifetime_conditioning`
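For intuition on how PLR (and GROOVE, via `--score_function alg_regret`) prioritizes levels, here is a minimal NumPy sketch of PLR's rank-based replay distribution (the `plr_probs` helper is hypothetical, and a full buffer would also weight by staleness): each level receives a score, levels are ranked by score, and replay probability is proportional to `(1 / rank) ** (1 / temperature)`.

```python
import numpy as np

def plr_probs(scores, temperature=0.3):
    # Rank-based prioritization as in Prioritized Level Replay:
    # the highest-scoring level gets rank 1, and replay probability
    # is proportional to (1 / rank) ** (1 / temperature).
    order = np.argsort(-np.asarray(scores, dtype=np.float64))
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    weights = (1.0 / ranks) ** (1.0 / temperature)
    return weights / weights.sum()
```

Lower temperatures concentrate replay on the highest-scoring levels; a temperature of infinity recovers uniform sampling.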
If you use this implementation in your work, please cite us with the following:
```bibtex
@inproceedings{jackson2023discovering,
    author    = {Jackson, Matthew Thomas and Jiang, Minqi and Parker-Holder, Jack and Vuorio, Risto and Lu, Chris and Farquhar, Gregory and Whiteson, Shimon and Foerster, Jakob Nicolaus},
    booktitle = {Advances in Neural Information Processing Systems},
    title     = {Discovering General Reinforcement Learning Algorithms with Adversarial Environment Design},
    volume    = {36},
    year      = {2023}
}
```
- Meta-testing script for checkpointed models.
- Alternative UED metrics (PVL, MaxMC).