
Code for the paper "Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning"


Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning

Code for the paper "Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning". Please refer to our project page for more demonstrations and up-to-date related resources.


  • 2023-10-09: We released our code.
  • 2023-09-20: We released the paper and website for Text2Reward.


To set up the environment, run the following commands in a shell:

# set up conda
conda create -n text2reward python=3.7
conda activate text2reward
# set up ManiSkill2 environment
cd ManiSkill2
pip install -e .
pip install stable-baselines3==1.8.0 wandb tensorboard
cd ..
cd run_maniskill
bash download_data.sh
# set up MetaWorld environment
cd ..
cd Metaworld
pip install -e .
# set up code generation
pip install langchain chromadb==0.4.0


  1. If you have not installed MuJoCo yet, please follow the instructions from here to install it. Then run the following commands to confirm that the installation succeeded:
$ python3
>>> import mujoco_py
  2. If you encounter any of the following errors when running ManiSkill2, please refer to the documentation here.
    • RuntimeError: vk::Instance::enumeratePhysicalDevices: ErrorInitializationFailed
    • Some required Vulkan extension is not present. You may not use the renderer to render, however, CPU resources will be still available.
    • Segmentation fault (core dumped)
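As a quick sanity check after the setup steps above, a short script can report which packages Python can locate. This is a minimal sketch: the module names in `ASSUMED_MODULES` are assumptions about how the installed packages are imported (for example, `mani_skill2` for ManiSkill2) and may need adjusting. It only looks the modules up without importing them, so it does not replace the `import mujoco_py` check in item 1.

```python
import importlib.util

# Assumed import names for the packages installed above; adjust them if your
# installed versions expose different module names.
ASSUMED_MODULES = [
    "mujoco_py",
    "mani_skill2",
    "metaworld",
    "stable_baselines3",
    "langchain",
    "chromadb",
]


def check_modules(names):
    """Return {module_name: bool}, True when Python can locate the module.

    The modules are looked up but not imported, so import-time failures
    (for example a broken native build) are not detected here.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules(ASSUMED_MODULES).items():
        print(f"{name:20s} {'ok' if found else 'MISSING'}")
```

Any module reported as MISSING points back to the corresponding `pip install` step in the setup commands.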



To reproduce our experimental results, run the following scripts:


bash run_oracle.sh
bash run_zero_shot.sh
bash run_few_shot.sh

It's normal to encounter the following warnings:

[svulkan2] [error] GLFW error: X11: The DISPLAY environment variable is missing
[svulkan2] [warning] Continue without GLFW.


bash run_oracle.sh
bash run_zero_shot.sh

Generate new reward code

First, add the following environment variable to your .bashrc (or .zshrc, etc.):

export PYTHONPATH=$PYTHONPATH:~/path/to/text2reward

Then navigate to the directory text2reward/code_generation/single_flow and run the following scripts:

# generate reward code for ManiSkill2
bash run_maniskill_zeroshot.sh
bash run_maniskill_fewshot.sh
# generate reward code for MetaWorld
bash run_metaworld_zeroshot.sh
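If the generation scripts fail with import errors, it may help to confirm that the `PYTHONPATH` export actually took effect in the current shell. The helper below is a minimal sketch (the function name `on_pythonpath` is hypothetical, not part of this repository) that checks whether a given directory appears among the `PYTHONPATH` entries.

```python
import os
from pathlib import Path


def on_pythonpath(repo_root, pythonpath=None):
    """Return True if repo_root is one of the entries in PYTHONPATH.

    pythonpath defaults to the PYTHONPATH environment variable; passing it
    explicitly makes the function easy to try out on sample values.
    """
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")
    target = Path(repo_root).expanduser().resolve()
    entries = [p for p in pythonpath.split(os.pathsep) if p]
    return any(Path(p).expanduser().resolve() == target for p in entries)


if __name__ == "__main__":
    # Placeholder path copied from the export line above; replace it with
    # the location of your own checkout.
    print(on_pythonpath("~/path/to/text2reward"))
```

If this prints False in a new terminal, the shell startup file has probably not been reloaded yet.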

Run new experiment

By default, the run_oracle.sh script above uses the expert-written rewards provided by the environment, while the run_zero_shot.sh and run_few_shot.sh scripts use the generated rewards from our experiments. To run a new experiment with your own reward, follow the bash scripts above and set the --reward_path parameter to the path of your reward file.
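Before pointing --reward_path at a new file, a quick syntax check can catch broken generated code early. The sketch below assumes the reward file is plain Python source. Since this README does not state which function names the training scripts expect, the hypothetical helper `list_functions` only reports what the file defines.

```python
import ast
from pathlib import Path


def list_functions(reward_path):
    """Parse a Python file and return the names of the functions it defines.

    Raises SyntaxError if the file is not valid Python, which is a cheap
    check to run before launching a long training job.
    """
    source = Path(reward_path).read_text()
    tree = ast.parse(source, filename=str(reward_path))
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
```

For example, `list_functions("my_reward.py")` (a hypothetical file name) returns the defined function names, which can then be compared against the reward files shipped with the experiments.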


If you find our work helpful, please cite us:

  title={Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning},
  author={Xie, Tianbao and Zhao, Siheng and Wu, Chen Henry and Liu, Yitao and Luo, Qian and Zhong, Victor and Yang, Yanchao and Yu, Tao},
  journal={arXiv preprint arXiv:2309.11489},
