In this work, we propose Policy Adaptation from Foundation model Feedback (PAFF). When deploying the trained policy to a new task or a new environment, we first let the policy "play" with randomly generated instructions and record the resulting demonstrations. While the executions may not match the instructions, we can use pre-trained foundation models to provide feedback by relabeling the demonstrations. This automatically yields new demonstration-instruction pairs for policy fine-tuning. We evaluate our method on a broad range of experiments, with a focus on generalization to unseen objects, unseen tasks, and unseen environments, as well as sim-to-real transfer.
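Conceptually, the adaptation loop looks like the sketch below. This is not the repo's actual training code; `rollout`, `foundation_model.retrieve`, and `finetune` are hypothetical placeholders for the components implemented in the stages that follow.

```python
import random

def paff_adapt(policy, env, candidate_instructions,
               rollout, foundation_model, finetune, num_episodes=100):
    """Conceptual PAFF loop: play, relabel with foundation-model feedback, fine-tune.

    Hypothetical interfaces (not this repo's API):
      rollout(policy, env, instruction)               -> (observations, actions)
      foundation_model.retrieve(observations, texts)  -> best-matching instruction
      finetune(policy, dataset)                       -> adapted policy
    """
    dataset = []
    for _ in range(num_episodes):
        # 1. Play: execute the policy on a randomly generated instruction.
        instruction = random.choice(candidate_instructions)
        observations, actions = rollout(policy, env, instruction)
        # 2. Feedback: relabel the demonstration with the instruction that best
        #    describes what the policy actually did.
        relabeled = foundation_model.retrieve(observations, candidate_instructions)
        dataset.append({"obs": observations, "actions": actions, "lang": relabeled})
    # 3. Fine-tune the policy on the relabeled demonstration-instruction pairs.
    return finetune(policy, dataset)
```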
- Python >= 3.8 (Anaconda is recommended)
- PyTorch >= 1.7
- NVIDIA GPU + CUDA
- Clone repo
git clone https://github.com/geyuying/PAFF_code
cd PAFF_code
- Install dependent packages
pip install -r requirements.txt
In this repo, we provide the training code for adapting a policy trained on CALVIN Env A/B/C to Env D. Please refer to HULC for instructions on downloading the CALVIN dataset.
Stage-1: Train your own policy on Env A/B/C of the CALVIN dataset. In this repo, we follow HULC for training the policy, but adopt the pre-trained MDETR model as the visual and language encoder.
Stage-2: Make the policy trained in the first stage "play" with a series of randomly generated language instructions in Env D of the CALVIN dataset. We record these demonstrations, including the visual observations and the actions taken by the trained policy.
cd Play
python hulc/evaluation/evaluate_policy_record.py --dataset_path hulc/dataset/task_ABC_D --train_folder your_trained_policy_folder --last_k_checkpoints 1
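For reference, the recording step conceptually amounts to the loop below. This is a hedged sketch only: `policy.step`, `env`, `sample_instruction`, and the observation keys are hypothetical placeholders, not the actual interfaces of `evaluate_policy_record.py`.

```python
import numpy as np

def record_play_episodes(policy, env, sample_instruction, out_dir,
                         num_episodes=1000, max_steps=360):
    """Roll out the trained policy on randomly generated instructions in Env D
    and store the visual observations and actions for later relabeling.
    `policy`, `env`, and `sample_instruction` are hypothetical placeholders."""
    for ep in range(num_episodes):
        instruction = sample_instruction()          # randomly generated language instruction
        obs = env.reset()
        frames, actions = [], []
        for _ in range(max_steps):
            action = policy.step(obs, instruction)  # act conditioned on the instruction
            obs, _, done, _ = env.step(action)
            frames.append(obs["rgb_static"])        # record the visual observation
            actions.append(action)
            if done:
                break
        # The executed behavior may not match `instruction`; Stage-4 relabels it.
        np.savez(f"{out_dir}/episode_{ep:06d}.npz",
                 frames=np.asarray(frames),
                 actions=np.asarray(actions),
                 sampled_instruction=instruction)
```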
Stage-3: Fine-tune CLIP with a Spatio-Temporal Adapter (ST-Adapter) on Env A/B/C of the CALVIN dataset, so that it can relabel the recorded demonstrations by reasoning about sequential visual observations.
cd CLIP_Finetune
python hulc/training.py datamodule.root_data_dir=hulc/dataset/task_ABC_D ~callbacks/rollout ~callbacks/rollout_lh
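For intuition, an ST-Adapter inserts a lightweight bottleneck with a depthwise 3D convolution so that a frozen image CLIP backbone can mix information across frames. The PyTorch sketch below is illustrative only; the channel sizes, kernel, and placement are assumptions rather than this repo's exact configuration.

```python
import torch
import torch.nn as nn

class STAdapter(nn.Module):
    """Minimal Spatio-Temporal Adapter sketch: down-project, depthwise 3D conv
    over (time, height, width), up-project, with a residual connection.
    Sizes are illustrative, not the repo's exact configuration."""

    def __init__(self, dim=768, bottleneck=384, kernel=(3, 3, 3)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.conv = nn.Conv3d(bottleneck, bottleneck, kernel_size=kernel,
                              padding=tuple(k // 2 for k in kernel), groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x, t, h, w):
        # x: (batch, t*h*w, dim) token sequence from a CLIP vision transformer
        b, n, _ = x.shape
        z = self.down(x)                                   # (b, n, bottleneck)
        z = z.view(b, t, h, w, -1).permute(0, 4, 1, 2, 3)  # (b, bottleneck, t, h, w)
        z = self.conv(z)                                   # depthwise spatio-temporal mixing
        z = z.permute(0, 2, 3, 4, 1).reshape(b, n, -1)
        return x + self.up(z)                              # residual adapter output


# Example: adapt 8 frames of 14x14 patch tokens with 768-dim embeddings.
tokens = torch.randn(2, 8 * 14 * 14, 768)
adapter = STAdapter()
out = adapter(tokens, t=8, h=14, w=14)
print(out.shape)  # torch.Size([2, 1568, 768])
```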
Stage-4: Use the CLIP model fine-tuned in the third stage to relabel the recorded demonstrations by retrieving the best-matching instruction from the set of all possible language instructions.
cd Relabel
python hulc/training.py datamodule.root_data_dir=hulc/record_D ~callbacks/rollout ~callbacks/rollout_lh
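Relabeling is essentially a retrieval problem: encode the recorded frame sequence with the fine-tuned CLIP video branch, encode every candidate instruction with the text branch, and keep the instruction with the highest similarity. A minimal sketch with hypothetical encoder interfaces:

```python
import torch

@torch.no_grad()
def relabel_demonstration(video_encoder, text_encoder, frames, candidate_instructions):
    """Retrieve the instruction that best matches a recorded demonstration.
    `video_encoder` and `text_encoder` stand in for the fine-tuned CLIP
    (with ST-Adapters) from Stage-3; their interfaces are assumptions:
      video_encoder(frames)        -> (1, d) video embedding
      text_encoder(list_of_texts)  -> (k, d) text embeddings
    """
    video_emb = video_encoder(frames)                      # (1, d)
    text_emb = text_encoder(candidate_instructions)        # (k, d)
    # Cosine similarity between the demonstration and every candidate instruction.
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    scores = video_emb @ text_emb.t()                      # (1, k)
    best = scores.argmax(dim=-1).item()
    return candidate_instructions[best], scores[0, best].item()
```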
Stage-5: Fine-tune the policy trained in the first stage on the collected demonstration-instruction data.
cd Policy_Finetune
python hulc/training.py datamodule.root_data_dir=hulc/record_D_after_relabel ~callbacks/rollout ~callbacks/rollout_lh
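The actual fine-tuning reuses the HULC training objective; purely for illustration, a plain behavior-cloning pass over the relabeled pairs would look like the hypothetical sketch below (the batch keys and the `policy` interface are assumptions, not this repo's API).

```python
import torch
import torch.nn.functional as F

def finetune_on_relabeled_data(policy, dataloader, epochs=5, lr=1e-4):
    """Illustrative behavior-cloning loop for Stage-5 (not the HULC objective).
    Each batch is assumed to hold observations, relabeled instructions, and the
    recorded actions; `policy(obs, lang)` is assumed to return predicted actions."""
    optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)
    policy.train()
    for _ in range(epochs):
        for batch in dataloader:
            pred_actions = policy(batch["obs"], batch["lang"])   # condition on the relabeled instruction
            loss = F.mse_loss(pred_actions, batch["actions"])    # imitate the recorded actions
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```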
Our code is based on the implementation of HULC, "What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data": https://github.com/lukashermann/hulc.
The training code in this repo has not been meticulously polished and organized 😳. We hope it can still provide you with some inspiration 😅.
If our code is helpful to your work, please cite:
@inproceedings{ge2023policy,
  title={Policy adaptation from foundation model feedback},
  author={Ge, Yuying and Macaluso, Annabella and Li, Li Erran and Luo, Ping and Wang, Xiaolong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={19059--19069},
  year={2023}
}