Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
Mohit Shridhar, Lucas Manuelli, Dieter Fox
CoRL 2022

PerAct is an end-to-end behavior cloning agent that learns to perform a wide variety of language-conditioned manipulation tasks. PerAct uses a Transformer that exploits the 3D structure of voxel patches to learn policies with just a few demonstrations per task.

The best entry point for understanding PerAct is this Colab Tutorial. If you just want to apply PerAct to your problem, start with the notebook; otherwise, this repo is mostly for reproducing the RLBench results from the paper.

For the latest updates, see: peract.github.io


Hotfix 🔥

  • Training Speed-Up and Storage Memory Reduction: Ishika found that switching from fp32 to fp16 for storing pickle files dramatically speeds up training and significantly reduces memory usage. Check out her modifications to YARR here.



PerAct is built off the ARM repository by James et al. The prerequisites are the same as ARM.

1. Environment

# setup a virtualenv with whichever package manager you prefer
virtualenv -p $(which python3.8) --system-site-packages peract_env  
source peract_env/bin/activate
pip install --upgrade pip

2. PyRep and Coppelia Simulator

Follow instructions from the official PyRep repo; reproduced here for convenience:

PyRep requires version 4.1 of CoppeliaSim. Download:

Once you have downloaded CoppeliaSim, you can pull PyRep from git:

cd <install_dir>
git clone https://github.com/stepjam/PyRep.git
cd PyRep

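Add the following to your ~/.bashrc file (note the 'EDIT ME' in the first line):

export COPPELIASIM_ROOT=EDIT/ME/PATH/TO/COPPELIASIM/INSTALL/DIR
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$COPPELIASIM_ROOT
export QT_QPA_PLATFORM_PLUGIN_PATH=$COPPELIASIM_ROOT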

Remember to source your bashrc (source ~/.bashrc) or zshrc (source ~/.zshrc) after this.

Warning: CoppeliaSim might cause conflicts with ROS workspaces.

Finally install the python library:

pip install -r requirements.txt
pip install .

You should be good to go! You could try running one of the examples in the examples/ folder.

If you encounter errors, please use the PyRep issue tracker.

3. RLBench

PerAct uses my RLBench fork.

cd <install_dir>
git clone -b peract https://github.com/MohitShridhar/RLBench.git # note: 'peract' branch

cd RLBench
pip install -r requirements.txt
python setup.py develop

For running in headless mode, task setups, and other issues, please refer to the official repo.


4. YARR

PerAct uses my YARR fork.

cd <install_dir>
git clone -b peract https://github.com/MohitShridhar/YARR.git # note: 'peract' branch

cd YARR
pip install -r requirements.txt
python setup.py develop

5. PerAct Repo


cd <install_dir>
git clone https://github.com/peract/peract.git


cd peract
pip install git+https://github.com/openai/CLIP.git
pip install -r requirements.txt

export PERACT_ROOT=$(pwd)  # mostly used as a reference point for tutorials
python setup.py develop

Note: You might need versions of torch==1.7.1 and torchvision==0.8.2 that are compatible with your CUDA and hardware. Later versions should also be fine (in theory).


Quickstart

A quick tutorial on evaluating a pre-trained multi-task agent.

Download a pre-trained PerAct checkpoint trained with 100 demos per task (18 tasks in total):

sh scripts/quickstart_download.sh

Generate a small val set of 10 episodes for open_drawer inside $PERACT_ROOT/data:

cd <install_dir>/RLBench/tools
python dataset_generator.py --tasks=open_drawer \
                            --save_path=$PERACT_ROOT/data/val \
                            --image_size=128,128 \
                            --renderer=opengl \
                            --episodes_per_task=10 \
                            --processes=1

This will take a few minutes to finish.

Evaluate the pre-trained PerAct agent:

CUDA_VISIBLE_DEVICES=0 python eval.py \
    rlbench.tasks=[open_drawer] \
    rlbench.task_name='multi' \
    rlbench.demo_path=$PERACT_ROOT/data/val \
    framework.gpu=0 \
    framework.logdir=$PERACT_ROOT/ckpts/ \
    framework.start_seed=0 \
    framework.eval_envs=1 \
    framework.eval_from_eps_number=0 \
    framework.eval_episodes=10 \
    framework.csv_logging=True \
    framework.tensorboard_logging=True \
    framework.eval_type='last'

If you are on a headless machine, turn off the visualization with rlbench.headless=True.

You can evaluate the same agent on other tasks. First generate a validation dataset like above (or download a pre-generated dataset) and then run eval.py.
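
To repeat the quickstart evaluation on other tasks, the command can be assembled programmatically. The sketch below is a hypothetical helper, not part of the repo: build_eval_command and its arguments are illustrative names, and it only reuses the overrides already shown in the quickstart command plus rlbench.headless.

```python
import os
import shlex


def build_eval_command(tasks, demo_path, logdir, episodes=10, headless=True):
    """Assemble the quickstart eval.py call for a list of RLBench task names."""
    overrides = [
        f"rlbench.tasks=[{','.join(tasks)}]",
        "rlbench.task_name=multi",
        f"rlbench.demo_path={demo_path}",
        "framework.gpu=0",
        f"framework.logdir={logdir}",
        "framework.start_seed=0",
        "framework.eval_envs=1",
        "framework.eval_from_eps_number=0",
        f"framework.eval_episodes={episodes}",
        "framework.csv_logging=True",
        "framework.tensorboard_logging=True",
        "framework.eval_type=last",
        f"rlbench.headless={headless}",
    ]
    return ["python", "eval.py", *overrides]


root = os.environ.get("PERACT_ROOT", ".")
cmd = build_eval_command(["open_drawer", "turn_tap"], f"{root}/data/val", f"{root}/ckpts/")
# Print a copy-pasteable shell command instead of launching the simulator.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing the list to subprocess.run(cmd, check=True) from the peract directory would launch the evaluation; a validation set for each listed task must already exist under the demo path.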

Note: The downloaded checkpoint might not be the best one for a given task; it is simply the last checkpoint from training.


Pre-Generated Datasets

We provide pre-generated RLBench demonstrations for the train (100 episodes), validation (25 episodes), and test (25 episodes) splits used in the paper. If you use these datasets directly, you don't need to run tools/dataset_generator.py from RLBench. Using them also helps reproducibility, since dataset_generator.py samples each scene randomly.

Is there one big zip file with all splits and tasks instead of individual files? No. My gDrive account will get rate-limited if everyone is directly downloading huge files. I recommend downloading through rclone with Google API Console enabled. The full dataset of zip files is ~116GB.

Pre-Trained Checkpoints

  • ID: seed0
  • Num Tasks: 18
  • Training Demos: 100 episodes per task (each task includes all variations)
  • Training Iterations: 600k
  • Voxel Size: 100x100x100
  • Cameras: front, left_shoulder, right_shoulder, wrist
  • Latents: 2048
  • Self-Attention Layers: 6
  • Voxel Feature Dim: 64
  • Data Augmentation: 45 deg yaw perturbations

  • ID: seed5
  • Num Tasks: 18
  • Training Demos: 100 episodes per task (each task includes all variations)
  • Training Iterations: 600k
  • Voxel Size: 100x100x100
  • Cameras: front, left_shoulder, right_shoulder, wrist
  • Latents: 512
  • Self-Attention Layers: 6
  • Voxel Feature Dim: 64

See the quickstart guide for how to evaluate these checkpoints. Make sure framework.start_seed matches the checkpoint ID (0 for seed0, 5 for seed5).
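
A quick way to check that the seed matches a downloaded checkpoint is to look for its run directory. The sketch below assumes the layout <logdir>/<task_name>/<method>/seed<ID>, which is inferred from the video path in the Recording Videos section ($PERACT_ROOT/ckpts/multi/PERACT_BC/seed0/...); seed_dir is a hypothetical helper, not a function from the repo.

```python
from pathlib import Path


def seed_dir(logdir, task_name, method, seed):
    """Run directory for one seed, under the assumed logdir layout."""
    return Path(logdir) / task_name / method / f"seed{seed}"


run_dir = seed_dir("ckpts", "multi", "PERACT_BC", 0)
print(run_dir, "exists" if run_dir.is_dir() else "is missing")
```

If the directory is reported missing, framework.start_seed, rlbench.task_name, or framework.logdir probably does not match the checkpoint you downloaded.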

Data Generation

Data generation is pretty similar to the ARM setup, except you use --all_variations=True to sample all task variations:

cd <install_dir>/RLBench/tools
python dataset_generator.py --tasks=open_drawer \
                            --save_path=$PERACT_ROOT/data/train \
                            --image_size=128,128 \
                            --renderer=opengl \
                            --episodes_per_task=100 \
                            --processes=1 \
                            --all_variations=True

You can run these in parallel for multiple tasks. Here is a list of 18 tasks used in the paper (in the same order as results Table 1):


You can probably train PerAct on more RLBench tasks. These 18 tasks were hand-selected for their diversity in task variations and language instructions.
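
Running the generator for several tasks in parallel could look like the sketch below. It is a minimal illustration under stated assumptions: generator_command and run_parallel are hypothetical helpers, the flags are the ones from the data generation command plus the --all_variations=True flag named in the text, and the real run has to be started from <install_dir>/RLBench/tools.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def generator_command(task, save_path, episodes=100):
    """Build the dataset_generator.py call for a single task."""
    return [
        "python", "dataset_generator.py",
        f"--tasks={task}",
        f"--save_path={save_path}",
        "--image_size=128,128",
        "--renderer=opengl",
        f"--episodes_per_task={episodes}",
        "--processes=1",
        "--all_variations=True",
    ]


def run_parallel(commands, max_workers=3, runner=subprocess.run):
    """Run commands concurrently; each worker thread blocks on one subprocess."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


tasks = ["open_drawer", "close_jar", "turn_tap"]
commands = [generator_command(task, "data/train") for task in tasks]
# Dry run: print the commands instead of starting the simulator.
run_parallel(commands, runner=print)
```

Dropping the runner=print argument starts one generator process per task. Keep max_workers modest, since every process launches its own simulator instance.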

Warning: Each scene generated with dataset_generator.py uses a different random seed to configure objects and states in the scene. This means your train, val, and test sets will differ from the pre-generated ones. This should be fine for PerAct, but you will likely see small differences in evaluation performance. For reproducibility, use the pre-generated datasets. Using larger test sets will also help.

Training and Evaluation

The following is a guide for training everything from scratch. All tasks follow a 4-phase workflow:

  1. Generate train, val, test datasets with dataset_generator.py or download pre-generated datasets.
  2. Train agent with train.py and save 10K iteration checkpoints.
  3. Run validation with eval.py with framework.eval_type=missing to find the best checkpoint on val tasks and save results in eval_data.csv.
  4. Evaluate the best checkpoint in eval_data.csv on test tasks with eval.py and framework.eval_type=best. Save final results to test_data.csv.

Make sure you have a train, val, and test set with sufficient demos for the tasks you want to train and evaluate on.
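
The checkpoint selection between phases 3 and 4 amounts to picking the row with the best validation score. eval.py with framework.eval_type=best already does this for you; the sketch below only illustrates the idea on a hypothetical CSV layout (a step column plus one success-rate column per task). The real eval_data.csv columns may be named differently.

```python
import csv
import io


def best_checkpoint(csv_text, step_col="step"):
    """Return (step, mean success) of the row with the highest mean over task columns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    def mean_success(row):
        scores = [float(value) for name, value in row.items() if name != step_col]
        return sum(scores) / len(scores)

    best = max(rows, key=mean_success)
    return int(best[step_col]), mean_success(best)


# Hypothetical log: one row per evaluated checkpoint, one column per task.
example_log = """step,open_drawer,turn_tap
10000,0.25,0.5
20000,0.75,0.5
30000,0.5,0.5
"""
print(best_checkpoint(example_log))  # (20000, 0.625)
```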


Train a PERACT_BC agent with 100 demos per task for 600K iterations with 8 GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py \
    method=PERACT_BC \
    rlbench.tasks=[close_jar,insert_onto_square_peg,light_bulb_in,meat_off_grill,open_drawer,place_cups,place_shape_in_shape_sorter,push_buttons,put_groceries_in_cupboard,put_item_in_drawer,put_money_in_safe,reach_and_drag,stack_blocks,stack_cups,turn_tap,place_wine_at_rack_location,slide_block_to_color_target,sweep_to_dustpan_of_size] \
    rlbench.task_name='multi_18T' \
    rlbench.cameras=[front,left_shoulder,right_shoulder,wrist] \
    rlbench.demos=100 \
    rlbench.demo_path=$PERACT_ROOT/data/train \
    replay.batch_size=1 \
    replay.path=/tmp/replay \
    replay.max_parallel_processes=32 \
    method.voxel_sizes=[100] \
    method.voxel_patch_size=5 \
    method.voxel_patch_stride=5 \
    method.num_latents=2048 \
    method.transform_augmentation.apply_se3=True \
    method.transform_augmentation.aug_rpy=[0.0,0.0,45.0] \
    method.pos_encoding_with_lang=True \
    framework.training_iterations=600000 \
    framework.num_weights_to_keep=60 \
    framework.start_seed=0 \
    framework.log_freq=1000 \
    framework.save_freq=10000 \
    framework.logdir=$PERACT_ROOT/logs/ \
    framework.csv_logging=True \
    framework.tensorboard_logging=True

Make sure there is enough disk space for replay.path and framework.logdir. Adjust replay.max_parallel_processes to fill the replay buffer in parallel based on your resources. You can also train on fewer GPUs, but training will take longer to converge.
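
As a rough sense of scale for the settings above, the number of voxel patch tokens follows from method.voxel_sizes, method.voxel_patch_size, and method.voxel_patch_stride. The arithmetic below is a sketch that assumes non-overlapping cubic patches with no padding and counts only voxel tokens (language and other inputs are ignored).

```python
def patches_per_axis(voxel_size, patch_size, stride):
    """Number of patch positions along one axis, with no padding."""
    return (voxel_size - patch_size) // stride + 1


voxel_size, patch_size, stride, num_latents = 100, 5, 5, 2048
per_axis = patches_per_axis(voxel_size, patch_size, stride)
num_patches = per_axis ** 3
print(per_axis, num_patches)  # 20 8000

# Attention pairs: patches against latents vs. patches against each other.
cross_attention_pairs = num_patches * num_latents
full_self_attention_pairs = num_patches ** 2
print(cross_attention_pairs, full_self_attention_pairs)
```

Under these assumptions, attending from 2048 latents to the patches involves about a quarter of the pairs that full self-attention over all patches would.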

To get started, you should probably train on a small number of rlbench.tasks.

Use tensorboard to monitor training progress with logs inside framework.logdir.


Evaluate PERACT_BC seed0 on 18 val tasks sequentially (slow!):

CUDA_VISIBLE_DEVICES=0 python eval.py \
    rlbench.tasks=[close_jar,insert_onto_square_peg,light_bulb_in,meat_off_grill,open_drawer,place_cups,place_shape_in_shape_sorter,push_buttons,put_groceries_in_cupboard,put_item_in_drawer,put_money_in_safe,reach_and_drag,stack_blocks,stack_cups,turn_tap,place_wine_at_rack_location,slide_block_to_color_target,sweep_to_dustpan_of_size] \
    rlbench.task_name='multi_18T' \
    rlbench.demo_path=$PERACT_ROOT/data/val \
    framework.logdir=$PERACT_ROOT/logs/ \
    framework.csv_logging=True \
    framework.tensorboard_logging=True \
    framework.eval_envs=4 \
    framework.start_seed=0 \
    framework.eval_from_eps_number=0 \
    framework.eval_episodes=25 \
    framework.eval_type='missing'

This script will slowly go through each 10K-interval checkpoint and save success rates in eval_data.csv. To evaluate checkpoints in parallel, use framework.eval_envs to start multiple processes.

Evaluate the best checkpoint (selected on the val tasks) on the 18 test tasks:


CUDA_VISIBLE_DEVICES=0 python eval.py \
    rlbench.tasks=[close_jar,insert_onto_square_peg,light_bulb_in,meat_off_grill,open_drawer,place_cups,place_shape_in_shape_sorter,push_buttons,put_groceries_in_cupboard,put_item_in_drawer,put_money_in_safe,reach_and_drag,stack_blocks,stack_cups,turn_tap,place_wine_at_rack_location,slide_block_to_color_target,sweep_to_dustpan_of_size] \
    rlbench.task_name='multi_18T' \
    rlbench.demo_path=$PERACT_ROOT/data/test \
    framework.logdir=$PERACT_ROOT/logs/ \
    framework.csv_logging=True \
    framework.tensorboard_logging=True \
    framework.eval_envs=1 \
    framework.start_seed=0 \
    framework.eval_from_eps_number=0 \
    framework.eval_episodes=25 \
    framework.eval_type='best'

The final results will be saved in test_data.csv.

Baselines and Ablations

All agents reported in the paper are here along with their respective config files:

Code Name Paper Name

PerAct ablations are set with:

method.no_skip_connection: False
method.no_perceiver: False
method.no_language: False
method.keypoint_method: 'heuristic'


OpenGL Errors

GL errors are probably being caused by the PyRender voxel visualizer. See this issue for reference. You might have to set the following environment variables depending on your setup:

export DISPLAY=:0

Unpickling Error

If you see _pickle.UnpicklingError: invalid load key, '\x9e', one of the replay pickle files was probably corrupted when the training script was interrupted. Try deleting the files in replay.path and restarting training.
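
Before deleting everything, you may want to see which files are actually unreadable. The sketch below is a hypothetical helper, not part of YARR: it assumes the replay files are plain pickles somewhere under replay.path and tries to load each one. The .replay file names in the demo are made up. Only unpickle files you created yourself, since loading a pickle can execute code.

```python
import pickle
import tempfile
from pathlib import Path


def find_corrupt_pickles(replay_dir):
    """Return files under replay_dir that fail to unpickle."""
    corrupt = []
    for path in sorted(Path(replay_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            with path.open("rb") as handle:
                pickle.load(handle)
        except Exception:  # UnpicklingError, EOFError, etc. all mean "unreadable"
            corrupt.append(path)
    return corrupt


# Self-contained demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "good.replay").write_bytes(pickle.dumps({"reward": 1.0}))
    (Path(tmp) / "bad.replay").write_bytes(b"\x9e not a pickle")
    bad_names = [path.name for path in find_corrupt_pickles(tmp)]
print(bad_names)  # ['bad.replay']
```

Pointing find_corrupt_pickles at your replay.path (for example /tmp/replay) lists the files worth removing.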

Recording Videos

To save high-resolution videos of agent executions, set cinematic_recorder.enabled=True with eval.py:

CUDA_VISIBLE_DEVICES=0 python eval.py \
    rlbench.tasks=[open_drawer] \
    rlbench.task_name='multi' \
    rlbench.demo_path=$PERACT_ROOT/data/val \
    framework.gpu=0 \
    framework.logdir=$PERACT_ROOT/ckpts/ \
    framework.start_seed=0 \
    framework.eval_envs=1 \
    framework.eval_from_eps_number=0 \
    framework.eval_episodes=3 \
    framework.csv_logging=True \
    framework.tensorboard_logging=True \
    framework.eval_type='last' \
    rlbench.headless=True \
    cinematic_recorder.enabled=True

Videos will be saved at $PERACT_ROOT/ckpts/multi/PERACT_BC/seed0/videos/open_drawer_w600000_s0_succ.mp4.

Note: Rendering at high resolution is very slow and will take a long time to finish.

Disclaimers and Limitations

  • Code quality level: Desperate grad student.
  • Why isn't your code more modular?: My code, like this project, is end-to-end.
  • Small test set: The test set should be larger than just 25 episodes. If you parallelize the evaluation, you can easily evaluate on larger test sets and do multiple runs with different seeds.
  • Parallelization: A lot of things (data generation, evaluation) are slow because everything is done serially. Parallelizing these processes will save you a lot of time.
  • Impossible tasks: Some tasks like push_buttons are not solvable by PerAct since it doesn't have any memory.
  • Switch from DP to DDP: For the paper submission, I was using PyTorch DataParallel for multi-gpu training. For this code release, I switched to DistributedDataParallel. Hopefully, I didn't introduce any new bugs.
  • Collision avoidance: All simulated evaluations use V-REP's internal motion-planner with collision avoidance. For real-world experiments, you have to set up MoveIt to use the voxel grid for avoiding occupied voxels.
  • YARR Modifications: My changes to the YARR repo are a total mess. Sorry :(
  • LAMB Optimizer: The LAMB implementation has some issues but still works 🤷. Maybe use FusedLAMB instead.
  • Other limitations: See Appendix L of the paper for more details.


How much training data do I need for real-world tasks?

It depends on the complexity of the task. With 10-20 demonstrations the agent should start to do something useful, but it will often make mistakes such as picking the wrong object. For robustness you probably need 50-100 demonstrations. A good way to gauge how much data you might need is to set up a simulated version of the problem and evaluate agents trained with 10, 100, and 250 demonstrations.

How long should I train the agent for? When will I start seeing good evaluation performance?

This depends on the number, complexity, and diversity of tasks, and also how much compute you have. Take a look at this checkpoint folder containing train_data.csv, eval_data.csv and test_data.csv. These log files should give you a sense of what the training losses look like and what evaluation performances to expect. All multi-task agents in the paper were trained for 600K iterations, and single-task agents were trained for 40K iterations, all with 8-GPU setups.

Why doesn't the agent follow my language instruction?

This means either there is some sort of bias in the dataset that the agent is exploiting (e.g. always 'blue blocks'), or you don't have enough training data. Also make sure that the task is doable: if a referred attribute is barely legible in the voxel grid, it's going to be hard for the agent to figure out what you mean.

How to pick the best checkpoint for real-robot tasks?

Ideally, you should create a validation set with held-out instances and then choose the checkpoint with the lowest translation and rotation errors. You can also reuse the training instances but swap the language instructions with unseen goals. That said, all real-world experiments in the paper simply used the last checkpoint.
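
One generic way to measure such errors on continuous poses is sketched below. This is an assumption for illustration, not the metric used in the paper or the repo: translation error is the Euclidean distance between positions, and rotation error is the angle between two unit quaternions.

```python
import math


def translation_error(pred_xyz, true_xyz):
    """Euclidean distance between predicted and demonstrated positions."""
    return math.dist(pred_xyz, true_xyz)


def rotation_error(pred_quat, true_quat):
    """Angle in radians between two unit quaternions (same component order for both)."""
    dot = abs(sum(p * t for p, t in zip(pred_quat, true_quat)))
    return 2.0 * math.acos(min(1.0, dot))


identity = (0.0, 0.0, 0.0, 1.0)
quarter_turn_z = (0.0, 0.0, math.sin(math.pi / 4), math.cos(math.pi / 4))
print(translation_error((0.0, 0.0, 0.0), (0.03, 0.04, 0.0)))
print(math.degrees(rotation_error(identity, quarter_turn_z)))
```

Averaging both errors over the held-out keyframes for each checkpoint gives a simple ranking; the absolute value of the dot product makes q and -q count as the same rotation.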

Can you replace the motion-planner with a learnable module?

Yes, see C2FARM+LPR by James et al.

Why do I need to generate a val and test set?

Two reasons: (1) One-to-one comparisons between two agents. We can take an episode from the test dataset, and use its random seed to spawn the exact same objects and object pose configurations every time. (2) Checking if the task is actually solvable, at least by an expert. We don't want to evaluate on unsolvable task instances. See issue3 for reference.

Why are duplicate keyframes loaded into the replay buffer?

This is a design choice in ARM (by James et al). I am guessing the keyframes get added several times because they indicate important "phase transitions" between trajectory bottlenecks, and having several copies makes them more likely to be sampled. See issue6.
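
The effect of duplication on sampling can be shown with a small calculation. The sketch assumes uniform sampling over stored entries, which is an assumption made for illustration rather than a statement about how the replay buffer actually samples; the buffer size and copy count are made up.

```python
from fractions import Fraction


def sampling_probability(copies, buffer_size):
    """Chance that one uniform draw from the buffer returns a given transition."""
    return Fraction(copies, buffer_size)


# Hypothetical buffer with 100 stored entries.
print(sampling_probability(1, 100))  # 1/100
print(sampling_probability(4, 100))  # 1/25
```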

The training is too slow and the replay pickle files take up too much space. What should I do about this?

Ishika found that switching from fp32 to fp16 for storing pickle files dramatically speeds up training and significantly reduces memory usage. Check out her modifications to YARR here.
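
The storage saving is easy to see in isolation. The sketch below is not the YARR change itself: it uses only the standard library (struct format 'f' for fp32 and 'e' for fp16) on made-up values to show that the pickled payload roughly halves, at the cost of a small rounding error.

```python
import pickle
import struct


def pack_floats(values, fmt):
    """Pack floats as little-endian binary; fmt 'f' is fp32 and 'e' is fp16."""
    return struct.pack(f"<{len(values)}{fmt}", *values)


values = [i / 1000.0 for i in range(10_000)]  # stand-in for one observation array
fp32_blob = pickle.dumps(pack_floats(values, "f"))
fp16_blob = pickle.dumps(pack_floats(values, "e"))

# Round-trip the fp16 payload to measure the precision lost.
restored = struct.unpack(f"<{len(values)}e", pickle.loads(fp16_blob))
max_error = max(abs(a - b) for a, b in zip(values, restored))
print(len(fp32_blob), len(fp16_blob), max_error)
```

Whether the precision loss is acceptable depends on the data being stored; fp16 cannot represent magnitudes above 65504 and keeps only about three significant decimal digits.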

Will you release your real-robot code for data-collection and execution?

Check out franka_htc_teleop.zip for real-robot code. peract_demo_interface.py is for collecting data, and peract_agent_interface.py is for executing trained models. The real-robot datasets are here. See issue18 for more details on the setup, and issue2 for real-world setup details.

Docker Guide

Coming soon...


  • Colab Tutorial: This tutorial is a good starting point for understanding the data-loading and training pipeline.
  • Dataset Visualizer: Coming soon ... see Colab for now.
  • Q-Prediction Visualizer: Coming soon ... see Colab for now.
  • Results Notebook: Coming soon ...

Hardware Requirements

PerAct agents for the paper were trained with 8 P100 cards with 16GB of memory each. You can use fewer GPUs, but training will take a long time to converge.

Tested with:

  • GPU - NVIDIA P100
  • CPU - Intel Xeon (Quad Core)
  • RAM - 32GB
  • OS - Ubuntu 16.04, 18.04

For inference, a single GPU is sufficient.


This repository uses code from the following open-source projects:


ARM

Original: https://github.com/stepjam/ARM
License: ARM License
Changes: Data loading was modified for PerAct. Voxelization code was modified for DDP training.


PerceiverIO

Original: https://github.com/lucidrains/perceiver-pytorch
License: MIT
Changes: PerceiverIO adapted for 6-DoF manipulation.


ViT

Original: https://github.com/lucidrains/vit-pytorch
License: MIT
Changes: ViT adapted for baseline.

LAMB Optimizer

Original: https://github.com/cybertronai/pytorch-lamb
License: MIT
Changes: None.


CLIP

Original: https://github.com/openai/CLIP
License: MIT
Changes: Minor modifications to extract token and sentence features.

Thanks for open-sourcing!


Questions or Issues?

Please file an issue with the issue tracker.