/LaNPM-Dataset

A robotics mobile manipulation dataset that includes natural language, navigation, manipulation, and perception data for every trajectory.


LaNPM Dataset Benchmark

Under Review
Website | arXiv (Coming Soon) | RSS24 Workshop Paper | Model Checkpoints | Dataset | Model Card

Sequential timesteps of images from robot trajectories collected in simulation and in the real world, along with the natural language command describing each task.

As robots that follow natural language become more capable and prevalent, we need a benchmark to holistically develop and evaluate their ability to solve long-horizon mobile manipulation tasks in large, diverse environments. Robots must use visual and language understanding, navigation, and manipulation capabilities to tackle this challenge. Existing datasets do not integrate all these aspects, restricting their efficacy as benchmarks. To address this gap, we present the Language, Navigation, Manipulation, Perception (LaNMP) dataset and demonstrate the benefits of integrating these four capabilities and various modalities. LaNMP comprises 574 trajectories across eight simulated and real-world environments for long-horizon room-to-room pick-and-place tasks specified by natural language. Every trajectory consists of over 20 attributes, including RGB-D images, segmentations, and the poses of the robot body, end-effector, and grasped objects. We fine-tuned and tested two models in simulation and on a physical robot to demonstrate its efficacy in development and evaluation. The models perform suboptimally compared to humans across various metrics, indicating significant room for developing better multimodal mobile manipulation models using our benchmark.

Dataset Format

More detailed dataset information can be found in the dataset card DataCard.md.

Download the dataset from this DropBox.

Code that opens, reads, and displays the dataset contents can be found in this Google Colab notebook.

Sim Dataset

The simulation dataset comes in a single hdf5 file, and has the following hierarchy:

sim_dataset.hdf5/
├── data_11:11:28/
│   ├── folder_0
│   ├── folder_1
│   └── folder_2
├── data_11:14:08/
│   ├── folder_0
│   └── ...
└── ...

Under each folder, there are three main numpy files: depth_<num>, inst_seg_<num>, and rgb_<num>, which correspond to the depth image, segmentation image, and rgb image, respectively.

The metadata of each folder holds a dumped JSON string describing the remaining attributes of that time step. The detailed metadata can be found in the dataset card.
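The hierarchy above can be walked with a few small helpers. The following is a minimal sketch, not the repository's own loader: it assumes that the <num> suffix of depth_<num>, inst_seg_<num>, and rgb_<num> matches the index of the enclosing folder_<num>, and it relies on the third-party h5py package (imported lazily) to open the file. All helper names are hypothetical.

```python
import re


def step_index(folder_name):
    """Return the integer suffix of a name such as 'folder_12'."""
    match = re.fullmatch(r"folder_(\d+)", folder_name)
    if match is None:
        raise ValueError(f"unexpected folder name: {folder_name!r}")
    return int(match.group(1))


def ordered_steps(trajectory):
    """Sort the 'folder_<n>' keys of one trajectory numerically.

    A plain string sort would place 'folder_10' before 'folder_2'.
    `trajectory` can be any mapping, e.g. an h5py.Group or a dict.
    """
    return sorted(trajectory.keys(), key=step_index)


def frame_keys(folder_name):
    """Build the three array names expected in one folder (assumed suffix rule)."""
    n = step_index(folder_name)
    return {"depth": f"depth_{n}", "inst_seg": f"inst_seg_{n}", "rgb": f"rgb_{n}"}


def load_rgb_frames(path, trajectory_name):
    """Read every RGB frame of one trajectory, in time order."""
    import h5py  # third-party dependency, imported only when a file is opened

    frames = []
    with h5py.File(path, "r") as f:
        trajectory = f[trajectory_name]
        for name in ordered_steps(trajectory):
            # [()] reads the whole dataset into a numpy array.
            frames.append(trajectory[name][frame_keys(name)["rgb"]][()])
    return frames
```

With the file downloaded, a call such as load_rgb_frames("sim_dataset.hdf5", "data_11:11:28") would return the RGB frames of that trajectory, provided the naming assumption holds.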

Real Dataset

Similarly, the real dataset also comes in a single hdf5 file, and has the following hierarchy:

real_dataset.hdf5/
└── FloorTrajectories/
    ├── data_00/
    │   ├── folder_10/
    │   │   ├── gripper_depth_10
    │   │   ├── gripper_image_10
    │   │   ├── left_fisheye_depth_10
    │   │   ├── left_fisheye_image_10
    │   │   ├── right_fisheye_depth_10
    │   │   ├── right_fisheye_image_10
    │   │   └── metadata
    │   └── folder_11/
    │       ├── gripper_depth_10
    │       ├── gripper_image_10
    │       └── ...
    ├── data_01/
    │   └── folder_10/
    │       └── ...
    └── ...

Note that the right fisheye is located on the right side of the robot, but points towards the left side. So the right fisheye produces the left half of the image, and the left one produces the right half.

The images have the following sizes:

key                      shape
gripper_depth_10         (480, 640)
gripper_image_10         (480, 640, 3)
left_fisheye_depth_10    (240, 424)
left_fisheye_image_10    (640, 480, 3)
right_fisheye_depth_10   (240, 424)
right_fisheye_image_10   (640, 480, 3)
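The listed sizes can double as a sanity check after loading a folder. The sketch below is one possible way to do that, under the assumption that the numeric suffix of each key (here _10) is the timestep index and can be stripped to find the expected shape. The function names are hypothetical, and the stand-in objects only mimic the .shape attribute of a real array.

```python
from types import SimpleNamespace

# Expected shapes from the table above, keyed without the timestep suffix.
EXPECTED_SHAPES = {
    "gripper_depth": (480, 640),
    "gripper_image": (480, 640, 3),
    "left_fisheye_depth": (240, 424),
    "left_fisheye_image": (640, 480, 3),
    "right_fisheye_depth": (240, 424),
    "right_fisheye_image": (640, 480, 3),
}


def base_key(key):
    """Drop the trailing '_<num>' from a key such as 'gripper_depth_10'."""
    return key.rsplit("_", 1)[0]


def shape_mismatches(folder):
    """Map each offending key to (actual shape, expected shape).

    `folder` is any mapping from key to an object with a .shape attribute.
    Keys without a known expected shape (e.g. 'metadata') are skipped.
    """
    problems = {}
    for key, value in folder.items():
        expected = EXPECTED_SHAPES.get(base_key(key))
        if expected is not None and tuple(value.shape) != expected:
            problems[key] = (tuple(value.shape), expected)
    return problems


# Stand-in arrays: only the .shape attribute matters for the check.
demo_folder = {
    "gripper_depth_10": SimpleNamespace(shape=(480, 640)),
    "left_fisheye_image_10": SimpleNamespace(shape=(480, 640, 3)),  # deliberately wrong
}
print(shape_mismatches(demo_folder))
```

Because the check only reads .shape, the same function should work on numpy arrays or h5py datasets without modification.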

The detailed metadata can be found in the dataset card.

Running Data Collection

Simulation (AI2THOR)

  1. cd collect_sim
  2. pip install -r sim_reqs.txt
  3. cd custom_ai2thor_lib_code
  4. Move the files to the ai2thor library folder in the virtual environment
  5. Collect data: python mani.py --scene "<scene number>" --command "<natural language command>". Use the following keys to move in the simulator:
  • WASD: move the robot base
  • J/L: rotate the robot left/right
  • I/K: move the robot head up/down
  • G: grasp
  • R: release
  • Up arrow/down arrow: move robot shoulder up/down
  • 7/4: move end-effector left/right
  • 8/5: move end-effector up/down
  • 9/6: move end-effector forward/backward
  • Q: end collection and save data
  • CTRL+C: restart collection without saving

Real (Spot)

  1. cd collect_real
  2. conda create --name <env> --file spot_env.txt
  3. Create a map using python record_env_graph.py. See this for more details on how to record the map.
  4. Collect data using the map: python collect_spot_data.py -u <map folder> -t "<natural language command>"
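Natural language commands often contain spaces, apostrophes, or quotes, which makes hand-typed shell quoting error-prone. The sketch below builds the two collection commands as argument lists, using only the script names and flags shown in the steps above. The helper names, the map folder "maps/floor3", and the example command are hypothetical.

```python
import shlex


def sim_collect_command(scene, command):
    """Argument list for the simulation collector (run from collect_sim)."""
    return ["python", "mani.py", "--scene", str(scene), "--command", command]


def spot_collect_command(map_folder, command):
    """Argument list for the Spot collector (run from collect_real)."""
    return ["python", "collect_spot_data.py", "-u", map_folder, "-t", command]


cmd = spot_collect_command("maps/floor3", "Take the mug from the kitchen and don't drop it")

# shlex.join produces a string that is safe to paste into a POSIX shell.
print(shlex.join(cmd))
```

The list form can also be passed directly to subprocess.run(cmd, cwd="collect_real"), which avoids shell quoting altogether.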

RT-1

The RT-1 model from the paper "RT-1: Robotics Transformer for Real-World Control at Scale" by Brohan et al. was modified and fine-tuned on LaNMP. This model was trained and run on an NVIDIA 3090 GPU.

This is a forked implementation of RT-1 (Robotics Transformer), originally inspired by the Google Research paper.

This implementation of RT-1 was pretrained on the Bridge dataset and further fine-tuned on our LaNMP dataset for evaluation. Details of the repository are given below.

Setup Instructions

git clone git@github.com:h2r/LaNPM-Dataset.git
cd LaNPM-Dataset/models/main_models/rt1
pip install -e .

Overview of files

This repository has 7 critical files/folders whose use cases are described below:

  1. main.py: used to pretrain RT-1 on the Bridge dataset. Modifying this file to accommodate different datasets requires changing the observation_space and action_space according to the dataset being loaded, as well as changing the dataset keys in rt1_pytorch/tokenizers/action_tokenizer.py. Running this file saves a series of checkpoints and logs losses using Weights & Biases.
  2. main_ft.py: used to finetune RT-1 on the LaNMP dataset. This file has the observation_space, action_space, and PyTorch DataLoader already modified to accommodate LaNMP dataset finetuning (AI2Thor). Running this file saves a series of checkpoints and logs losses using Weights & Biases.
  3. main_ft_eval.py: used to run RT-1 in inference mode on the LaNMP dataset. This file has the observation_space, action_space, and PyTorch DataLoader already modified to accommodate the LaNMP dataset (AI2Thor). The file iterates over all saved checkpoints from finetuning and runs RT-1 in inference mode on the validation dataset for each checkpoint. The script logs the test losses using Weights & Biases.
  4. ai2thor_env.py: contains a Gym-style environment class to load and take steps in the AI2Thor environment. This file is used to generate real-time trajectories based on the action tokens generated by a finetuned RT-1 model (specific to AI2Thor). The main step() function executes the action generated by RT-1 and returns a success message along with information about the environment state, e.g. object or agent metadata, which can be saved to capture the trajectory taken by the agent for a given task.
  5. rollout_ai2thor.py: interfaces between the finetuned RT-1 model (from a loaded checkpoint after finetuning on LaNMP) and the ai2thor_env.py Gym environment, in order to send observations from the AI2Thor environment to RT-1 and execute the action tokens proposed by RT-1 in AI2Thor. Note that this file should not be run on a headless machine, since it requires and deploys the AI2Thor simulator GUI.
  6. rt1_pytorch/rt1_policy.py: contains the RT-1 model implementation in PyTorch. The loss() function performs the forward pass of RT-1 for training and the act() function performs the forward pass during inference.
  7. lanmp_dataloader/rt1_dataloader.py: contains the DatasetManager class that extracts trajectories from the LaNMP sim_data.hdf5 dataset file. The script automatically separates train and validation subsets according to different splits, e.g. k-fold by scene, task-wise, or for diversity ablation. The DatasetManager also handles tokenizing/detokenizing the raw trajectory data into 256 discrete buckets, and chunks trajectories into non-overlapping windows of 6 steps.
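The tokenizing and chunking described for the DatasetManager can be illustrated with a small, dependency-free sketch. This is not the repository's implementation: it assumes uniform bins over a known value range, detokenizes to the bin center, and keeps a shorter final window when the trajectory length is not a multiple of 6. The function names and the [-1, 1] range used in the example are hypothetical.

```python
def tokenize(value, low, high, bins=256):
    """Map a continuous value in [low, high] to a bucket index in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # fraction == 1.0 would give index `bins`, so clamp to the last bucket.
    return min(int(fraction * bins), bins - 1)


def detokenize(token, low, high, bins=256):
    """Map a bucket index back to the center of its bucket."""
    width = (high - low) / bins
    return low + (token + 0.5) * width


def chunk(steps, window=6):
    """Split a trajectory into non-overlapping windows of `window` steps.

    Assumption: a shorter trailing window is kept rather than dropped or padded.
    """
    return [steps[i:i + window] for i in range(0, len(steps), window)]


tokens = [tokenize(v, -1.0, 1.0) for v in (-1.0, 0.0, 1.0)]
windows = chunk(list(range(14)))
print(tokens, [len(w) for w in windows])
```

Detokenizing loses at most half a bucket width of precision, which is the usual trade-off for predicting actions as discrete classes with a cross-entropy loss.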

Details about file arguments

Most relevant files in this repository accept the same set of arguments, detailed below:

  • dataset: only for the main.py file, specifies the dataset on which the RT-1 model should be pretrained
  • train-split: specifies what fraction of the loaded dataset should be used for training vs. evaluation
  • eval-split: specifies what fraction of the loaded dataset should be used for evaluation vs. training
  • epochs: total number of passes over all batches of the training set
  • lr: learning rate for cross-entropy loss of RT1
  • train-batch-size: the number of trajectories from which to sample data for the current training batch
  • eval-batch-size: the number of trajectories from which to sample data for the current evaluation batch
  • trajectory-length: the window size (context history of trajectory-length previous images) used for each trajectory when feeding data to the RT-1 model; this is set to 6 based on the RT-1 implementation
  • sentence-transformer: the language embedding to apply on the language-specified task
  • device: the device to load the model/data onto during training/inference
  • eval-freq: the interval of batches at which to run evaluation/inference on the validation dataset (currently set to 0 in main_ft.py)
  • checkpoint-freq: the interval of batches at which to save a checkpoint during training
  • checkpoint-dir: the directory path at which to save a checkpoint during training
  • load-checkpoint: (optional) path of the pretrained checkpoint to load for further fine-tuning
  • wandb: boolean that enables logging to Weights & Biases
  • eval-scene: the AI2Thor scene number in the dataset that is held out of the training set for evaluation during k-fold cross validation across scenes
  • split-type: determines the split type (i.e. k-fold by scene, task wise or diversity ablation) between train and evaluation used by the DatasetManager in rt1_dataloader.py
  • num-diversity-scenes: only if split-type is diversity-ablation, this is used to determine the total number of scenes to perform diversity ablation over i.e. maximum of 4 for LaNMP simulation data
  • max-diversity-trajectories: only if split-type is diversity-ablation, this is used to determine the total number of trajectories, which are divided evenly across the num-diversity-scenes scenes
  • train-subbatch: the batch size to use during training/finetuning
  • eval-subbatch: the batch size to use during evaluation
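The argument list above maps naturally onto an argparse parser. The sketch below is illustrative only and is not copied from main.py or main_ft.py: the argument names come from the list, but every type and default is a placeholder assumption, except trajectory-length (6) and eval-freq (0), which the list states. The trajectories_per_scene helper is hypothetical and assumes integer division for "divided evenly".

```python
import argparse


def build_parser():
    """Sketch of a parser for the documented arguments (defaults are placeholders)."""
    p = argparse.ArgumentParser(description="RT-1 training/evaluation arguments (sketch)")
    p.add_argument("--dataset", type=str, default=None)  # main.py only
    p.add_argument("--train-split", type=float, default=0.9)
    p.add_argument("--eval-split", type=float, default=0.1)
    p.add_argument("--epochs", type=int, default=1)
    p.add_argument("--lr", type=float, default=1e-4)
    p.add_argument("--train-batch-size", type=int, default=8)
    p.add_argument("--eval-batch-size", type=int, default=8)
    p.add_argument("--trajectory-length", type=int, default=6)  # documented value
    p.add_argument("--sentence-transformer", type=str, default=None)
    p.add_argument("--device", type=str, default="cuda")
    p.add_argument("--eval-freq", type=int, default=0)  # documented value in main_ft.py
    p.add_argument("--checkpoint-freq", type=int, default=100)
    p.add_argument("--checkpoint-dir", type=str, default="checkpoints")
    p.add_argument("--load-checkpoint", type=str, default=None)
    p.add_argument("--wandb", action="store_true")
    p.add_argument("--eval-scene", type=int, default=None)
    p.add_argument("--split-type", type=str, default=None)
    p.add_argument("--num-diversity-scenes", type=int, default=None)
    p.add_argument("--max-diversity-trajectories", type=int, default=None)
    p.add_argument("--train-subbatch", type=int, default=8)
    p.add_argument("--eval-subbatch", type=int, default=8)
    return p


def trajectories_per_scene(args):
    """Even share of trajectories per scene for a diversity ablation."""
    if args.num_diversity_scenes is None or args.max_diversity_trajectories is None:
        raise ValueError("both diversity arguments are required")
    if not 1 <= args.num_diversity_scenes <= 4:  # at most 4 scenes in the sim data
        raise ValueError("num-diversity-scenes must be between 1 and 4")
    return args.max_diversity_trajectories // args.num_diversity_scenes


args = build_parser().parse_args(
    ["--num-diversity-scenes", "4", "--max-diversity-trajectories", "100"]
)
print(trajectories_per_scene(args))
```

Note that argparse exposes hyphenated flags with underscores, so --train-split is read as args.train_split.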

Checkpoint samples

The following sample checkpoints can be loaded into the RT-1 model. They can be found on the supplementary Google Drive associated with this project:

  • sample_checkpoints/pretrained_bridge: the final checkpoint saved when pretraining the RT-1 model on the Bridge dataset
  • sample_checkpoints/task_gen: the final checkpoint saved after finetuning RT-1 model on the task-wise split for the task generalization experiment
  • sample_checkpoints/kfold_cross_val: the final checkpoints saved after finetuning RT-1 model using k-fold cross validations where each fold represented a held out scene from AI2Thor
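The k-fold setup, in which each fold holds out one AI2Thor scene, can be sketched as a leave-one-scene-out split. The example below is a simplified illustration rather than the DatasetManager logic: it assumes a mapping from trajectory name to scene number is already available, and the trajectory names and scene numbers shown are made up.

```python
def scene_split(trajectory_scenes, eval_scene):
    """Split trajectory names into (train, eval) by holding out one scene."""
    train = [name for name, scene in trajectory_scenes.items() if scene != eval_scene]
    held_out = [name for name, scene in trajectory_scenes.items() if scene == eval_scene]
    return train, held_out


def k_fold_by_scene(trajectory_scenes):
    """Yield (held-out scene, train names, eval names) once per distinct scene."""
    for scene in sorted(set(trajectory_scenes.values())):
        train, held_out = scene_split(trajectory_scenes, scene)
        yield scene, train, held_out


# Hypothetical trajectory-to-scene mapping.
demo = {"traj_a": 1, "traj_b": 1, "traj_c": 2, "traj_d": 3}
folds = list(k_fold_by_scene(demo))
for scene, train, held_out in folds:
    print(scene, train, held_out)
```

Each fold then corresponds to one value of the eval-scene argument, with one fine-tuned checkpoint per held-out scene.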

Additional notes

When running any of the finetuning or pretraining scripts, please ensure the following modules are loaded:

module load cuda/11.8.0-lpttyok
module load cudnn/8.7.0.84-11.8-lg2dpd5

Preliminary

  1. Create a Python virtual environment with Python 3.9.16: python3.9 -m venv rt1_env
  2. Activate the virtual environment using source rt1_env/bin/activate
  3. Install and load the CUDA Toolkit 11.8.0 and cuDNN 8.7.0
  4. cd LaNMP-Dataset/models/main_models/rt1
  5. Load necessary libraries using pip install -e . or directly activate the saved rt1_env folder using source rt1_env/bin/activate (if Python 3.9 is loaded onto your system)

Running Pre-Training

  1. cd LaNMP-Dataset/models/main_models/rt1
  2. Open main.py and modify the load-checkpoint argument to None (since we are pretraining from initialization)
  3. Ensure the checkpoint-dir argument is a known and valid local path (where checkpoints during pretraining will be saved at the checkpoint-freq)
  4. Set all other arguments in main.py
  5. Navigate to LaNMP-Dataset/models/main_models/rt1/rt1_pytorch/tokenizers/action_tokenizer.py
  6. Ensure the action_order and action_space in lines 61 and 62 of action_tokenizer.py fetch from bridge_keys defined in line 56
  7. Run python3 main.py with all arguments input as required
  8. Checkpoints for pretraining should be saved chronologically (by step number) in the checkpoint-dir directory
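Since checkpoints are saved by step number, picking the latest one (or iterating over them in training order) requires a numeric rather than alphabetical sort. The sketch below assumes that each checkpoint filename contains its step number as the last run of digits, for example checkpoint_1000.pt. That naming pattern is a guess for illustration, so check the actual files in checkpoint-dir before relying on it.

```python
import re
from pathlib import Path


def checkpoint_step(path):
    """Extract the step number: the last run of digits in the file stem."""
    numbers = re.findall(r"\d+", Path(path).stem)
    if not numbers:
        raise ValueError(f"no step number found in {path!r}")
    return int(numbers[-1])


def sorted_checkpoints(paths):
    """Order checkpoint paths by training step, earliest first."""
    return sorted(paths, key=checkpoint_step)


# Hypothetical filenames; an alphabetical sort would put 1000 before 200.
demo_paths = ["ckpts/checkpoint_1000.pt", "ckpts/checkpoint_200.pt", "ckpts/checkpoint_50.pt"]
ordered = sorted_checkpoints(demo_paths)
print(ordered[-1])
```

For a real directory, the input list could come from sorted_checkpoints(str(p) for p in Path(checkpoint_dir).iterdir()), where checkpoint_dir is the value passed as checkpoint-dir.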

Running Fine-Tuning

  1. cd LaNMP-Dataset/models/main_models/rt1
  2. Open main_ft.py and modify the load-checkpoint argument to the checkpoint path generated from pretraining or the path where the pretrained checkpoint (from Google Drive) is saved
  3. Ensure the checkpoint-dir argument is a known and valid local path (where checkpoints during finetuning will be saved at the checkpoint-freq)
  4. Set all other arguments in main_ft.py (in particular, split-type defines the type of experiment to be run, i.e. k-fold across scenes, task generalization or diversity ablations)
  5. Navigate to LaNMP-Dataset/models/main_models/rt1/rt1_pytorch/tokenizers/action_tokenizer.py
  6. Ensure the action_order and action_space in lines 61 and 62 of action_tokenizer.py fetch from lanmp_keys defined in line 56
  7. Run python3 main_ft.py with all arguments input as required
  8. Checkpoints for fine-tuning should be saved chronologically (by step number) in the checkpoint-dir directory

Running Inference (on AI2Thor)

  1. cd LaNMP-Dataset/models/main_models/rt1
  2. Open main_ft_eval.py and modify the checkpoint-path argument to the checkpoint path from pretraining, finetuning or one of the pre-saved checkpoints (from Google Drive)
  3. Set all other arguments in main_ft_eval.py (in particular, split-type defines the type of experiment to be run, i.e. k-fold across scenes, task generalization or diversity ablations)
  4. Navigate to LaNMP-Dataset/models/main_models/rt1/rt1_pytorch/tokenizers/action_tokenizer.py
  5. Ensure the action_order and action_space in lines 61 and 62 of action_tokenizer.py fetch from lanmp_keys defined in line 56
  6. Run python3 main_ft_eval.py with all arguments input as required
  7. Evaluation loss logs should be reported on Weights & Biases as well as printed (mean ± std dev) on the terminal
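The "mean ± std dev" summary printed at the end of evaluation can be reproduced from a list of per-batch losses. The sketch below uses the population standard deviation. Whether main_ft_eval.py uses the population or the sample form is not stated here, so treat that choice, the function name, and the example losses as assumptions.

```python
import statistics


def summarize_losses(losses):
    """Format a list of losses as 'mean ± std' with four decimals."""
    mean = statistics.fmean(losses)
    std = statistics.pstdev(losses)  # population std dev (assumed)
    return f"{mean:.4f} ± {std:.4f}"


summary = summarize_losses([1.0, 2.0, 3.0])
print(summary)
```

Swapping statistics.pstdev for statistics.stdev gives the sample standard deviation instead, which is slightly larger for small numbers of batches.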

ALFRED Seq2Seq

The ALFRED Seq2Seq model from the paper "ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks" by Shridhar et al. was modified and fine-tuned on LaNMP. This model was trained and run on an NVIDIA 3090 GPU, so some of the following instructions assume the use of that GPU.

Preliminary:

  1. Create a Python virtual environment using Python 3.9: python3.9 -m venv alfred-env
  2. Activate the virtual environment: source alfred-env/bin/activate
  3. Install and load CUDA Toolkit 11.8 and cuDNN 8.7
  4. cd LaNMP-Dataset/models/main_models
  5. export ALFRED_ROOT=$(pwd)/alfred
  6. cd alfred
  7. Install all dependencies: pip install -r requirements.txt
  8. Download the dataset from the DropBox
  9. Place the zipped dataset files in LaNMP-Dataset/dataset
  10. Unzip the datasets: gunzip *.gz

Running training:

The original pretrained model used for fine-tuning can be downloaded from this Google Drive Folder.

  1. Place the model in LaNMP-Dataset/models/main_models/alfred/pretrained
  2. cd LaNMP-Dataset/models/main_models/alfred
  3. Extract the image features using the ResNet and save them to disk:
python models/utils/extract_resnet.py --gpu
  4. Fine-tune:
python models/train/train_seq2seq.py --model seq2seq_im_mask --dout exp/model:{model}_discrete_relative_fold1 --gpu --batch 8 --pm_aux_loss_wt 0.1 --subgoal_aux_loss_wt 0.1 --pp_data 'data/feats_discrete_relative_fold1' --split_keys 'data/splits/split_keys_discrete_relative_fold1.json' --class_mode --relative --preprocess
  • --class_mode puts the model into classification mode to use cross-entropy loss and output discrete actions
  • --relative makes the model produce relative (delta between current step and next step) actions rather than global actions
  • --preprocess preprocesses the data and saves it on disk to be used for training further down the pipeline. This only needs to be run once; the flag can be removed after the first run to only run the training.
  • More details on all the command-line arguments can be found at LaNMP-Dataset/models/main_models/train/train_seq2seq.py
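To make the --relative flag concrete, the sketch below converts a sequence of global poses into per-step deltas and back. It is a simplified illustration, not the ALFRED training code: poses are treated as plain tuples of numbers, and angle wrap-around and any discretization performed with --class_mode are ignored. The function names and example values are hypothetical.

```python
def to_relative(poses):
    """Turn global poses into deltas between each step and the next one."""
    return [
        tuple(nxt_v - cur_v for cur_v, nxt_v in zip(cur, nxt))
        for cur, nxt in zip(poses, poses[1:])
    ]


def to_global(start, deltas):
    """Rebuild global poses by accumulating deltas from a starting pose."""
    poses = [tuple(start)]
    for delta in deltas:
        poses.append(tuple(p + d for p, d in zip(poses[-1], delta)))
    return poses


# Hypothetical (x, y, z) positions over three steps.
global_poses = [(0, 0, 0), (1, 0, 2), (1, 3, 2)]
deltas = to_relative(global_poses)
print(deltas)
```

A trajectory of N global poses yields N - 1 relative actions, and the original poses can be recovered from the first pose plus the deltas.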

Running inference:

The simulated fine-tuned models can be downloaded from this Google Drive folder.

The simulated extracted ResNet visual features can be downloaded from this Google Drive folder.

  1. Place the model pth files in LaNMP-Dataset/models/main_models/alfred/exp
  2. Place the zipped vision features file in LaNMP-Dataset/models/main_models/alfred/data/vis_feats
  3. Unzip and extract the file tar -xzvf vis_feats.tar.gz
  4. cd LaNMP-Dataset/models/main_models/alfred
  5. Run inference using fold1's fine-tuned model:
python models/eval/eval_seq2seq.py --model_path exp/best_test_fold1.pth --gpu --model models.model.seq2seq_im_mask --pp_data data/feats_discrete_relative_fold1 --split_keys 'data/splits/split_keys_discrete_relative_fold1.json'
  • The command assumes it is run on a machine with a GUI in order to run the AI2THOR simulator, i.e. not on a headless machine.
  • To run other models instead of the "fold1" model, change any part that has "fold1" in the command to the desired model, e.g. "task" for the "best_test_task.pth" model.
  • More details on all the command-line arguments can be found at LaNMP-Dataset/models/main_models/eval/eval_seq2seq.py.
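Switching between fine-tuned models means editing every "fold1" occurrence in the inference command, which is easy to get partly wrong by hand. The sketch below builds the argument list from a single tag. It assumes that the features folder and split-keys file follow the same naming pattern for every tag, as the note above suggests; the helper name is hypothetical.

```python
import shlex


def eval_command(tag="fold1"):
    """Build the inference command for the model identified by `tag`."""
    return [
        "python", "models/eval/eval_seq2seq.py",
        "--model_path", f"exp/best_test_{tag}.pth",
        "--gpu",
        "--model", "models.model.seq2seq_im_mask",
        "--pp_data", f"data/feats_discrete_relative_{tag}",
        "--split_keys", f"data/splits/split_keys_discrete_relative_{tag}.json",
    ]


task_cmd = eval_command("task")
print(shlex.join(task_cmd))
```

The resulting list can be passed to subprocess.run from within the alfred directory, on a machine with a GUI as noted above.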