

Explore and Tell: Embodied Visual Captioning in 3D Environments (ICCV 2023)

Anwen Hu, Shizhe Chen, Liang Zhang, Qin Jin


  • Refer to kubric to prepare the basic experimental environment: 'docker pull kubricdockerhub/kubruntu'. (Note: this Docker image may not support GPU, see google-research/kubric#224. For GPU support, you can try 'docker pull ganow/kubric:latest'.)

  • python 3.9.5, torch 1.12, mmcv 1.6.0
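One way to confirm that the installed packages match the versions listed above is to query the package metadata. The following is a minimal sketch: the helper name `check_versions` is hypothetical, and the distribution names `torch` and `mmcv-full` are assumptions (mmcv may be installed under a different distribution name in your environment).

```python
import sys
from importlib import metadata


def check_versions(expected):
    """Return {package: (expected_prefix, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, prefix in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed under this distribution name.
            installed = None
        if installed is None or not installed.startswith(prefix):
            mismatches[name] = (prefix, installed)
    return mismatches


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    # Distribution names below are assumptions; adjust to your installation.
    wanted = {"torch": "1.12", "mmcv-full": "1.6.0"}
    for name, (want, got) in check_versions(wanted).items():
        print(f"{name}: expected {want}*, found {got}")
```

An empty result from `check_versions` means every listed package was found with a matching version prefix.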

ET-CAP Download

Download ETCAP from


  • 3D assets (zip ~64G, unzipped ~142G): the 3D assets used in ET-Cap come from ShapeNet and GSO. Before downloading the 3D assets, please make sure you have accepted the license from shapenet.org.
    cd ETCAP
    unzip kubric_asstets.zip
  • 3D scenes: zip ~165G, unzipped ~300G
    cat scenes_split* > scenes.zip
    unzip scenes.zip
    unzip scenes_redo.zip
  • ET-Cap annotations: ~558MB, including:
    • only navigation data: navigation_v1_{split}_{subset}.json


      list of {
          'scene_id': string, 
          'naivgation_data': list of {
              'pathid': string, 
              'render_dir': string, 
              'images': list of rendered image names, 'actions': list, 
              'path': list of position ids (int),
              'positions': list of grid-level position coordinates, 
              # after skipping some points in the path
              'shorter_images': list of rendered image names,
              'shorter_actions': list,
              'shorter_path': list of grid-level position coordinates,
              'shorter_positions': list of grid-level position coordinates,
          }
      }
    • both navigation and caption data: navicaption_v1_{split}_{subset}.json


      list of {
          'scene_id': string, 
          'naivgation_data': list of {
              # same fields as 'naivgation_data' above, plus 'final_view_captions'
              'final_view_captions': list of captions of the trajectory
          },
          'scene_captions': list of captions of the scene
      }
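The annotation layout described above can be traversed with a few lines of Python. This is a minimal sketch that assumes the files follow the schema exactly as listed, including the key spelling `naivgation_data`; the function names `load_annotations` and `summarize_annotations` are hypothetical.

```python
import json


def load_annotations(path):
    """Read one annotation file (a JSON list of scene entries)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def summarize_annotations(scenes):
    """Count scenes, trajectories, actions and trajectory captions."""
    stats = {"scenes": 0, "trajectories": 0, "actions": 0, "captions": 0}
    for scene in scenes:
        stats["scenes"] += 1
        # Key spelling follows the schema as documented.
        for traj in scene.get("naivgation_data", []):
            stats["trajectories"] += 1
            stats["actions"] += len(traj.get("actions", []))
            # 'final_view_captions' is listed only for navicaption_v1_* files.
            stats["captions"] += len(traj.get("final_view_captions", []))
    return stats
```

For example, `summarize_annotations(load_annotations(path))`, with `path` pointing at one of the navicaption_v1_{split}_{subset}.json files, gives a quick sanity check that the download is complete and parses correctly.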

Change {DATASET_DIR} to your own dataset location in the following files:

  • ./ET-Cap/CaBOT/mmdetection/mmdet/datasets/embodiedcap.py
  • ./ET-Cap/CaBOT/mmdetection/mmdet/models/navigators/kubric_render.py
  • ./ET-Cap/CaBOT/mmdetection/mmdet/models/navigators/single_navigator.py
  • ./ET-Cap/dataset_construct/path_action_construct.py
  • ./ET-Cap/dataset_construct/scene_construct.py
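Because the same substitution has to be made in five files, it can be scripted. The sketch below assumes the files contain the literal text `{DATASET_DIR}`; if the placeholder is written differently in your checkout, pass a different `placeholder`. The function name `replace_placeholder` and the example dataset path are hypothetical.

```python
from pathlib import Path

# Paths as listed above, relative to the directory that contains ET-Cap.
FILES = [
    "./ET-Cap/CaBOT/mmdetection/mmdet/datasets/embodiedcap.py",
    "./ET-Cap/CaBOT/mmdetection/mmdet/models/navigators/kubric_render.py",
    "./ET-Cap/CaBOT/mmdetection/mmdet/models/navigators/single_navigator.py",
    "./ET-Cap/dataset_construct/path_action_construct.py",
    "./ET-Cap/dataset_construct/scene_construct.py",
]


def replace_placeholder(paths, value, placeholder="{DATASET_DIR}"):
    """Replace `placeholder` with `value` in each existing file.

    Returns the list of files that were actually modified.
    """
    changed = []
    for p in map(Path, paths):
        if not p.is_file():
            continue  # skip paths that do not exist in this checkout
        text = p.read_text(encoding="utf-8")
        if placeholder in text:
            p.write_text(text.replace(placeholder, value), encoding="utf-8")
            changed.append(str(p))
    return changed
```

A call such as `replace_placeholder(FILES, "/data/ETCAP")` (with a hypothetical dataset path) edits the files in place, so it is worth reviewing the result with `git diff` afterwards.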

Change {data_root} to your own dataset location in the config files in:

  • ./ET-Cap/CaBOT/mmdetection/configs/single_captioner
  • ./ET-Cap/CaBOT/mmdetection/configs/single_navigator

Create Your Own Scenes or Training Trajectories

3D Scenes Simulation

python scene_construct.py

Trajectory Generation

python path_action_construct.py


The CaBOT code is organized based on mmdetection. The checkpoint of CaBOT (including the Navigator and the Captioner, ~7G) can be downloaded from:

Put models under ./ET-Cap/CaBOT/mmdetection/tools/work_dirs

cd ./mmdetection/tools

History-aware Navigator

Train (refer to navigator_train.sh)

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.run --nproc_per_node=1 --master_port 1010 embodied_nav_train.py \
../configs/single_navigator/navigator_region2layer_time1layer_waction.py \
--work-dir {navigator_save_dir} \
--launcher pytorch

Evaluate on validation/test set (refer to navigator_dist_test.sh)

## test history-aware navigator

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 --master_port 1210 embodied_nav_test.py \
../configs/single_navigator/$config \
./work_dirs/$model_dir/$model_name \
--work-dir ./work_dirs/$model_dir/online_metrics/ \
--eval_mode test \
--eval_render_type simulate_render \
--render_save_dir $render_dir \
--eval_set anno/navigation_v1_val.json \
--calculate_metrics_with_saved_result False \
--path_len_weight_eval True \
--launcher pytorch
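The command above relies on the shell variables `$config`, `$model_dir`, `$model_name` and `$render_dir`, which must be set before it runs. As an illustration, the same argument list can be assembled in Python. This is only a sketch: the helper name `build_nav_test_cmd` and the example values for the model directory, checkpoint name and render directory are hypothetical, while the flags are copied from the command above.

```python
import shlex


def build_nav_test_cmd(config, model_dir, model_name, render_dir,
                       eval_set="anno/navigation_v1_val.json",
                       nproc=4, master_port=1210):
    """Assemble the navigator evaluation command as an argument list."""
    return [
        "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc}", "--master_port", str(master_port),
        "embodied_nav_test.py",
        f"../configs/single_navigator/{config}",
        f"./work_dirs/{model_dir}/{model_name}",
        "--work-dir", f"./work_dirs/{model_dir}/online_metrics/",
        "--eval_mode", "test",
        "--eval_render_type", "simulate_render",
        "--render_save_dir", render_dir,
        "--eval_set", eval_set,
        "--calculate_metrics_with_saved_result", "False",
        "--path_len_weight_eval", "True",
        "--launcher", "pytorch",
    ]


cmd = build_nav_test_cmd(
    config="navigator_region2layer_time1layer_waction.py",
    model_dir="my_navigator",          # hypothetical directory name
    model_name="latest.pth",           # hypothetical checkpoint name
    render_dir="/tmp/etcap_render",    # hypothetical output directory
)
print(shlex.join(cmd))  # shell-quoted form, for inspection or logging
```

The list can be passed to `subprocess.run(cmd, check=True)` from ./mmdetection/tools, with `CUDA_VISIBLE_DEVICES` set in the environment to match `nproc`.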

Trajectory-aware Captioner

Train (refer to captioner_train.sh)

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.run --nproc_per_node=1 --master_port 1400 embodied_trajcap_train.py \
../configs/single_captioner/captioner_timeglobal_meanview_detrinit.py \
--work-dir {captioner_save_dir} \
--launcher pytorch

Captioning evaluation (with oracle trajectories) on validation/test set (refer to captioner_dist_test.sh)


CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 --master_port 1111 embodied_trajcap_test.py \
../configs/single_captioner/$config \
./work_dirs/$model_dir/latest.pth \
--use_val_best_checkpoint True \
--eval_set anno/navicaption_v1_val.json \
--eval CIDEr \
--calculate_metrics_with_saved_result False \
--launcher pytorch

Captioning evaluation (with predicted trajectories) on validation/test set (refer to captioner_navinfer_dist_test.sh)

  • After testing the Navigator on the val/test set, run the following script to convert the navigation results to the input format of the Captioner
# revise the navi_model_dir first
python navi_result_format_transfer.py
  • Test the Captioner with trajectories given by the Navigator
# before running the following script, 
# revise navi_model_dir in the corresponding config file (suffixed with '_navinfer.py'). 

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 --master_port 1000 embodied_trajcap_test.py \
../configs/single_captioner/$config \
./work_dirs/$model_dir/latest.pth \
--use_val_best_checkpoint True \
--eval_set navigation_v1_trajcapinfer_val.json \
--eval CIDEr \
--calculate_metrics_with_saved_result False \
--path_len_weight_eval True \
--launcher pytorch

Spice Calculation

The raw Spice in pycocoevalcap is not suitable for paragraph evaluation. To calculate Spice for this task, we build a ParagraphSpice project (~2.25G). Download it from:

# revise caption_model, pred_file and gt_file, then
python embodied_caption_spice_eval.py


