Anwen Hu, Shizhe Chen, Liang Zhang, Qin Jin
- Refer to kubric to prepare the basic experimental environment with 'docker pull kubricdockerhub/kubruntu'. (Note: this Docker image may not support GPUs, see google-research/kubric#224. For GPU support, you can try 'docker pull ganow/kubric:latest'.)
- python 3.9.5, torch 1.12, mmcv 1.6.0
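A quick way to confirm that the installed packages match the versions listed above is sketched below. This is only an illustrative helper, not part of the ET-Cap code: the function name `check_versions` is hypothetical, and the distribution name `mmcv` is an assumption (some installations register it as `mmcv-full`).

```python
from importlib import metadata

# Versions listed above. "mmcv" is an assumed distribution name; some installs
# register the package as "mmcv-full" instead.
EXPECTED = {"torch": "1.12", "mmcv": "1.6.0"}


def check_versions(expected, get_version=metadata.version):
    """Return {package: (wanted, installed_or_None, matches)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        matches = False
        if installed is not None:
            # Drop local build tags such as "+cu113" before comparing.
            base = installed.split("+")[0]
            matches = base == wanted or base.startswith(wanted + ".")
        report[name] = (wanted, installed, matches)
    return report
```

Calling `check_versions(EXPECTED)` inside the container returns one entry per package, so a mismatch or a missing package shows up before any training script is launched.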
Download ETCAP from
- Baidu Cloud Driver (pwd: rsuh): assets+scenes+annotation
- OneDriver: annotation (assets and scenes are coming soon...)
including:
- 3D assets (zip ~64G, unzip ~142G): the 3D assets used in ET-Cap come from ShapeNet and GSO. Before downloading them, please make sure you have accepted the license from shapenet.org.
cd ETCAP
unzip kubric_asstets.zip
- 3D scenes: zip ~165G, unzip ~300G
cat scenes_split* > scenes.zip
unzip scenes.zip
unzip scenes_redo.zip
- ET-Cap annotations: ~558MB, including:
- only navigation data: navigation_v1_{split}_{subset}.json
format:
list of {
    'scene_id': string,
    'naivgation_data': list of {
        'pathid': string,
        'render_dir': string,
        'images': list of rendered image names,
        'actions': list,
        'path': list of position ids (int),
        'positions': list of grid-level position coordinates,
        # after skipping some points in the path
        'shorter_images': list of rendered image names,
        'shorter_actions': list,
        'shorter_path': list of grid-level position coordinates,
        'shorter_positions': list of grid-level position coordinates,
    }
}
- both navigation and caption data: navicaption_v1_{split}_{subset}.json
format:
list of {
    'scene_id': string,
    'naivgation_data': list of {
        # same fields as 'naivgation_data' above, plus 'final_view_captions'
        'final_view_captions': list of captions of the trajectory
    },
    'scene_captions': list of captions of the scene
}
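The annotation layout described above can be walked with a few lines of Python. The following is a minimal sketch, assuming the files are plain JSON and use exactly the keys listed in the format descriptions (including the 'naivgation_data' spelling). The helper names `load_annotations`, `iter_trajectories` and `summarize` are hypothetical and not part of the ET-Cap code.

```python
import json


def load_annotations(path):
    """Load one annotation file (assumed to be plain JSON) into a list of scenes."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def iter_trajectories(scenes):
    """Yield (scene_id, trajectory) pairs from a list of scene entries."""
    for scene in scenes:
        # The key is spelled 'naivgation_data' in the format description.
        for traj in scene["naivgation_data"]:
            yield scene["scene_id"], traj


def summarize(scenes):
    """Count scenes, trajectories and trajectory-level captions."""
    n_traj = 0
    n_caps = 0
    for _, traj in iter_trajectories(scenes):
        n_traj += 1
        # 'final_view_captions' only exists in the navicaption files.
        n_caps += len(traj.get("final_view_captions", []))
    return {
        "scenes": len(scenes),
        "trajectories": n_traj,
        "final_view_captions": n_caps,
    }
```

For a navigation-only file, `summarize` reports zero captions because those entries carry no 'final_view_captions' field. The path passed to `load_annotations` depends on where the annotations were unpacked.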
- Set {DATASET_DIR} to your own dataset location in the following files:
- ./ET-Cap/CaBOT/mmdetection/mmdet/datasets/embodiedcap.py
- ./ET-Cap/CaBOT/mmdetection/mmdet/models/navigators/kubric_render.py
- ./ET-Cap/CaBOT/mmdetection/mmdet/models/navigators/single_navigator.py
- ./ET-Cap/dataset_construct/path_action_construct.py
- ./ET-Cap/dataset_construct/scene_construct.py
Set {data_root} to your own dataset location in the config files under:
- ./ET-Cap/CaBOT/mmdetection/configs/single_captioner
- ./ET-Cap/CaBOT/mmdetection/configs/single_navigator
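Since the dataset location has to be edited in several files and config directories, a small scan can help locate every line that needs attention. The sketch below is a hypothetical helper (`find_mentions` is not part of the repository). It assumes the lines to edit can be found by searching for the names `DATASET_DIR` and `data_root`, and it only reports matches without modifying any file.

```python
from pathlib import Path

# Names taken from the instructions above; how they appear inside each file
# (variable, placeholder string, ...) is not assumed here.
NAMES = ("DATASET_DIR", "data_root")


def find_mentions(paths, names=NAMES):
    """Return (file, line_number, name) for every line that mentions a name.

    Each entry in `paths` may be a single file or a directory; directories
    are searched recursively for *.py files.
    """
    hits = []
    for entry in paths:
        entry = Path(entry)
        files = sorted(entry.rglob("*.py")) if entry.is_dir() else [entry]
        for file in files:
            if not file.is_file():
                continue
            text = file.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                for name in names:
                    if name in line:
                        hits.append((str(file), lineno, name))
    return hits
```

Passing the five files and two config directories listed above to `find_mentions` gives a checklist of lines to review after editing.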
python scene_construct.py
python path_action_construction.py
The CaBOT code is organized based on mmdetection. The checkpoint of CaBOT (including the Navigator and the Captioner, ~7G) can be downloaded from:
- Baidu Cloud Driver (pwd:i5xi)
- OneDriver
Put models under ./ET-Cap/CaBOT/mmdetection/tools/work_dirs
cd ./mmdetection/tools
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.run --nproc_per_node=1 --master_port 1010 embodied_nav_train.py \
../configs/single_navigator/navigator_region2layer_time1layer_waction.py \
--work-dir {navigator_save_dir} \
--launcher pytorch
## test history-aware navigator
model_dir={navigator_save_dir}
config=navigator_region2layer_time1layer_waction.py
model_name=latest.pth
render_dir={render_save_dir_name}
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 --master_port 1210 embodied_nav_test.py \
../configs/single_navigator/$config \
./work_dirs/$model_dir/$model_name \
--work-dir ./work_dirs/$model_dir/online_metrics/ \
--eval_mode test \
--eval_render_type simulate_render \
--render_save_dir $render_dir \
--eval_set anno/navigation_v1_val.json \
--calculate_metrics_with_saved_result False \
--path_len_weight_eval True \
--launcher pytorch
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.run --nproc_per_node=1 --master_port 1400 embodied_trajcap_train.py \
../configs/single_captioner/captioner_timeglobal_meanview_detrinit.py \
--work-dir {captioner_save_dir} \
--launcher pytorch
Captioning evaluation (with oracle trajectories) on the validation/test set (refer to captioner_dist_test.sh)
model_dir={captioner_save_dir}
config=captioner_timeglobal_meanview_detrinit.py
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 --master_port 1111 embodied_trajcap_test.py \
../configs/single_captioner/$config \
./work_dirs/$model_dir/latest.pth \
--use_val_best_checkpoint True \
--eval_set anno/navicaption_v1_val.json \
--eval CIDEr \
--calculate_metrics_with_saved_result False \
--launcher pytorch
Captioning evaluation (with predicted trajectories) on the validation/test set (refer to captioner_navinfer_dist_test.sh)
- After testing the Navigator on the val/test set, run the following script to convert the navigation results to the input format of the Captioner
# revise the navi_model_dir first
python navi_result_format_transfer.py
- Test the Captioner with trajectories given by the Navigator
# before running the following script,
# revise navi_model_dir in the corresponding config file (suffixed with '_navinfer.py').
config=captioner_timeglobal_meanview_detrinit_navinfer.py
model_dir={captioner_model_dir}
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 --master_port 1000 embodied_trajcap_test.py \
../configs/single_captioner/$config \
./work_dirs/$model_dir/latest.pth \
--use_val_best_checkpoint True \
--eval_set navigation_v1_trajcapinfer_val.json \
--eval CIDEr \
--calculate_metrics_with_saved_result False \
--path_len_weight_eval True \
--launcher pytorch
The original SPICE implementation in pycocoevalcap is not suitable for paragraph-level evaluation. To calculate SPICE for this task, we built a ParagraphSpice project (~2.25G). Download it from:
- Baidu Cloud Driver (pwd:hted)
- OneDriver
# revise caption_model, pred_file and gt_file, then
python embodied_caption_spice_eval.py
If you find this code useful for your research, please consider citing:
@inproceedings{hu2023explore,
title={Explore and Tell: Embodied Visual Captioning in 3D Environments},
author={Hu, Anwen and Chen, Shizhe and Zhang, Liang and Jin, Qin},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={2482--2491},
year={2023}
}