TencentARC/ViT-Lens

reproduce evaluation results

waterljwant opened this issue · 3 comments

Hi,
Thank you for the great open-source work.

However, I am currently facing some difficulties in reproducing the evaluation results, particularly regarding the scene classification on NYU-D and SUN-D. I have attached the results I obtained after executing the provided script.
Could you please assist me in identifying any possible steps or details that I might have missed, leading to this inconsistency in accuracy?
[image: evaluation results obtained with the provided script]

Hi,

To reproduce the results on NYU-D and SUN-D:

  1. Follow the inference instructions: download the vitlensL-depth checkpoint, then run the command below.
    cd vitlens/
    # you may change the path accordingly
    torchrun --nproc_per_node=1 ./src/training/depth/depth_tri_main.py \
      --cache_dir /path_to/cache \
      --val-data sun-rgbd::nyu-depth-v2-val1::nyu-depth-v2-val2 \
      --visual_modality_type depth --dataset-type depth --v_key depth \
      --n_tower 3 \
      --use_perceiver  --perceiver_cross_dim_head 64 --perceiver_latent_dim 1024 --perceiver_latent_dim_head 64 --perceiver_latent_heads 16 \
      --perceiver_num_latents 256 --perceiver_as_identity \
      --use_visual_adapter \
      --batch-size 64 \
      --lock-image --lock-text --lock-visual --unlock-trans-first-n-layers 4 \
      --model ViT-L-14 --pretrained datacomp_xl_s13b_b90k \
      --name depth/inference_vitlensL_perf \
      --resume /path_to/vitlensL_depth.pt
  2. We follow ImageBind for data preprocessing (converting depth to disparity); please make sure you apply the same operation. See here. I also uploaded a copy here.
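
For readers unfamiliar with the depth-to-disparity step mentioned above, here is a minimal sketch of the standard stereo relation, disparity = baseline * focal_length / depth. It is not the ImageBind or ViT-Lens code: the function name, the parameter names, the assumption that raw depth is stored in millimeters, and the clipping range are all illustrative assumptions. Use the linked script for the actual preprocessing.

```python
import numpy as np


def depth_to_disparity(depth_mm, focal_length_px, baseline_m,
                       min_depth_m=0.01, max_depth_m=50.0):
    """Convert a raw depth map to disparity (hypothetical helper).

    Assumptions (not taken from the thread): depth is stored in
    millimeters, and depth is clipped to [min_depth_m, max_depth_m]
    so that zero or missing depth does not cause division by zero.
    """
    # Convert millimeters to meters.
    depth_m = np.asarray(depth_mm, dtype=np.float32) / 1000.0
    # Clip invalid (zero) and very large depth values.
    depth_m = np.clip(depth_m, min_depth_m, max_depth_m)
    # Standard stereo relation: disparity = baseline * focal / depth.
    return baseline_m * focal_length_px / depth_m


# Example call with made-up sensor parameters (illustrative values only).
example_depth = np.array([[1000, 2000], [0, 100000]], dtype=np.uint32)
example_disparity = depth_to_disparity(example_depth,
                                       focal_length_px=500.0,
                                       baseline_m=0.1)
```

The focal length and baseline depend on the sensor that captured each image, so a mismatch there, or feeding raw depth instead of disparity, would be one plausible source of an accuracy gap.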

If you still cannot reproduce the results (Table 5 in the paper), please share your environment setup so that I can look into it.

[image: evaluation results]

Btw, these are the results from my side following the installation setup (pytorch==1.11.0), for your reference.

@StanLei52 Thank you! I found that I had mistakenly used different depth data. After switching to the directory used in this code, depth_dir = os.path.join(path, "depth_bfx"), the accuracy is consistent.
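
For anyone hitting the same mismatch, a small check like the following can catch it early. This is only a sketch: `resolve_depth_dir` and its `subdir` parameter are hypothetical names, and the only detail taken from the thread is that the depth files are read from a `depth_bfx` subdirectory of each sample path.

```python
import os


def resolve_depth_dir(path, subdir="depth_bfx"):
    """Return the depth directory for one sample, failing loudly if absent.

    Hypothetical helper: mirrors the os.path.join(path, "depth_bfx")
    line quoted above and adds an existence check.
    """
    depth_dir = os.path.join(path, subdir)
    if not os.path.isdir(depth_dir):
        raise FileNotFoundError(
            f"Expected depth data in {depth_dir!r}; "
            "check that the dataset uses the 'depth_bfx' folder."
        )
    return depth_dir
```

Failing on a missing directory, rather than silently falling back to another depth folder, makes it obvious when the evaluation is about to run on different depth data.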