vimalabs/VIMABench

[Test] The test result is not consistent with that reported in the paper.

Closed this issue · 2 comments

Hi, I tried to replicate the evaluation via your given command `python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}`. For more details: in my experiment, I used the 200M.ckpt checkpoint.

Specifically,

  1. I executed the command mentioned above, using 100 instances per task as the test sample.
  2. The success signal was obtained from `obs, _, done, info = env.step(...)`.
  3. I computed the success ratio by averaging the results across L1-L4.
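Step 3 can be sketched as the following aggregation. The (succ, fail) counts below are copied from the L1 column of my table; the per-episode bookkeeping via `env.step` is omitted here:

```python
# Per-task (succ, fail) counts out of 100 episodes each (L1 column of the table).
l1_counts = [
    (99, 1), (100, 0), (100, 0),   # visual_manipulation, scene_understanding, rotate
    (49, 51), (10, 90),            # rearrange, rearrange_then_restore
    (99, 1), (97, 3), (1, 99),     # novel_adj, novel_noun, twist
    (44, 56), (67, 33),            # follow_order, sweep_without_exceeding
    (50, 50), (47, 53), (11, 89),  # same_shape, manipulate_old_neighbor, pick_in_order_then_restore
]

succ = sum(s for s, _ in l1_counts)   # total successes
fail = sum(f for _, f in l1_counts)   # total failures
ratio = 100 * succ / (succ + fail)    # success ratio over all episodes
print(f"{succ} / {succ + fail} -> {ratio:.2f}%")  # 774 / 1300 -> 59.54%
```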

However, the results I obtained are far from those in your paper. The following table shows my experimental results; the success ratio is much lower than yours.
By the way, the L1 and L2 results are suspiciously similar. Is there a bug in my test procedure?

| Task | L1 succ | L1 fail | L2 succ | L2 fail | L3 succ | L3 fail | L4 succ | L4 fail |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Simple Object Manipulation:visual_manipulation | 99 | 1 | 94 | 6 | 100 | 0 | | |
| Simple Object Manipulation:scene_understanding | 100 | 0 | 98 | 2 | 96 | 4 | | |
| Simple Object Manipulation:rotate | 100 | 0 | 100 | 0 | 100 | 0 | | |
| Visual Goal Reaching:rearrange | 49 | 51 | 49 | 51 | 49 | 51 | | |
| Visual Goal Reaching:rearrange_then_restore | 10 | 90 | 12 | 88 | 11 | 89 | | |
| Novel Concept Grounding:novel_adj | 99 | 1 | 100 | 0 | 99 | 1 | | |
| Novel Concept Grounding:novel_noun | 97 | 3 | 97 | 3 | 99 | 1 | | |
| Novel Concept Grounding:novel_adj_and_noun | | | | | | | 98 | 2 |
| Novel Concept Grounding:twist | 1 | 99 | 4 | 96 | 0 | 100 | | |
| One-shot Video Imitation:follow_motion | | | | | | | 0 | 100 |
| One-shot Video Imitation:follow_order | 44 | 56 | 45 | 55 | 47 | 53 | | |
| Visual Constraint Satisfaction:sweep_without_exceeding | 67 | 33 | 67 | 33 | | | | |
| Visual Constraint Satisfaction:sweep_without_touching | | | | | | | 0 | 100 |
| Visual Reasoning:same_texture | | | | | | | 50 | 50 |
| Visual Reasoning:same_shape | 50 | 50 | 50 | 50 | 50 | 50 | | |
| Visual Reasoning:manipulate_old_neighbor | 47 | 53 | 47 | 53 | 37 | 63 | | |
| Visual Reasoning:pick_in_order_then_restore | 11 | 89 | 10 | 90 | 13 | 87 | | |
| num | 774 | 526 | 773 | 527 | 701 | 499 | 148 | 252 |
| success ratio | 59.54 | | 59.46 | | 58.4 | | 0.37 | |

  • Empty cells denote tasks that example.py does not support at that level.

At the same time, I don't find any usage of Mask R-CNN. The bounding boxes are not recognized by any model but are given by the env (unless I missed something). Could you provide more details about this?
In addition, the dimension of the action pos0_position in the given training data is different from that in the test environment: the former is 3 while the latter is 2. I'm curious how I can convert the training action space to the test action space.
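For concreteness, here is a minimal sketch of the kind of conversion I mean, under the assumption that the extra training dimension is a height (z) component and the test env only consumes the tabletop (x, y) components. The function name is hypothetical, not part of the VIMA API:

```python
def train_to_env_position(pos0_position):
    """Hypothetical conversion: drop the trailing (z) component of a 3-D
    training-data position, keeping (x, y) for the env's 2-D action space."""
    return pos0_position[:2]

# Example: a 3-D training position mapped into a 2-D env position.
print(train_to_env_position([0.25, -0.10, 0.05]))  # [0.25, -0.1]
```

Is this the intended mapping, or does the conversion require something else (e.g. discretization)?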

Closed as duplicate here.