[Test] The test results are not consistent with those reported in the paper.
aopolin-lv opened this issue · 11 comments
Hi, I tried to reproduce the evaluation via your given command `python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}`. For more detail: in my experiment, I used the `200M.ckpt` checkpoint.
Specifically:
- I executed the command above, using 100 instances per task as the test sample.
- The per-episode success signal is read from `obs, _, done, info = env.step(...)`.
- I got the overall success ratio by averaging the results over partitions L1-L4 (a minimal sketch of this loop follows).
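For concreteness, here is a minimal sketch of the loop I mean. The `policy` object and the `"success"` key are illustrative placeholders, not necessarily the repo's actual API:

```python
# Hedged sketch of the per-task success-counting loop described above.
# `env` and `policy` are hypothetical placeholders, not VIMA's actual API.
def evaluate_task(env, policy, num_episodes=100):
    successes = 0
    for _ in range(num_episodes):
        obs = env.reset()
        done, info = False, {}
        while not done:
            action = policy.act(obs)                  # query the trained model
            obs, _, done, info = env.step(action)     # env reports the outcome
        successes += int(info.get("success", False))  # count successful episodes
    return successes / num_episodes                   # per-task success ratio
```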
However, I found that the results I obtained are far from those in your paper. The following table shows my experimental results; the success ratio is much lower than yours. By the way, the results of L1 and L2 are nearly identical. Is there any bug in my evaluation procedure?
| Task | L1 succ | L1 fail | L2 succ | L2 fail | L3 succ | L3 fail | L4 succ | L4 fail |
|---|---|---|---|---|---|---|---|---|
| Simple Object Manipulation: visual_manipulation | 99 | 1 | 94 | 6 | 100 | 0 | | |
| Simple Object Manipulation: scene_understanding | 100 | 0 | 98 | 2 | 96 | 4 | | |
| Simple Object Manipulation: rotate | 100 | 0 | 100 | 0 | 100 | 0 | | |
| Visual Goal Reaching: rearrange | 49 | 51 | 49 | 51 | 49 | 51 | | |
| Visual Goal Reaching: rearrange_then_restore | 10 | 90 | 12 | 88 | 11 | 89 | | |
| Novel Concept Grounding: novel_adj | 99 | 1 | 100 | 0 | 99 | 1 | | |
| Novel Concept Grounding: novel_noun | 97 | 3 | 97 | 3 | 99 | 1 | | |
| Novel Concept Grounding: novel_adj_and_noun | | | | | | | 98 | 2 |
| Novel Concept Grounding: twist | 1 | 99 | 4 | 96 | 0 | 100 | | |
| One-shot Video Imitation: follow_motion | | | | | | | 0 | 100 |
| One-shot Video Imitation: follow_order | 44 | 56 | 45 | 55 | 47 | 53 | | |
| Visual Constraint Satisfaction: sweep_without_exceeding | 67 | 33 | 67 | 33 | | | | |
| Visual Constraint Satisfaction: sweep_without_touching | | | | | | | 0 | 100 |
| Visual Reasoning: same_texture | | | | | | | 50 | 50 |
| Visual Reasoning: same_shape | 50 | 50 | 50 | 50 | 50 | 50 | | |
| Visual Reasoning: manipulate_old_neighbor | 47 | 53 | 47 | 53 | 37 | 63 | | |
| Visual Reasoning: pick_in_order_then_restore | 11 | 89 | 10 | 90 | 13 | 87 | | |
| num | 774 | 526 | 773 | 527 | 701 | 499 | 148 | 252 |
| success ratio (%) | 59.54 | | 59.46 | | 58.40 | | 37.00 | |
- An empty cell means `example.py` does not support that task/partition combination.
At the same time, I can't find where Mask R-CNN is used. The bounding boxes are not produced by any model but are given by the env (if I haven't missed anything). Could you provide more details about this?
Hi there, thank you for trying it out. Could you provide more details (e.g., a code snippet) so I can take a look?
@yunfanjiang Maybe I've missed it, but where is the Mask R-CNN model used during online evaluation?
I solved it just now, and I have the same question as @amitkparekh.
@aopolin-lv Can you share some lessons with us, e.g. what needs to be watched out for and handled carefully?
When you ran it, did you literally just run a for loop in bash for each task and partition and dump the metrics to files? The big question here is how much of the code you have changed, if at all.
Yes, I just executed 100 instances of each task in a for loop (sketched below), with almost no changes to the original code.
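Concretely, something like this. The task list is abbreviated and the partition identifiers are placeholders; the real names should match what `scripts/example.py` expects:

```python
# Sketch of the per-(task, partition) loop; the task list is abbreviated
# and the partition names are placeholders for the repo's real identifiers.
import itertools
import subprocess

TASKS = ["visual_manipulation", "rotate", "rearrange"]  # abbreviated
PARTITIONS = ["L1", "L2", "L3", "L4"]                   # placeholder names

for task, partition in itertools.product(TASKS, PARTITIONS):
    subprocess.run(
        [
            "python3", "scripts/example.py",
            "--ckpt=200M.ckpt",
            "--device=cuda:0",
            f"--partition={partition}",
            f"--task={task}",
        ],
        check=True,
    )
```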
Hi Yunfan, I re-implemented the training of VIMA using the VIMA baselines. However, I found it difficult for the model to fit the `pose1_rotation` attribute. Did you encounter this problem, and could you give me any suggestions?
Hi @aopolin-lv, thanks for the followup. We directly read off segm masks from sim in this script for demo purposes.
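Conceptually it is just this kind of lookup (a hedged sketch; the function name and obs layout here are illustrative, not the exact script code):

```python
# Illustrative sketch: derive per-object bounding boxes directly from a
# simulator-provided integer segm mask, so no Mask R-CNN detector is needed.
import numpy as np

def bboxes_from_segm(segm):
    """Map each object id in the mask to an (x_min, y_min, x_max, y_max) box."""
    bboxes = {}
    for obj_id in np.unique(segm):
        ys, xs = np.nonzero(segm == obj_id)
        bboxes[int(obj_id)] = (int(xs.min()), int(ys.min()),
                               int(xs.max()), int(ys.max()))
    return bboxes
```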
During training, we masked the rotation loss contributed by tasks other than Rotate and Twist. Since object orientation only matters in these two tasks, optimizing the rotation action head would otherwise be dominated by the other tasks.
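Schematically, the masking looks roughly like this (tensor and task names are illustrative placeholders, not the actual training code):

```python
# Illustrative sketch of masking the rotation loss for all tasks except
# Rotate and Twist; all names here are placeholders, not the real code.
import torch

ROTATION_TASKS = {"rotate", "twist"}  # orientation only matters in these tasks

def action_loss(pos_loss, rot_loss, task_names):
    # per-sample mask: 1.0 for rotate/twist episodes, 0.0 for everything else
    mask = torch.tensor([float(t in ROTATION_TASKS) for t in task_names],
                        device=rot_loss.device)
    # position loss always contributes; rotation loss only where it matters
    return pos_loss.mean() + (rot_loss * mask).sum() / mask.sum().clamp(min=1.0)
```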
I'll close this issue for now. Feel free to let me know if you have further questions.
@yunfanjiang I don't think this issue should be closed as completed, as it has not been solved?
Thank you. With your advice, I have trained the model successfully. However, its performance is very poor on multi-step tasks such as `rearrange_then_restore`, `pick_in_order_then_restore`, `follow_order`, `manipulate_old_neighbor`, and `rearrange`: it is 20%-60%+ below the original results reported in the paper, while the other tasks are normal. What can I do about it?
@aopolin-lv Hi! Can you share the training code with me? I want to reproduce these baselines (VIMA-GPT, VIMA-Flamingo). Thanks a lot.