[Test] The test results are not consistent with those reported in the paper.
aopolin-lv opened this issue · 11 comments
Hi, I tried to reproduce the evaluation via your given command `python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}`. For more detail: in my experiment, I used the `200M.ckpt` checkpoint.
Specifically:
- I executed the command above, using 100 instances per task as the test sample.
- The per-episode success signal is read from `obs, _, done, info = env.step(...)`.
- I got the overall success ratio by averaging the results over partitions L1-L4 (a minimal sketch of this loop follows).
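For concreteness, here is a minimal sketch of the loop I mean. The `policy` object and the `"success"` key are illustrative placeholders, not necessarily the repo's actual API:

```python
# Hedged sketch of the per-task success-counting loop described above.
# `env` and `policy` are hypothetical placeholders, not VIMA's actual API.
def evaluate_task(env, policy, num_episodes=100):
    successes = 0
    for _ in range(num_episodes):
        obs = env.reset()
        done, info = False, {}
        while not done:
            action = policy.act(obs)                  # query the trained model
            obs, _, done, info = env.step(action)     # env reports the outcome
        successes += int(info.get("success", False))  # count successful episodes
    return successes / num_episodes                   # per-task success ratio
```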
However, I found that the results I obtained are far from those in your paper. The following table shows my experimental results; the success ratio is much lower than yours. By the way, the results of L1 and L2 are nearly identical. Is there any bug in my evaluation procedure?
| Task | L1 succ | L1 fail | L2 succ | L2 fail | L3 succ | L3 fail | L4 succ | L4 fail |
|---|---|---|---|---|---|---|---|---|
| Simple Object Manipulation: visual_manipulation | 99 | 1 | 94 | 6 | 100 | 0 | | |
| Simple Object Manipulation: scene_understanding | 100 | 0 | 98 | 2 | 96 | 4 | | |
| Simple Object Manipulation: rotate | 100 | 0 | 100 | 0 | 100 | 0 | | |
| Visual Goal Reaching: rearrange | 49 | 51 | 49 | 51 | 49 | 51 | | |
| Visual Goal Reaching: rearrange_then_restore | 10 | 90 | 12 | 88 | 11 | 89 | | |
| Novel Concept Grounding: novel_adj | 99 | 1 | 100 | 0 | 99 | 1 | | |
| Novel Concept Grounding: novel_noun | 97 | 3 | 97 | 3 | 99 | 1 | | |
| Novel Concept Grounding: novel_adj_and_noun | | | | | | | 98 | 2 |
| Novel Concept Grounding: twist | 1 | 99 | 4 | 96 | 0 | 100 | | |
| One-shot Video Imitation: follow_motion | | | | | | | 0 | 100 |
| One-shot Video Imitation: follow_order | 44 | 56 | 45 | 55 | 47 | 53 | | |
| Visual Constraint Satisfaction: sweep_without_exceeding | 67 | 33 | 67 | 33 | | | | |
| Visual Constraint Satisfaction: sweep_without_touching | | | | | | | 0 | 100 |
| Visual Reasoning: same_texture | | | | | | | 50 | 50 |
| Visual Reasoning: same_shape | 50 | 50 | 50 | 50 | 50 | 50 | | |
| Visual Reasoning: manipulate_old_neighbor | 47 | 53 | 47 | 53 | 37 | 63 | | |
| Visual Reasoning: pick_in_order_then_restore | 11 | 89 | 10 | 90 | 13 | 87 | | |
| num | 774 | 526 | 773 | 527 | 701 | 499 | 148 | 252 |
| success ratio (%) | 59.54 | | 59.46 | | 58.40 | | 37.00 | |
- An empty cell means `example.py` does not support that task/partition combination.
At the same time, I can't find where Mask R-CNN is used. The bounding boxes are not produced by any model but are given by the env (if I haven't missed anything). Could you provide more details about this?
Hi there, thank you for trying it out. Could you provide more details (e.g., a code snippet) so I can take a look?
@yunfanjiang Maybe I've missed it, but where is the Mask R-CNN model used during online evaluation?
I solved it just now, and I have the same question as @amitkparekh.
@aopolin-lv Can you share some lessons with us, e.g. what needs to be watched out for and handled carefully?
When you ran it, did you literally just run a for loop in bash for each task and partition and dump the metrics to files? The big question here is how much of the code you have changed, if at all.
Yes, I just executed 100 instances of each task in a for loop (sketched below), with almost no changes to the original code.
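Concretely, something like this. The task list is abbreviated and the partition identifiers are placeholders; the real names should match what `scripts/example.py` expects:

```python
# Sketch of the per-(task, partition) loop; the task list is abbreviated
# and the partition names are placeholders for the repo's real identifiers.
import itertools
import subprocess

TASKS = ["visual_manipulation", "rotate", "rearrange"]  # abbreviated
PARTITIONS = ["L1", "L2", "L3", "L4"]                   # placeholder names

for task, partition in itertools.product(TASKS, PARTITIONS):
    subprocess.run(
        [
            "python3", "scripts/example.py",
            "--ckpt=200M.ckpt",
            "--device=cuda:0",
            f"--partition={partition}",
            f"--task={task}",
        ],
        check=True,
    )
```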
Hi Yunfan, I re-implemented the training of VIMA using the VIMA baselines. However, I found it difficult for the model to fit the `pose1_rotation` attribute. Did you encounter this problem, and could you give me any suggestions?
Hi @aopolin-lv, thanks for the followup. We directly read off segm masks from sim in this script for demo purposes.
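Conceptually it is just this kind of lookup (a hedged sketch; the function name and obs layout here are illustrative, not the exact script code):

```python
# Illustrative sketch: derive per-object bounding boxes directly from a
# simulator-provided integer segm mask, so no Mask R-CNN detector is needed.
import numpy as np

def bboxes_from_segm(segm):
    """Map each object id in the mask to an (x_min, y_min, x_max, y_max) box."""
    bboxes = {}
    for obj_id in np.unique(segm):
        ys, xs = np.nonzero(segm == obj_id)
        bboxes[int(obj_id)] = (int(xs.min()), int(ys.min()),
                               int(xs.max()), int(ys.max()))
    return bboxes
```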
During training, we masked the rotation loss contributed by tasks other than Rotate and Twist. Since object orientation only matters in these two tasks, optimizing the rotation action head would otherwise be dominated by the other tasks.
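Schematically, the masking looks roughly like this (tensor and task names are illustrative placeholders, not the actual training code):

```python
# Illustrative sketch of masking the rotation loss for all tasks except
# Rotate and Twist; all names here are placeholders, not the real code.
import torch

ROTATION_TASKS = {"rotate", "twist"}  # orientation only matters in these tasks

def action_loss(pos_loss, rot_loss, task_names):
    # per-sample mask: 1.0 for rotate/twist episodes, 0.0 for everything else
    mask = torch.tensor([float(t in ROTATION_TASKS) for t in task_names],
                        device=rot_loss.device)
    # position loss always contributes; rotation loss only where it matters
    return pos_loss.mean() + (rot_loss * mask).sum() / mask.sum().clamp(min=1.0)
```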
I'll close this issue for now. Feel free to let me know if you have further questions.
@yunfanjiang I don't think this issue should be closed as completed, as it has not been solved?
Thank you. With your advice, I have trained the model successfully. However, its performance is very poor on multi-step tasks such as `rearrange_then_restore`, `pick_in_order_then_restore`, `follow_order`, `manipulate_old_neighbor`, and `rearrange`: it is 20%-60%+ below the original results reported in the paper, while the other tasks are normal. What can I do about it?
@aopolin-lv Hi! Can you share the training code with me? I want to reproduce these baselines (VIMA-GPT, VIMA-Flamingo). Thanks a lot.