OpenGVLab/Instruct2Act

Incorrect coordinate detected for base_obj

Breezewrf opened this issue · 3 comments

Hello,

Thank you for your awesome work!
I am running into a problem where dragged_obj_1 cannot be placed onto base_obj correctly. I have checked the output of the CLIPRetrieval() function: in my testing, the coordinate of dragged_obj_1 is 2 and that of base_obj is 3, and sometimes they are both 2. Could you give me some suggestions about what causes this problem? Maybe I should try a larger model? I use sam_vit_b_01ec64 to save memory, but I do not think the model is the problem, since the robot picks up dragged_obj_1 accurately every time; it just puts it down at the same position rather than at the position of base_obj.
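For what it's worth, here is a rough, self-contained sketch of how the per-object similarities can be inspected with OpenCLIP outside the repo. The crop file names and text prompts below are placeholders, not the actual CLIPRetrieval() inputs:

```python
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k", device=device
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Placeholder crops of the SAM-segmented objects and candidate text prompts.
crops = [preprocess(Image.open(p)) for p in ["obj_crop_0.png", "obj_crop_1.png"]]
texts = tokenizer(["the dragged object", "the base object"]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(torch.stack(crops).to(device))
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = txt_feat @ img_feat.T  # one row per prompt, one column per crop

print(sims)                 # raw cosine similarities
print(sims.argmax(dim=-1))  # which crop each prompt retrieves
```

If the two prompts keep retrieving the same crop index, the problem is on the retrieval side rather than in the picking motion.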

Looking forward to your reply!

Hi, thanks for your interest in our project.

I would suggest the following:

  1. Try to disable the rendering of the robotic arm and see what happens (a rough sketch of one way to do this follows after this list). In our evaluation experiments, occlusion between the robotic arm and the target objects is always the biggest problem, and SAM cannot handle that heavy occlusion. As noted in the README, the robotic arm is not rendered when we conduct the evaluation.

  2. Try some other tasks, and see what happens.
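For reference, a minimal sketch of how the arm rendering could be switched off when creating the environment. The `hide_arm_rgb` keyword is my recollection of the VIMA-Bench `make()` interface, so please double-check it against the benchmark code:

```python
# Hedged sketch: create a VIMA-Bench env without rendering the robot arm.
# `hide_arm_rgb` and the other keywords are assumptions about the
# vima_bench.make() signature; verify them before relying on this.
import vima_bench

env = vima_bench.make(
    task_name="visual_manipulation",  # example task, replace with yours
    hide_arm_rgb=True,                # assumed flag: keep the arm out of RGB obs
    seed=42,
)
obs = env.reset()
```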

If you have any further issues, just re-open this issue.

Hope that helps.

Hi Siyuan,

The problem has been resolved by using the laion2b_s34b_b79k OpenCLIP model.
Since I had not downloaded the CLIP model beforehand, I modified the model-loading code so that the model is downloaded automatically:
```python
clip_index = 3
model, _, preprocess = open_clip.create_model_and_transforms(
    models[clip_index],
    device=device,
    pretrained="openai",
)
```
It seems the model I used was not the right one. I am new to LLMs, so if you have any suggestions, please let me know; I would appreciate it.
I have now modified the code to:
```python
clip_index = 1
model, _, preprocess = open_clip.create_model_and_transforms(
    models[clip_index],
    device=device,
    pretrained="laion2b_s34b_b79k",
)
```
which downloads the model automatically and detects the objects accurately.
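In case it helps others, one way to check which pretrained tags are valid for a given architecture is open_clip's built-in listing (a small sketch; the exact pairs printed depend on the installed open_clip version):

```python
import open_clip

# Each entry is a (model_name, pretrained_tag) pair that
# create_model_and_transforms() will accept.
for model_name, tag in open_clip.list_pretrained():
    if model_name == "ViT-B-32":
        print(model_name, tag)  # e.g. ('ViT-B-32', 'openai'), ('ViT-B-32', 'laion2b_s34b_b79k'), ...
```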

Hi,

As we stated in our paper, the bigger the model you use, the better accuracy you obtain.

The model list I wrote is only used for the ablation over different CLIP models; you just need to load the model and preprocess with OpenCLIP, with no need for clip_index or anything else. Also, I run the evaluation on a PC without an internet connection, which is why I have to load the checkpoint locally, but you do not need to follow that strictly. You can refer to OpenCLIP for more details.
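For completeness, a rough sketch of the offline-loading route, assuming you have already downloaded a checkpoint file; the path and model name below are placeholders:

```python
import open_clip

# `pretrained` also accepts a filesystem path to a checkpoint, which is
# handy on machines without internet access (the path below is a placeholder).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32",
    pretrained="/path/to/vit_b_32-laion2b_s34b_b79k.pt",
    device="cuda",
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
```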

Best,