vimalabs/VIMA

Since the inputs are images of single objects, how does the model know the relative positions and distances between objects?

zhufq00 opened this issue · 1 comment

I have read this paper and it is very interesting. I assumed that images of full scenes are fed to the model, but I couldn't find any relevant details about that. All I see is that objects in the full scene are extracted as images of single objects. How does the model know the relative positions and distances between objects? Thank you very much.

Thanks for your interest in our project. For the object-centric representation, as mentioned in Sec. 4 (Tokenization), we also encode the bounding box coordinates. These features are then fused with the objects' image features to produce the object tokens.
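
To make the idea concrete, here is a minimal sketch of such an object tokenizer. It is an illustration, not VIMA's actual code: the module name `ObjectTokenizer`, the MLP bounding-box encoder, the linear fusion layer, and the dimensions are all assumptions; only the general scheme (fusing per-object image features with bounding-box coordinates into object tokens) comes from the paper.

```python
import torch
import torch.nn as nn


class ObjectTokenizer(nn.Module):
    """Hypothetical sketch: fuse a cropped object's image features with its
    bounding-box coordinates to form a single object token."""

    def __init__(self, img_feat_dim: int = 768, bbox_dim: int = 4, token_dim: int = 768):
        super().__init__()
        # Hypothetical encoders; VIMA's actual modules and sizes may differ.
        self.bbox_encoder = nn.Sequential(
            nn.Linear(bbox_dim, token_dim),
            nn.ReLU(),
            nn.Linear(token_dim, token_dim),
        )
        self.fuse = nn.Linear(img_feat_dim + token_dim, token_dim)

    def forward(self, img_feats: torch.Tensor, bboxes: torch.Tensor) -> torch.Tensor:
        # img_feats: (num_objects, img_feat_dim) features of single-object crops
        # bboxes:    (num_objects, 4) normalized box coordinates in the full scene
        bbox_feats = self.bbox_encoder(bboxes)
        # Each token carries both appearance and spatial information, so the
        # transformer can reason about relative positions across object tokens.
        return self.fuse(torch.cat([img_feats, bbox_feats], dim=-1))


# Example: three objects detected in one scene
tokenizer = ObjectTokenizer()
img_feats = torch.randn(3, 768)  # e.g., features of each object crop
bboxes = torch.tensor([[0.10, 0.20, 0.30, 0.30],
                       [0.50, 0.50, 0.20, 0.20],
                       [0.70, 0.10, 0.10, 0.40]])
object_tokens = tokenizer(img_feats, bboxes)  # shape: (3, 768)
```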