mmaaz60/mvits_for_class_agnostic_od

Questions about your training procedure?

GYslchen opened this issue · 1 comment

To my understanding, you use image-text pairs as inputs and only bounding-box annotations as supervision, without any class labels. Is that right?

Hi @GYslchen,

Thank you for your interest in our work. We use aligned image-text pairs to pretrain our MAVL model. Similar to MDETR, MAVL uses a soft-token alignment loss during pretraining: for each detected object, the model is trained to predict a uniform probability distribution over the text tokens of the span that refers to it. Please refer to Sec. 2.2.2 and Appendix A of the MDETR paper for more details.
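For intuition, the soft-token loss described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the repository's actual implementation (which follows MDETR in PyTorch); the function name, shapes, and the simple averaging over objects are assumptions for the sketch.

```python
import numpy as np

def soft_token_loss(logits, span_masks):
    """Sketch of a soft-token alignment loss in the spirit of MDETR.

    logits:     (num_objects, num_tokens) alignment scores, one row per
                matched object query, one column per text-token position.
    span_masks: (num_objects, num_tokens) 0/1 masks; 1 marks the tokens
                of the text span that refers to each object.
    """
    # Target: a uniform distribution over each object's span tokens.
    targets = span_masks / span_masks.sum(axis=1, keepdims=True)

    # Log-softmax over token positions (numerically stable).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Soft cross-entropy between target and predicted distributions,
    # averaged over the matched objects.
    return -(targets * log_probs).sum(axis=1).mean()

# Toy example: two objects, a three-token caption.
logits = np.array([[10.0, 10.0, -10.0],
                   [-10.0, 10.0, 10.0]])
span_masks = np.array([[1.0, 1.0, 0.0],
                       [0.0, 1.0, 1.0]])
loss = soft_token_loss(logits, span_masks)
```

Here each prediction already concentrates its mass uniformly on the correct two-token span, so the loss is close to its minimum of ln 2 for a two-token target.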

I hope this is helpful. Please let me know if you have any further questions. Thanks!