Questions about your training procedure?
GYslchen opened this issue · 1 comment
GYslchen commented
As I understand it, you use image-text pairs as inputs and only bbox annotations as supervision signals, without any class labels. Is that right?
mmaaz60 commented
Hi @GYslchen,
Thank you for your interest in our work. We use aligned image-text pairs to pretrain our MAVL model. Similar to MDETR, MAVL uses a soft-token alignment loss during pretraining: for each detected object, the model predicts a uniform probability distribution over the text tokens that refer to it. Please refer to Sec. 2.2.2 and Appendix A of the MDETR paper for more details.
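To make the idea concrete, here is a minimal, dependency-free sketch of a soft-token target and the matching soft cross-entropy. It is only an illustration of the technique as described in the MDETR paper, not the MAVL or MDETR training code: the function names, the example caption, the token positions, and the logits are all hypothetical, and the extra "no object" slot used for unmatched queries is left out.

```python
import math


def soft_token_target(token_positions, num_tokens):
    """Uniform distribution over the token positions that refer to one object."""
    positions = set(token_positions)
    if not positions:
        raise ValueError("an object must refer to at least one token")
    if any(p < 0 or p >= num_tokens for p in positions):
        raise ValueError("token position out of range")
    weight = 1.0 / len(positions)
    return [weight if i in positions else 0.0 for i in range(num_tokens)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def soft_token_loss(logits, target):
    """Soft cross-entropy between predicted token logits and the soft target."""
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, log_probs))


# Hypothetical caption tokens: ["a", "black", "cat", "on", "a", "sofa"].
# Assume the box for the cat is grounded to "a black cat" (positions 0-2).
target = soft_token_target([0, 1, 2], num_tokens=6)

# Hypothetical per-token logits predicted for that object query.
logits = [2.0, 2.0, 2.0, -1.0, -1.0, -1.0]
loss = soft_token_loss(logits, target)
```

No class label appears anywhere in this sketch: the supervision for each box is the set of caption tokens it is aligned with.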
I hope this helps. Please let me know if you have any further questions. Thanks!