microsoft/X-Decoder

Training Implementation

Closed this issue · 1 comment

In the current training configuration, X-Decoder utilizes roughly five different objectives across tasks, including masking, grounding, caption, captioning, and retrieval. (Honestly, I cannot fully understand the difference between "caption" and "captioning".)

In the official paper, you mention using ITC (Image-Text Contrastive), MLM (Text Generative), and Mask objectives for X-Decoder training, which differs significantly from the configuration above.

I am curious about the details of the current training implementation. Also, if I were to train X-Decoder using only the objectives outlined in the official paper, I am wondering whether this would cause performance degradation.

The code provided here is exactly how I trained the model (there may be minor mistakes due to code-base migration). For a detailed comparison: the losses mentioned in the paper are the same as those implemented in the code, with only minor variations that we adjusted.
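For intuition, the multiple objectives discussed above are typically combined into a single scalar loss as a weighted sum, where each task contributes only when the current batch produces it. This is a minimal sketch of that pattern; the loss names and weights below are illustrative assumptions, not X-Decoder's actual config keys or values.

```python
# Hypothetical per-objective weights; the names ("mask", "grounding",
# "itc", "mlm") are assumptions for illustration, not the repo's keys.
LOSS_WEIGHTS = {
    "mask": 1.0,       # mask prediction (segmentation)
    "grounding": 1.0,  # word-region grounding
    "itc": 1.0,        # image-text contrastive (retrieval)
    "mlm": 1.0,        # text generation / captioning
}

def total_loss(losses, weights=LOSS_WEIGHTS):
    """Weighted sum over whichever objectives the current batch produced.

    `losses` maps objective name -> scalar loss value; objectives
    absent from the batch simply contribute nothing.
    """
    return sum(weights[name] * value for name, value in losses.items())

# Example: a batch that only produced mask and contrastive losses.
print(total_loss({"mask": 2.0, "itc": 0.5}))  # -> 2.5
```

Under this scheme, training on only the paper's three objectives amounts to zeroing out (or omitting) the extra entries, which is why the two descriptions can coexist in one codebase.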