microsoft/X-Decoder

Referring captioning demo not using grounding mask

bhpfelix opened this issue · 1 comment

Hi, thanks for the great work! Quick question about demo/demo_refcap.py: the grounding mask is zeroed out at this line, which seems counterintuitive if we want to pass it to the cross-attention layers. Should the line be removed for proper behavior?

Thanks so much for the question and for digging into the code! That line can be removed for proper behavior; I added it for debugging purposes. Please go ahead and remove it. I will update the code shortly.
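For anyone following along, here is a minimal sketch of why zeroing the mask matters. It is not the code from demo/demo_refcap.py: the function `masked_attention`, the 4-location toy input, and the convention that 1 means "attend here" are all assumptions made for illustration, as is the fallback to unmasked attention when the mask is empty.

```python
import math

def masked_attention(scores, keep_mask):
    """Softmax over `scores`, restricted to positions where keep_mask is 1.

    Assumption: if no position is kept, attend everywhere instead of
    dividing by zero, so an all-zero mask carries no region information.
    """
    if not any(keep_mask):
        keep_mask = [1] * len(scores)
    exps = [math.exp(s) if k else 0.0 for s, k in zip(scores, keep_mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy input: one query with equal similarity to 4 image locations.
scores = [0.0, 0.0, 0.0, 0.0]
# Hypothetical grounding mask selecting locations 1 and 2.
grounding_mask = [0, 1, 1, 0]

# Attention stays inside the referred region.
with_mask = masked_attention(scores, grounding_mask)

# Zeroing the mask first (the debug behavior in question) discards the region,
# so attention spreads over the whole image.
zeroed = masked_attention(scores, [0 * m for m in grounding_mask])

print(with_mask)
print(zeroed)
```

Under these assumptions, the zeroed mask gives the same attention pattern for any referred region, which would match the observation that the caption does not depend on the grounding mask.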