microsoft/X-Decoder

Detach for text

vateye opened this issue · 3 comments

Hi, I am quite confused about the loss computation. When computing the loss for the learnable queries, I saw that the text features are detached and thus will not receive gradients.

_caping_lang_embed = caping_lang_embed.detach().clone()

Hi,
The text features are not detached in all settings: they are detached for the per-layer output but kept attached on `query_embed`. This is an empirical design choice.

So, during training on tasks that use the learnable queries (e.g., segmentation, grounding), are the text features always detached?

Nope, please go back to the code:

query_embed = torch.cat((query_embed, caping_lang_embed), dim=0) # may not add at the beginning.

The query embedding is attached, so gradients flow back to the text features through that path.