microsoft/X-Decoder

Clarification on Referring Segmentation

yxchng opened this issue · 1 comment

yxchng commented

Based on the code:

texts_grd.append([x['raw'].lower() for x in ann['sentences']])

t_emb = getattr(self.sem_seg_head.predictor.lang_encoder, "{}_text_embeddings".format('grounding')).t()
v_emb = caption_pred_result[:-1]
v_emb = v_emb / (v_emb.norm(dim=-1, keepdim=True) + 1e-7)
vt_sim = v_emb @ t_emb
max_id = vt_sim.max(0)[1][0]
grd_masks += [mask_pred_result[max_id]]

Is it true that referring segmentation in X-Decoder is done as segmentation followed by classification, i.e. by picking the predicted mask whose embedding has the highest similarity to the text embedding?

Yes, but we are not matching the mask embeddings directly. We compute Hungarian matching between predicted and ground-truth masks, and between predicted and ground-truth text embeddings. During evaluation, only the text-embedding matching score is used.
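
For readers following along, the evaluation-time selection described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the repository's code: the function name `select_masks_by_text`, the toy inputs, and the assumption that the text embeddings arrive already L2-normalized with shape (num_texts, dim) are all hypothetical. It also picks a mask for every text query, whereas the quoted snippet takes only the index for the first text.

```python
import numpy as np

def select_masks_by_text(v_emb, t_emb, masks, eps=1e-7):
    """For each text query, pick the predicted mask with the highest similarity.

    v_emb: (num_queries, dim) per-query embeddings from the decoder
    t_emb: (num_texts, dim) text embeddings (assumed already L2-normalized)
    masks: (num_queries, H, W) predicted masks, one per query
    """
    # Normalize the query embeddings, mirroring the quoted snippet.
    v_emb = v_emb / (np.linalg.norm(v_emb, axis=-1, keepdims=True) + eps)
    # Similarity between every query and every text: (num_queries, num_texts).
    vt_sim = v_emb @ t_emb.T
    # Best-matching query index for each text: (num_texts,).
    max_ids = vt_sim.argmax(axis=0)
    return masks[max_ids], max_ids

# Toy example (hypothetical values): 3 queries, 2 texts, 2x2 masks.
v_emb = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
t_emb = np.array([[0.0, 1.0], [1.0, 0.0]])
masks = np.stack([np.full((2, 2), i) for i in range(3)])

selected, ids = select_masks_by_text(v_emb, t_emb, masks)
print(ids)  # index of the chosen query for each text
```

No ground truth is involved in this step. The Hungarian matching mentioned above needs ground-truth masks and text embeddings, so it is not part of this sketch.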