About "X-Decoder-Seg+"
ZhangYuanhan-AI opened this issue · 3 comments
ZhangYuanhan-AI commented
Hi, thanks for this nice work!
Please specify the process of "we take the heuristic way to extract noun phrases from COCO captions and use them as extra supervision on top of the matched decoder outputs".
- What do you mean by "matched decoder outputs"
- How does a "noun phrase" match a decoder output?
Looking forward to your reply!
MaureenZOU commented
Hi Zhangyuan,
Thanks for your interest in our work : )
For X-Decoder-Seg+:
- We extract all the noun phrases from the COCO captions as N.
- We exclude all nouns that already appear in the panoptic GT, using similarity matching (exclude similarity > 0.95) with a UniCL-pretrained Focal-B text encoder (CLIP will also work). The remaining phrases form *N.
- After doing Hungarian matching on class and mask for panoptic segmentation, we take the remaining object queries that were not matched for panoptic segmentation and match them with the class embeddings of *N. We then train a multi-GPU contrastive loss on the matched representations.
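The second step above (filtering out noun phrases that duplicate panoptic GT categories) can be sketched as follows. This is an illustrative helper, not the repo's actual code: `exclude_overlapping_nouns` is a hypothetical name, and the embeddings are assumed to come from a text encoder such as the UniCL-pretrained Focal-B mentioned above.

```python
import torch
import torch.nn.functional as F

def exclude_overlapping_nouns(noun_embs, gt_embs, nouns, threshold=0.95):
    """Drop caption noun phrases whose text embedding is nearly identical
    (cosine similarity > threshold) to any panoptic GT class embedding.

    noun_embs: (N, D) embeddings of the caption noun phrases.
    gt_embs:   (G, D) embeddings of the panoptic GT category names.
    nouns:     list of N noun-phrase strings.
    Returns the surviving noun phrases (*N in the thread's notation).
    """
    noun_embs = F.normalize(noun_embs, dim=-1)
    gt_embs = F.normalize(gt_embs, dim=-1)
    sim = noun_embs @ gt_embs.t()                 # (N, G) cosine similarities
    keep = sim.max(dim=-1).values <= threshold    # keep phrases not covered by GT
    return [n for n, k in zip(nouns, keep) if k]
```

For example, with GT class "dog" and caption nouns ["dog", "frisbee"], only "frisbee" survives and becomes extra supervision.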
ZhangYuanhan-AI commented
Thanks for your prompt response.
From my understanding, the "object queries" are the "latent queries". Say there are two latent queries Os1, Os2, and two noun phrases *N1 and *N2 — how do you match Os1 to either *N1 or *N2?
MaureenZOU commented
- First exclude all the nouns that have already appeared in the panoptic segmentation GT, then apply the following matching function:
```python
import numpy as np
import torch
from scipy.optimize import linear_sum_assignment

@torch.no_grad()
def caption_forward_womask(self, outputs, targets, extra):
    """More memory-friendly matching"""
    bs, _ = outputs["pred_logits"].shape[:2]
    if bs == 0 or len(targets) == 0:
        return None
    indices = []
    t_emb = torch.cat([t['captions'] for t in targets])
    v_emb = outputs['unmatched_pred_captions']
    caption_target_count = np.cumsum([0] + [len(t['captions']) for t in targets])
    # Iterate through batch size
    for b in range(bs):
        v_emb[b] = v_emb[b] / (v_emb[b].norm(dim=-1, keepdim=True) + 1e-7)
        num_queries = len(v_emb[b])
        # vl_similarity is defined elsewhere in the X-Decoder codebase
        out_prob = vl_similarity(v_emb[b][None,], t_emb, temperature=extra['temperature']).softmax(-1)[0]
        tgt_ids = [idx for idx in range(caption_target_count[b], caption_target_count[b+1])]
        # Compute the classification cost. Contrary to the loss, we don't use the NLL,
        # but approximate it in 1 - proba[target class].
        # The 1 is a constant that doesn't change the matching, it can be omitted.
        cost_class = -out_prob[:, tgt_ids]
        # Final cost matrix
        C = (self.cost_class * cost_class)
        C = C.reshape(num_queries, -1).cpu()
        indices.append(linear_sum_assignment(C))
    return [
        (torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64))
        for i, j in indices
    ]
```
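The matcher above only returns index pairs per image; the contrastive loss is then computed on the matched representations. A minimal single-GPU sketch of that last step (the actual training gathers embeddings across GPUs; `matched_contrastive_loss` is a hypothetical helper, not X-Decoder's implementation):

```python
import torch
import torch.nn.functional as F

def matched_contrastive_loss(query_emb, text_emb, indices_b, temperature=0.07):
    """Sketch of the contrastive loss on matched pairs for ONE image.

    query_emb: (Q, D) unmatched object-query embeddings for this image.
    text_emb:  (T, D) class embeddings of the surviving noun phrases *N.
    indices_b: (i, j) LongTensors from the matcher, pairing query i[k] with noun j[k].
    """
    i, j = indices_b
    q = F.normalize(query_emb[i], dim=-1)    # matched queries
    t = F.normalize(text_emb[j], dim=-1)     # their matched noun embeddings
    logits = q @ t.t() / temperature         # (M, M) pairwise similarities
    labels = torch.arange(len(i))
    # Symmetric InfoNCE: each query should retrieve its own noun and vice versa
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

In the multi-GPU setting, the negatives would come from matched pairs on all GPUs rather than just the local image.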