microsoft/X-Decoder

About "X-Decoder-Seg+"

ZhangYuanhan-AI opened this issue · 3 comments

Hi, thanks for this nice work!

Please specify the process of "we take the heuristic way to extract noun phrases from COCO captions and use them as extra supervision on top of the matched decoder outputs".

  1. What do you mean by "matched decoder outputs"?
  2. How does a "noun phrase" match a decoder output?

Looking forward to your reply!

Hi Zhangyuan,

Thanks for your interest in our work : )

For X-Decoder-Seg+:

  1. We extract all the noun phrases from the COCO captions as N.
  2. We exclude every noun phrase that already appears in the panoptic GT, using similarity matching (excluding similarity > 0.95) with a UniCL-pretrained Focal-B text encoder (CLIP will also work); the remaining phrases form *N (see the sketch after this list).
  3. After doing Hungarian matching on class and mask for panoptic segmentation, we take the object queries left unmatched by panoptic segmentation and match them against the class embeddings of *N, then train a multi-GPU contrastive loss on the matched representations.
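
For concreteness, here is a minimal sketch of steps 1-2. It assumes spaCy's noun_chunks as the heuristic noun-phrase extractor and a hypothetical encode_text helper standing in for the UniCL-pretrained Focal-B (or CLIP) text encoder, returning L2-normalized embeddings; only the 0.95 threshold comes from the reply above.

    import spacy
    import torch

    nlp = spacy.load("en_core_web_sm")

    def filter_caption_phrases(caption, gt_class_names, encode_text, thresh=0.95):
        """Step 1: extract noun phrases N; step 2: drop phrases whose
        similarity to any panoptic GT class name exceeds thresh, giving *N."""
        phrases = [chunk.text for chunk in nlp(caption).noun_chunks]  # N
        if not phrases or not gt_class_names:
            return phrases
        p = encode_text(phrases)         # (P, d) text embeddings, assumed normalized
        g = encode_text(gt_class_names)  # (G, d)
        sim = p @ g.t()                  # cosine-similarity matrix, (P, G)
        keep = sim.max(dim=1).values <= thresh  # exclude similarity > 0.95
        return [ph for ph, k in zip(phrases, keep) if k]  # *N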

Thanks for your prompt response.

From my understanding, the "object queries" are the "latent queries". Say there are two latent queries Os1 and Os2, and two noun phrases *N1 and *N2; how is Os1 matched to either *N1 or *N2?

  1. First exclude all the nouns that already appear in the panoptic segmentation GT, then apply the following matching function:
    # NOTE: this method assumes numpy as np, torch,
    # scipy.optimize.linear_sum_assignment, and the repo's vl_similarity
    # helper are in scope.
    @torch.no_grad()
    def caption_forward_womask(self, outputs, targets, extra):
        """More memory-friendly matching"""
        bs, _ = outputs["pred_logits"].shape[:2]

        if bs == 0 or len(targets) == 0:
            return None

        indices = []
        t_emb = torch.cat([t['captions'] for t in targets])  # noun-phrase (*N) embeddings
        v_emb = outputs['unmatched_pred_captions']  # per-image embeddings of unmatched queries
        caption_target_count = np.cumsum([0] + [len(t['captions']) for t in targets])  # per-image offsets into t_emb

        # Iterate over images in the batch
        for b in range(bs):
            # L2-normalize the unmatched query embeddings for image b
            v_emb[b] = v_emb[b] / (v_emb[b].norm(dim=-1, keepdim=True) + 1e-7)
            num_queries = len(v_emb[b])
            # Temperature-scaled similarity of each query against every noun phrase,
            # softmaxed per query
            out_prob = vl_similarity(v_emb[b][None,], t_emb, temperature=extra['temperature']).softmax(-1)[0]
            # Indices of the noun phrases that belong to image b
            tgt_ids = list(range(caption_target_count[b], caption_target_count[b + 1]))

            # Compute the classification cost. Contrary to the loss, we don't use the NLL,
            # but approximate it as 1 - proba[target class].
            # The 1 is a constant that doesn't change the matching, so it can be omitted.
            cost_class = -out_prob[:, tgt_ids]

            # Final cost matrix
            C = (self.cost_class * cost_class)
            C = C.reshape(num_queries, -1).cpu()
            indices.append(linear_sum_assignment(C))

        # Per-image (query index, noun-phrase index) assignment pairs
        return [
            (torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64))
            for i, j in indices
        ]
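
To make the two-query question above concrete: with latent queries Os1, Os2 and remaining phrases *N1, *N2, the matcher builds a 2x2 cost matrix from the softmaxed similarities and hands it to the Hungarian solver; each query then gets at most one phrase (and vice versa), and the contrastive loss is computed only on those matched pairs. A hypothetical numeric sketch (the similarity values are made up, and vl_similarity is replaced by a plain temperature-scaled dot product):

    import torch
    from scipy.optimize import linear_sum_assignment

    # Made-up cosine similarities between (Os1, Os2) and (*N1, *N2)
    sim = torch.tensor([[0.9, 0.2],
                        [0.3, 0.8]])
    out_prob = (sim / 0.07).softmax(-1)  # temperature-scaled softmax, as in the matcher
    C = -out_prob                        # cost_class = -proba[target class]
    rows, cols = linear_sum_assignment(C.numpy())
    print(list(zip(rows.tolist(), cols.tolist())))  # [(0, 0), (1, 1)] -> Os1<->*N1, Os2<->*N2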