HDETR/H-Deformable-DETR

Question about proposal generation

YanShuang17 opened this issue · 1 comments

Hello @PkuRainBow , thanks for open-sourcing your excellent work!

I have a question about this code snippet (around line 244) in deformable_transformer.py:

...
            topk = self.two_stage_num_proposals
            topk_proposals = torch.topk(enc_outputs_class[..., 0], topk, dim=1)[1]
            topk_coords_unact = torch.gather(
                enc_outputs_coord_unact, 1, topk_proposals.unsqueeze(-1).repeat(1, 1, 4)
            )
...
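For context, the selection above can be sketched with a small NumPy analogue (a sketch only, not the repository's PyTorch code; topk_gather and the toy shapes are hypothetical): take the k highest scores from channel 0 per batch element, then gather the matching 4-d box proposals, mirroring torch.gather with the repeat(1, 1, 4) trick.

```python
import numpy as np

def topk_gather(scores, coords, k):
    """Pick the top-k positions per batch by score and gather their 4-d boxes.

    scores: (B, N)    -- analogue of enc_outputs_class[..., 0]
    coords: (B, N, 4) -- analogue of enc_outputs_coord_unact
    """
    # indices of the k highest scores along the sequence dimension
    idx = np.argsort(-scores, axis=1)[:, :k]  # (B, k)
    # expand indices to (B, k, 4), like topk_proposals.unsqueeze(-1).repeat(1, 1, 4)
    idx4 = idx[..., None].repeat(4, axis=-1)
    # gather the corresponding boxes, like torch.gather(..., dim=1)
    return np.take_along_axis(coords, idx4, axis=1)  # (B, k, 4)

# toy shapes for illustration
B, N, k = 2, 6, 3
scores = np.random.rand(B, N)
coords = np.random.rand(B, N, 4)
print(topk_gather(scores, coords, k).shape)  # (2, 3, 4)
```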

The tensor enc_outputs_class[..., 0] (where enc_outputs_class.shape = (batch_size, len_flattened_encoder_seq, 91)) represents the classification prediction for the first foreground class, right?

In my understanding, the purpose here is to pick the top-k foreground proposals according to the k highest foreground scores, taken over all foreground classes.

So why not use topk_proposals = torch.topk(enc_outputs_class.max(dim=-1)[0], topk, dim=1)[1] instead?

Could you please give some explanation? Thanks!

@YanShuang17 This is in fact a tricky implementation detail inherited from the original Deformable-DETR, which converts the 91-way classification task into a binary classification task for the encoder.

Please check the ground-truth conversion at:

if "enc_outputs" in outputs:
    enc_outputs = outputs["enc_outputs"]
    bin_targets = copy.deepcopy(targets)
    for bt in bin_targets:
        bt["labels"] = torch.zeros_like(bt["labels"])

In other words, the encoder is supervised to perform class-agnostic foreground detection, rather than the multi-category object detection performed by the subsequent decoder layers. Since all encoder targets are relabeled to class 0, only logit 0 receives positive supervision, so enc_outputs_class[..., 0] is effectively a foreground/objectness score.
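The relabeling quoted above can be illustrated with a minimal stand-alone sketch (plain Python with hypothetical toy targets; the real code operates on torch tensors): every annotated object, whatever its category, is mapped to class 0, which is what makes channel 0 of the encoder logits a foreground score.

```python
import copy

# hypothetical toy targets mimicking the DETR-style target dicts
targets = [
    {"labels": [17, 3, 62]},  # original category ids
    {"labels": [5]},
]

# binary relabeling, as in the quoted loss code:
# every ground-truth object becomes class 0 ("foreground")
bin_targets = copy.deepcopy(targets)
for bt in bin_targets:
    bt["labels"] = [0 for _ in bt["labels"]]  # torch.zeros_like analogue

print(bin_targets)  # [{'labels': [0, 0, 0]}, {'labels': [0]}]
print(targets)      # originals untouched, thanks to deepcopy
```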