Question about proposal generation
YanShuang17 opened this issue · 1 comment
Hello @PkuRainBow, thanks for open-sourcing your excellent work!
I have a question about this code snippet (line 244) in deformable_transformer.py:
...
topk = self.two_stage_num_proposals
topk_proposals = torch.topk(enc_outputs_class[..., 0], topk, dim=1)[1]
topk_coords_unact = torch.gather(
    enc_outputs_coord_unact, 1, topk_proposals.unsqueeze(-1).repeat(1, 1, 4)
)
...
Does the tensor enc_outputs_class[..., 0] (where enc_outputs_class.shape = (batch_size, len_flattened_encoder_seq, 91)) represent the classification prediction for the first foreground class only?
As I understand it, the purpose here is to select the top-k foreground proposals by their highest foreground scores, taken over all foreground classes.
So why not use topk_proposals = torch.topk(enc_outputs_class.max(dim=-1)[0], topk, dim=1)[1] instead?
Could you please explain? Thanks!
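To make the difference between the two selection rules concrete, here is a minimal sketch on a toy tensor. The shapes and the hand-set logits are assumptions chosen for illustration, not values from the repository; only the `torch.topk` and `torch.gather` calls mirror the snippet above.

```python
import torch

# Toy sizes (assumed for illustration): 1 image, 4 encoder tokens, 91 class channels.
batch_size, seq_len, num_classes = 1, 4, 91
enc_outputs_class = torch.zeros(batch_size, seq_len, num_classes)
enc_outputs_class[0, 1, 0] = 2.0  # token 1 scores high in channel 0
enc_outputs_class[0, 3, 5] = 5.0  # token 3 scores high in channel 5 only

# Fake unactivated box coordinates so each token's row is easy to recognise.
enc_outputs_coord_unact = torch.arange(seq_len * 4, dtype=torch.float32).view(1, seq_len, 4)

topk = 1

# Rule used in the snippet: rank tokens by channel 0 only.
by_channel0 = torch.topk(enc_outputs_class[..., 0], topk, dim=1)[1]

# Rule proposed in the question: rank tokens by their best score over all channels.
by_max = torch.topk(enc_outputs_class.max(dim=-1)[0], topk, dim=1)[1]

# Gather the 4 box coordinates of the selected tokens, as in the snippet.
topk_coords_unact = torch.gather(
    enc_outputs_coord_unact, 1, by_channel0.unsqueeze(-1).repeat(1, 1, 4)
)

print(by_channel0.tolist())        # [[1]]
print(by_max.tolist())             # [[3]]
print(topk_coords_unact.tolist())  # [[[4.0, 5.0, 6.0, 7.0]]]
```

The two rules agree only when channel 0 carries the foreground evidence for every token; whether it does depends on how the encoder output is supervised.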
@YanShuang17 This is in fact a trick inherited from the original Deformable-DETR implementation: it converts the 91-class classification task into a binary classification task.
Please check the ground-truth conversion at:
H-Deformable-DETR/models/deformable_detr.py
Lines 515 to 519 in 10af735
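In other words, the encoder is supervised to perform class-agnostic foreground object detection, rather than the multi-category object detection performed by the subsequent decoder layers. Channel 0 of enc_outputs_class therefore acts as the foreground score, which is why the top-k selection uses only that channel.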
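The referenced lines are not reproduced in this thread, so the following is only a hedged sketch of what such a ground-truth conversion could look like. It assumes that `targets` is a list of per-image dicts with `labels` and `boxes` tensors and that "binary" means every ground-truth label is mapped to class index 0; the variable names are illustrative.

```python
import copy

import torch

# Hypothetical per-image targets with multi-category labels (assumed format).
targets = [
    {"labels": torch.tensor([17, 3, 62]), "boxes": torch.rand(3, 4)},
    {"labels": torch.tensor([1]), "boxes": torch.rand(1, 4)},
]

# Copy the targets so the decoder losses still see the original categories.
bin_targets = copy.deepcopy(targets)
for bt in bin_targets:
    # Every object becomes "foreground", encoded as class index 0.
    bt["labels"] = torch.zeros_like(bt["labels"])

print(bin_targets[0]["labels"].tolist())  # [0, 0, 0]
print(targets[0]["labels"].tolist())      # [17, 3, 62]
```

Under this assumption, the encoder's classification loss only ever has positives in channel 0, so that channel is the one that learns to score "any object here".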