HDETR/H-Deformable-DETR

Question about proposal generation

YanShuang17 opened this issue · 1 comments

Hello @PkuRainBow , thanks for open-sourcing your excellent work!

I have a question about this code snippet (around line 244) in deformable_transformer.py:

...
            topk = self.two_stage_num_proposals
            topk_proposals = torch.topk(enc_outputs_class[..., 0], topk, dim=1)[1]
            topk_coords_unact = torch.gather(
                enc_outputs_coord_unact, 1, topk_proposals.unsqueeze(-1).repeat(1, 1, 4)
            )
...
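For context, the selection above can be sketched with a small NumPy analogue (a sketch only, not the repository's PyTorch code; topk_gather and the toy shapes are hypothetical): take the k highest scores from channel 0 per batch element, then gather the matching 4-d box proposals, mirroring torch.gather with the repeat(1, 1, 4) trick.

```python
import numpy as np

def topk_gather(scores, coords, k):
    """Pick the top-k positions per batch by score and gather their 4-d boxes.

    scores: (B, N)    -- analogue of enc_outputs_class[..., 0]
    coords: (B, N, 4) -- analogue of enc_outputs_coord_unact
    """
    # indices of the k highest scores along the sequence dimension
    idx = np.argsort(-scores, axis=1)[:, :k]  # (B, k)
    # expand indices to (B, k, 4), like topk_proposals.unsqueeze(-1).repeat(1, 1, 4)
    idx4 = idx[..., None].repeat(4, axis=-1)
    # gather the corresponding boxes, like torch.gather(..., dim=1)
    return np.take_along_axis(coords, idx4, axis=1)  # (B, k, 4)

# toy shapes for illustration
B, N, k = 2, 6, 3
scores = np.random.rand(B, N)
coords = np.random.rand(B, N, 4)
print(topk_gather(scores, coords, k).shape)  # (2, 3, 4)
```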

The tensor enc_outputs_class[..., 0] (where enc_outputs_class.shape = (batch_size, len_flattened_encoder_seq, 91)) represents the classification prediction for the first foreground class, right?

In my understanding, the purpose here is to pick the top-k foreground proposals according to the k highest foreground scores, taken over all foreground classes.

So why not use topk_proposals = torch.topk(enc_outputs_class.max(dim=-1)[0], topk, dim=1)[1] instead?

Could you please give some explanation? Thanks!

@YanShuang17 This is in fact a tricky implementation detail inherited from the original Deformable-DETR, which converts the 91-way classification task into a binary classification task for the encoder.

Please check the ground-truth conversion at:

if "enc_outputs" in outputs:
    enc_outputs = outputs["enc_outputs"]
    bin_targets = copy.deepcopy(targets)
    for bt in bin_targets:
        bt["labels"] = torch.zeros_like(bt["labels"])

In other words, the encoder is supervised to perform class-agnostic foreground detection, rather than the multi-category object detection performed by the subsequent decoder layers. Since all encoder targets are relabeled to class 0, only logit 0 receives positive supervision, so enc_outputs_class[..., 0] is effectively a foreground/objectness score.
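The relabeling quoted above can be illustrated with a minimal stand-alone sketch (plain Python with hypothetical toy targets; the real code operates on torch tensors): every annotated object, whatever its category, is mapped to class 0, which is what makes channel 0 of the encoder logits a foreground score.

```python
import copy

# hypothetical toy targets mimicking the DETR-style target dicts
targets = [
    {"labels": [17, 3, 62]},  # original category ids
    {"labels": [5]},
]

# binary relabeling, as in the quoted loss code:
# every ground-truth object becomes class 0 ("foreground")
bin_targets = copy.deepcopy(targets)
for bt in bin_targets:
    bt["labels"] = [0 for _ in bt["labels"]]  # torch.zeros_like analogue

print(bin_targets)  # [{'labels': [0, 0, 0]}, {'labels': [0]}]
print(targets)      # originals untouched, thanks to deepcopy
```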