dingjiansw101/ZegFormer

question about how cls_score is computed

Opened this issue · 1 comment

Thank you for your great contribution.

In transformer_zeroshot_predictor.py, cls_score is computed as follows in your code:

if self.mask_classification:
    x_cls = self.projection_layer(hs)
    # TODO: check if it is l2 norm
    x_cls = x_cls / x_cls.norm(dim=-1, keepdim=True)
    logit_scale = self.logit_scale.exp()
    if self.training:
        cls_score = logit_scale * x_cls @ self.text_features.clone().detach().t()
    else:
        cls_score = logit_scale * x_cls @ self.text_features_test.clone().detach().t()

    bg_score = logit_scale * x_cls @ self.bg_feature.t()
    outputs_class = torch.cat((cls_score, bg_score), -1)
    out = {"pred_logits": outputs_class[-1]}

x_cls has the shape of [num_dec_layer, bsz, num_queries, hidden_dim], which is [6, 32, 100, 512].
cls_score is computed by matrix multiplication between x_cls shaped [6, 32, 100, 512] and text_features.t() shaped [512, 15]. That makes the shape of cls_score [6, 32, 100, 15].
Then it is concatenated with bg_score of shape [6, 32, 100, 1] along the last dimension to obtain outputs_class with shape [6, 32, 100, 16].
Finally, out["pred_logits"] is outputs_class[-1].
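The shape bookkeeping above can be verified with a minimal standalone sketch (random tensors, sizes taken from the post; this is not the repo's actual module):

```python
import torch

# Hypothetical sizes matching the shapes described above
num_layers, bsz, num_queries, hidden_dim, num_classes = 6, 32, 100, 512, 15

x_cls = torch.randn(num_layers, bsz, num_queries, hidden_dim)
text_features = torch.randn(num_classes, hidden_dim)  # stand-in for self.text_features
bg_feature = torch.randn(1, hidden_dim)               # stand-in for self.bg_feature

cls_score = x_cls @ text_features.t()                 # [6, 32, 100, 15]
bg_score = x_cls @ bg_feature.t()                     # [6, 32, 100, 1]
outputs_class = torch.cat((cls_score, bg_score), -1)  # [6, 32, 100, 16]
pred_logits = outputs_class[-1]                       # [32, 100, 16]
```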

As I understand the code, the same computation is performed for all 6 decoder layers along dimension 0, but only the final layer's output is kept as pred_logits. Therefore, I extract the final layer of x_cls from the beginning:

if self.mask_classification:
    x_cls = self.projection_layer(hs)
    # TODO: check if it is l2 norm
    x_cls = x_cls / x_cls.norm(dim=-1, keepdim=True)
    x_cls = x_cls[-1]                                                        # Extract final layer of x_cls
    logit_scale = self.logit_scale.exp()
    if self.training:
        cls_score = logit_scale * x_cls @ self.text_features.clone().detach().t()
    else:
        cls_score = logit_scale * x_cls @ self.text_features_test.clone().detach().t()

    bg_score = logit_scale * x_cls @ self.bg_feature.t()
    outputs_class = torch.cat((cls_score, bg_score), -1)
    out = {"pred_logits": outputs_class}                       # return

The shape of out["pred_logits"] is the same in both cases, so I expected training to proceed normally.
However, I encounter this error while training with my modified code:

ERROR [01/10 13:23:10 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
....
File "/home/phuongln6/ZegFormer/mask_former/modeling/matcher.py", line 117, in memory_efficient_forward
    cost_class = -out_prob[:, tgt_ids]
IndexError: too many indices for tensor of dimension 1

It turns out that when I set batch_size=32, this error happens when training sample 33 is reached.
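This IndexError can be reproduced in isolation: the expression out_prob[:, tgt_ids] requires out_prob to be at least 2-D, so it fails whenever a leading dimension has been consumed somewhere upstream. A minimal sketch (hypothetical tensors, not ZegFormer's actual matcher):

```python
import torch

tgt_ids = torch.tensor([0, 3, 7])

# Expected per-image shape: [num_queries, num_classes]
out_prob = torch.rand(100, 16).softmax(-1)
cost_class = -out_prob[:, tgt_ids]        # works, shape [100, 3]

# If a leading dim was already dropped, out_prob ends up 1-D
flat_prob = torch.rand(16).softmax(-1)
try:
    _ = -flat_prob[:, tgt_ids]
except IndexError as err:
    caught = str(err)                     # "too many indices for tensor of dimension 1"
```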

Can you help me debug my code?

Hi, maybe you should use the shape [1, 32, 100, 512] instead of [32, 100, 512]?
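If I read the suggestion correctly (an assumption, not verified against the repo), the idea is to keep the decoder-layer dimension with size 1 rather than dropping it, i.e. slice with [-1:] instead of indexing with [-1]:

```python
import torch

x_cls = torch.randn(6, 32, 100, 512)

# x_cls[-1] removes the layer dimension; x_cls[-1:] keeps it with size 1,
# so downstream code that indexes outputs_class[-1] still sees a layer axis.
kept = x_cls[-1:]      # shape [1, 32, 100, 512]
dropped = x_cls[-1]    # shape [32, 100, 512]
```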