amazon-science/semimtr-text-recognition

The shape of the projector's input

YusenZhang826 opened this issue · 5 comments

Hi, I'm a little confused about the forward function in SeqCLR. In seqclr_proj.py line 59, the output of the visual backbone is reshaped as # (N, E, H, W) -> (N, H*W, E), but in OCR tasks the features are usually processed as # (N, E, H, W) -> (N, W, E*H), and the explanation in the paper is "Note that the sequence length depends on the width of the input image." So what is the right input shape for the projector? Thanks!

Hi,
Thank you for your interest in our work.
The SeqCLR projection is applied to the working_layer (see here), which can be either backbone_feature or feature. In the latter case, the shape of the features is indeed (N, T, E) (see here). If you work on the backbone_feature, then the shape is (N, E, H, W) (see here). Therefore, in this case we first reshape it to (N, H*W, E) (see here).

Regarding your specific suggestion: the goal of the projection is to linearly transform the features into a different (usually lower-dimensional) subspace. Therefore, we only want to change the number of channels, i.e., E, and to preserve the spatial dimensions H and W.
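
For illustration, here is a minimal sketch of this reshape-then-project step, assuming a plain linear head (the class and variable names below are illustrative, not the exact seqclr_proj.py code):

import torch
import torch.nn as nn

class SeqCLRProjectionSketch(nn.Module):
    # Reshape (N, E, H, W) -> (N, H*W, E), then linearly project the channel dim E -> E_proj.
    def __init__(self, in_channels, proj_channels):
        super().__init__()
        self.proj = nn.Linear(in_channels, proj_channels)

    def forward(self, x):
        n, e, h, w = x.shape
        # Flatten the spatial grid into a sequence of H*W frames,
        # keeping E as the per-frame feature dimension.
        x = x.view(n, e, h * w).permute(0, 2, 1)  # (N, H*W, E)
        return self.proj(x)                       # (N, H*W, E_proj)

# Example: a 4-image batch of 512-channel 8x32 feature maps -> (4, 256, 128)
feats = torch.randn(4, 512, 8, 32)
print(SeqCLRProjectionSketch(512, 128)(feats).shape)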

Let me know if you have follow-up questions,
Aviad

Now I understand the goal of the projection. I would also like to know how you handle the output of the visual backbone in the fine-tuning stage (training with labeled data). Is the tensor of shape (N, E, H, W) reshaped to (N, H*W, E) or to (N, W, E*H) before being fed into the CTC or attention decoder? I think the latter, (N, W, E*H), is more common in text recognition tasks, but it would be inconsistent with the (N, H*W, E) shape used during pre-training.
Looking forward to your response. Thanks!

Hi,

In the vision model, there is a transformer unit which is applied after the backbone:

attn_vecs, attn_scores = self.attention(features) # (N, T, E), (N, T, H, W)

This 2D attention layer operates directly on the (N, E, H, W) feature map (the attention scores have shape (N, T, H, W)) and outputs a tensor of shape (N, T, E).
To answer your question explicitly: we use a 2D attention-based decoder, so we don't need the reshape you mentioned for the supervised fine-tuning.
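
For intuition, here is a minimal sketch of such a 2D attention readout, assuming T learned position queries attending over the H*W locations (the names and details are illustrative, not the repo's actual attention module):

import torch
import torch.nn as nn

class PositionAttentionSketch(nn.Module):
    # T learned position queries attend over the H*W spatial locations,
    # so the (N, E, H, W) map collapses to (N, T, E) without any (N, W, E*H) reshape.
    def __init__(self, channels, max_length):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_length, channels))  # (T, E)

    def forward(self, features):
        n, e, h, w = features.shape
        keys = features.flatten(2)                                # (N, E, H*W)
        scores = torch.einsum('te,nel->ntl', self.queries, keys) / e ** 0.5
        scores = scores.softmax(dim=-1)                           # (N, T, H*W)
        attn_vecs = torch.einsum('ntl,nel->nte', scores, keys)    # (N, T, E)
        return attn_vecs, scores.view(n, -1, h, w)                # (N, T, E), (N, T, H, W)

# Example: (4, 512, 8, 32) feature map -> attn_vecs (4, 26, 512), scores (4, 26, 8, 32)
attn_vecs, attn_scores = PositionAttentionSketch(512, 26)(torch.randn(4, 512, 8, 32))
print(attn_vecs.shape, attn_scores.shape)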

I hope it's clearer now,
Aviad

OK, I will study the code further. Thanks again!

You're welcome :)
I'm closing the issue. If you have additional questions, you can re-open it.
Aviad