The shape of projector's input
YusenZhang826 opened this issue · 5 comments
Hi, I'm a little confused about the forward function in SeqCLR. In `seqclr_proj.py` line 59, the output of the visual backbone is reshaped as `(N, E, H, W) -> (N, H*W, E)`, but in OCR tasks the features are usually processed as `(N, E, H, W) -> (N, W, E*H)`. The explanation in the paper is "Note that the sequence length depends on the width of the input image". So what is the right shape to feed into the projector? Thanks!
Hi,
Thank you for your interest in our work.
The SeqCLR projection is applied on the `working_layer` (see here), which can be `backbone_feature` or `feature`. In the latter case, the shape of the features is indeed `(N, T, E)` (see here). If you work on the `backbone_feature`, then the shape is `(N, E, H, W)` (see here). Therefore, in this case we first reshape it to `(N, H*W, E)` (see here).
To your specific suggestion: the goal of the projection is to linearly transform the features into a different (usually lower-dimensional) subspace. Therefore, we only want to change the number of channels, i.e., E, and to preserve the spatial dimensions H and W.
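The reshape and channel-only projection described above can be sketched as follows. This is a minimal illustration with hypothetical dimensions (`N, E, H, W, E_proj` are made up here), not the repo's actual code:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only (not taken from the repo)
N, E, H, W = 2, 512, 8, 25
E_proj = 128

feat = torch.randn(N, E, H, W)  # backbone_feature

# (N, E, H, W) -> (N, H*W, E): flatten the spatial dims, move channels last
seq = feat.flatten(2).permute(0, 2, 1)  # (N, H*W, E)

# The linear projection mixes only the E channels;
# each of the H*W spatial positions is transformed independently
proj = nn.Linear(E, E_proj)
out = proj(seq)  # (N, H*W, E_proj)

print(out.shape)  # torch.Size([2, 200, 128])
```

Because `nn.Linear` acts on the last dimension, putting E last is exactly what makes the projection preserve the spatial layout.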
Let me know if you have any follow-up questions,
Aviad
Now I understand the goal of the projection. Then I want to know how you handle the output of the visual backbone in the fine-tuning stage (training with labeled data). Is the tensor of shape `(N, E, H, W)` reshaped to `(N, H*W, E)` or to `(N, W, E*H)` before being fed into the CTC or attention decoder? I think the latter, `(N, W, E*H)`, is more common in text recognition tasks, but it would be inconsistent with the `(N, H*W, E)` used in the pre-training process.
Looking forward to your response. Thanks!
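For reference, the two reshapes being compared can be sketched like this (hypothetical dimensions, for illustration only):

```python
import torch

# Hypothetical sizes for illustration only
N, E, H, W = 2, 512, 8, 25
feat = torch.randn(N, E, H, W)

# Reshape used before the SeqCLR projector: (N, E, H, W) -> (N, H*W, E)
# Every spatial position becomes a sequence element; channels stay intact.
a = feat.flatten(2).permute(0, 2, 1)

# Reshape common in CTC-based recognizers: (N, E, H, W) -> (N, W, E*H)
# Each image column becomes a sequence element; height is folded into channels.
b = feat.permute(0, 3, 1, 2).flatten(2)

print(a.shape, b.shape)  # torch.Size([2, 200, 512]) torch.Size([2, 25, 4096])
```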
Hi,
In the vision model, there is a transformer unit which is applied after the backbone. This 2D attention layer operates directly on the feature map of size `(N, T, H, W)` and outputs a tensor of shape `(N, T, E)`.
To answer your question explicitly: we use a 2D attention-based decoder, and therefore we don't need the reshape you mentioned for the supervised fine-tuning.
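A generic sketch of how an attention layer can consume a 2D feature map and emit a `(N, T, E)` sequence directly, with no width-based reshape. The dimensions and the use of `nn.MultiheadAttention` with learned queries are assumptions for illustration; the repo's actual module may differ:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only
N, E, H, W, T = 2, 512, 8, 25, 30

feat = torch.randn(N, E, H, W)

# Flatten the 2D map into H*W keys/values; no (N, W, E*H) reshape is needed
kv = feat.flatten(2).permute(0, 2, 1)  # (N, H*W, E)

# T learned query vectors, one per decoding step, shared across the batch
queries = torch.randn(T, E).expand(N, T, E)

attn = nn.MultiheadAttention(E, num_heads=8, batch_first=True)
out, _ = attn(queries, kv, kv)  # (N, T, E)

print(out.shape)  # torch.Size([2, 30, 512])
```

The key point is that attention pools over all H*W positions at once, so the sequence length T comes from the queries rather than from the image width.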
I hope that it's clearer now,
Aviad
Ok, I will study the code more. Thanks again!
You're welcome :)
I'm closing the issue. If you have additional questions, you can re-open it.
Aviad