huawei-noah/Pretrained-IPT

confused about the function `forward_chop`

Closed this issue · 22 comments

Thank you for providing the wonderful codes.
However, I am confused about the function forward_chop in model/__init__.py.
It seems that this function unfolds the input image into several patches and then feeds those patches into the IPT model, but I didn't find any detailed explanation in the paper or as comments in the code.
For example, what does shave mean here?
If I want to unfold the input image into non-overlapped patches, how should I do it?

Thank you.

In the paper, we mentioned that "During the test, we crop the images in the test set into 48 × 48 patches with a 10 pixels overlap." The reason is that our transformer model can only handle inputs with a fixed shape.

shave means the number of overlapping pixels; you can set shave=0 to unfold the image into non-overlapping patches for testing, but the performance may drop a little.

From the code (lines 153-161):

padsize = int(self.patch_size)
shave = int(self.patch_size/2)
scale = self.scale[self.idx_scale]
h_cut = (h-padsize)%(int(shave/2))
w_cut = (w-padsize)%(int(shave/2))
x_unfold = torch.nn.functional.unfold(x, padsize, stride=int(shave/2)).transpose(0,2).contiguous()
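
To make those numbers concrete, here is a small standalone sketch of the quoted lines (the 1x3x255x255 input is my example, not from the repo):

import torch

# Standalone sketch of the quoted lines, assuming a hypothetical 1x3x255x255 input.
x = torch.randn(1, 3, 255, 255)
h, w = x.shape[-2:]
padsize = 48                       # patch size fed to IPT
shave = int(padsize/2)             # 24
stride = int(shave/2)              # 12-pixel step between neighbouring 48x48 windows
h_cut = (h-padsize) % stride       # leftover rows at the bottom the sliding windows cannot cover
w_cut = (w-padsize) % stride       # leftover columns at the right
x_unfold = torch.nn.functional.unfold(x, padsize, stride=stride).transpose(0,2).contiguous()
print(h_cut, w_cut)                # 3 3
print(x_unfold.shape)              # torch.Size([324, 6912, 1]) -- 324 patches of 3*48*48 values each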

Let's say padsize is 48.
Doesn't that mean shave is 24 and stride is 12?
I am also confused by the relation between shave and stride.

If we set shave to 0, then the stride will also be 0, which is still weird.

Sorry for the confusion. I just found that we uploaded another version of the chop function (its performance is slightly higher).

So in this version, 12 pixels (one quarter of the patch height and width for 48x48 inputs) are dropped from each edge of a patch, and the remaining 24*24 centers are folded back with a 12-pixel overlap.

If you want the original version (i.e., the code that unfolds the input image into non-overlapping patches), we can upload it as an option.
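
If it helps, here is a rough standalone sketch of the overlapping geometry just described (the 96x96 input, scale=1 and identity processing are my assumptions, not the repository code):

import torch
import torch.nn.functional as F

# Rough sketch of the overlapping chop described above, assuming scale=1 and a 1x1x96x96 input.
x = torch.randn(1, 1, 96, 96)
padsize, shave = 48, 24
stride = int(shave/2)                                  # 12-pixel step between 48x48 windows
patches = F.unfold(x, padsize, stride=stride)          # [1, 48*48, 25]
patches = patches.reshape(1, padsize, padsize, -1)
# drop 12 pixels from every edge of each (here: identity-processed) patch, keeping the 24x24 centre
centers = patches[:, shave//2:-(shave//2), shave//2:-(shave//2), :]
centers = centers.reshape(1, (padsize-shave)**2, -1)
# folding the 24x24 centres back with stride 12 still overlaps them by 12 pixels,
# so the folded sum has to be normalized afterwards
y_inter = F.fold(centers, (96-shave, 96-shave), padsize-shave, stride=stride)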

Thank you for the prompt reply.
If I want to modify the current code into a "non-overlapping unfold/fold" version, how should I do that?
If you can upload a code for this version, that would be perfect!

Thank you.

You can find that in

def forward_chop_new(self, x, shave=12, batchsize = 64):

Thank you for sharing!
Another question is:
In lines 193-196 of forward_chop, there is a piece of code like the following:

y_ones = torch.ones(y_inter.shape, dtype=y_inter.dtype)
divisor = torch.nn.functional.fold(
    torch.nn.functional.unfold(y_ones, padsize*scale-shave*scale, stride=int(shave/2*scale)),
    ((h-h_cut-shave)*scale, (w-w_cut-shave)*scale),
    padsize*scale-shave*scale,
    stride=int(shave/2*scale))
y_inter = y_inter/divisor

I checked the official PyTorch documentation; it seems this code is needed because unfold and fold are not inverses of each other (fold sums the values of overlapping patches).
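
For intuition, here is a quick self-contained check (my example sizes, matching the 24x24 kernel and stride-12 case above):

import torch
import torch.nn.functional as F

# fold is not the inverse of unfold: it sums overlapping patches, so
# fold(unfold(ones)) counts how many patches cover each pixel.
ones = torch.ones(1, 1, 72, 72)
counts = F.fold(F.unfold(ones, 24, stride=12), (72, 72), 24, stride=12)
print(counts.unique())   # tensor([1., 2., 4.]) -- each pixel is covered by 1, 2 or 4 patches
# dividing the folded prediction by this map averages the overlapping contributions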

However, in forward_chop_new, there is no such code.
Is there any reason?

Thank you.

In forward_chop, the unfolded patches have overlaps (more precisely, even after we cut the pixels at the edges, the cropped patches still overlap when they are merged, so fold sums them and the divisor is needed). In forward_chop_new, we cut the edge pixels directly and the cropped patches do not overlap.

I see. Thank you for the explanation.

x_unfold = torch.nn.functional.unfold(x, padsize, stride=padsize-shave).transpose(0,2).contiguous()

But why do we set stride=padsize-shave instead of stride=padsize?
Doesn't that mean there are still overlaps between the 48*48 blocks?

Yes, there are overlaps, but we cut the pixels at the edges afterwards:

y_unfold = y_unfold[...,int(shave/2*scale):padsize*scale-int(shave/2*scale),int(shave/2*scale):padsize*scale-int(shave/2*scale)].contiguous()
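
To see concretely why cropping the edges removes the overlap, here is a standalone sketch (the 120x120 input and shave=12 are my example values, not the repository code):

import torch
import torch.nn.functional as F

# forward_chop_new geometry, assuming shave=12 and a 1x1x120x120 input: 48x48 windows
# taken with stride 48-12=36 overlap by 12 pixels, but after cropping 6 pixels from
# every edge, the remaining 36x36 centres tile the interior exactly once.
x = torch.ones(1, 1, 120, 120)
padsize, shave = 48, 12
patches = F.unfold(x, padsize, stride=padsize-shave)
patches = patches.reshape(1, padsize, padsize, -1)
centers = patches[:, shave//2:-(shave//2), shave//2:-(shave//2), :]
centers = centers.reshape(1, (padsize-shave)**2, -1)
tiled = F.fold(centers, (120-shave, 120-shave), padsize-shave, stride=padsize-shave)
print(tiled.unique())   # tensor([1.]) -- every interior pixel is covered exactly once, so no divisor is needed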

what happens if we unfold the image into non-overlapping 48*48 blocks?
If it's doable, how should I modify the codes?
Thank you.

Just set shave=0 in forward_chop_new

cool. Thank you.
What's the drawback if we set shave=0 (e.g., performance drops a lot)?

Yes, the performance will drop a lot (since the pixels at the edge of each patch are predicted poorly). Besides, the transitions between different patches will be sharp and uneven.

Got it. That's clearer.
Thank you so much!

One last question:
It seems that the final output y is the combination of y_inter, y_h_cut, y_w_cut, y_h_top, y_w_top, and y_hw_cut.
However, there are overlaps between them.
How do you deal with the overlaps and then generate the final output y?
Are there any details explained in the paper?

Thank you.

The edge of each patch is cut, so when we put the patches together, the edges of the whole image are also cut off. That is why we compute y_h_cut, y_w_cut, y_h_top, y_w_top, and y_hw_cut and paste them onto the edges of the whole image. Besides, the size of the whole image may not be an integral multiple of the patch size, so there must be some overlap (we handle it in y_h_cut and y_w_cut).

If you set shave=0, it is not necessary to calculate some of them, but you can still use this code since the final output is exactly the same.
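
For illustration, here is a heavily simplified sketch of the pasting step only (my own example sizes, with random tensors standing in for the model outputs; it is not the repository code):

import torch

# Assume scale=1, shave=12 and a 120x120 image whose size is an exact multiple of the
# stride (so h_cut = w_cut = 0). The interior assembled from the 36x36 patch centres
# misses a 6-pixel border, which the separately computed edge strips fill in.
H = W = 120
padsize, shave = 48, 12
y = torch.zeros(1, 3, H, W)
y_inter = torch.randn(1, 3, H-shave, W-shave)   # stand-in for the folded patch centres
y[..., shave//2:H-shave//2, shave//2:W-shave//2] = y_inter
y_h_top = torch.randn(1, 3, padsize, W)         # stand-in for the model output on the top strip
y[..., :padsize, :] = y_h_top                   # overwrite the top rows, as in the real code
# the other strips (y_h_cut, y_w_cut, y_hw_cut, ...) fill the remaining borders and the
# corner in the same way, so every pixel of y, including the cut edges, ends up covered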

I am trying forward_chop_new with shave=0 (i.e., there are exactly no overlaps).
And I thought in this line:

y[...,:padsize*scale,:] = y_h_top

y[...,:padsize*scale,:] and y_h_top should be the same so that lines 286 & 287 are not necessary.
However, I found that y[...,:padsize*scale,:] and y_h_top are slightly different.
(i.e., cropping then feeding to IPT != feeding to IPT then cropping)

Do you know why this happens?

Thank you.

That's strange. Maybe nn.LayerNorm is computed slightly differently when the patches are fed in different batches. I think this will not affect the performance.
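
If useful, a rough way to test whether batching alone explains the difference (this assumes model and x_unfold are in scope as in the snippets above; it is my sketch, not code from the repo):

import torch

# Hypothetical check, assuming `model` is the IPT wrapper and `x_unfold` holds the
# 48x48 patches as above: compare one batched pass against per-patch passes.
with torch.no_grad():
    y_batched = model.forward(x_unfold)
    y_single = torch.cat([model.forward(p.unsqueeze(0)) for p in x_unfold], dim=0)
print((y_batched - y_single).abs().max())   # any non-zero difference means the output depends on how patches are batched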

I also have a question about the inputs that you feed into the Transformer:

y_unfold.append(self.model.forward(x_unfold[i*batchsize:(i+1)*batchsize,...]))

Assume x_unfold has the shape [25, 3, 48, 48] (25 patches, each one has resolution 48x48).
And assume after the head encoder, the input becomes [25, 32, 48, 48].

The part that confuses me is that there is another unfold operation applied before the Transformer encoder:

x = torch.nn.functional.unfold(x,self.patch_dim,stride=self.patch_dim).transpose(1,2).transpose(0,1).contiguous()

Since self.patch_dim=3, this turns x into [(16*16), 25, (32*3*3)], which becomes the input of the multi-head attention.
This means that the sequence length for the attention is 256 here, not 25.
Does that mean the attention mechanism is not used to learn the relation between 25 patches, but learn the relation between tiny patches (resolution: 3x3) inside each patch instead?
If so, this seems different from the original ViT paper, which uses attention to learn the relation between patches inside an image.
Why is this difference here? Is it because IPT is for low-level vision tasks, not classification like ViT?

Thank you.

I think there might be some misunderstanding.

The inputs of our IPT model are exactly 48*48*3 images (the counterpart of the 224*224*3 inputs in ViT). Each 48*48 image is then cropped into 3*3 small patches, giving a sequence length of 16*16.

The forward_chop function is used to handle different input sizes, since our IPT can only take 48*48 images.
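
To double-check the shapes from the discussion (the 25 patches and 32 feature channels are the numbers assumed in the question, not fixed by the model):

import torch
import torch.nn.functional as F

# After the head, a batch of 25 patches with 32 feature channels has shape [25, 32, 48, 48];
# the internal unfold with patch_dim=3 turns each 48x48 feature map into 16*16=256 tokens
# of dimension 32*3*3, so the attention sequence length is 256, not 25.
feat = torch.randn(25, 32, 48, 48)
patch_dim = 3
tokens = F.unfold(feat, patch_dim, stride=patch_dim).transpose(1,2).transpose(0,1).contiguous()
print(tokens.shape)   # torch.Size([256, 25, 288])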

I see. Now it's much clearer.
I also want to check if I understand the paper correctly.
In Section 3.1, H and W are both 48, and P is 3. Is that correct?

Yes

Got it. Thank you so much!!!