Number of ResNet blocks
ThomasRochefortB opened this issue · 3 comments
The paper mentions on page 4 "Tokens belonging to image patches for any time-step are embedded using a single ResNet block".
So we could update the Readme.md and use just a single ResNet block during image embedding.
Also, the ResNet embedding is applied to individual image patches rather than to the whole image, as hinted at in the "Full Episode Sequence" figure (see Figure 15 in the appendix).
Figure 5 shows that they use patches from the complete image for full episode sequences. (They explicitly mark them with commas and ellipses.)
Also, on page 3: "Images are first transformed into sequences of non-overlapping 16 x 16 patches in raster order, as done in ViT ...".
Therefore, we must use the whole image for a single observation sequence.
As you said, however, we can update the code to use a single ResNet block to make 16 x 16 patches.
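The raster-order patching quoted from page 3 can be sketched roughly as follows. This is only an illustration with NumPy, not code from the paper or from this repository; the function name `extract_patches` is hypothetical, and it assumes a channels-last image whose height and width are multiples of the patch size.

```python
import numpy as np


def extract_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches in raster order.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size, patch_size, C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten the patch grid row by row (left to right, top to bottom).
    return patches.reshape(rows * cols, patch_size, patch_size, c)


caption_patches = extract_patches(np.zeros((224, 224, 3), dtype=np.float32))
atari_patches = extract_patches(np.zeros((64, 80, 3), dtype=np.float32))
print(caption_patches.shape)  # (196, 16, 16, 3)
print(atari_patches.shape)    # (20, 16, 16, 3)
```

Each patch would then be passed through the ResNet block on its own, so the whole image is still consumed, just one patch at a time.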
Now I understand what you said
I have reread the Gato and ViT papers to clarify how Gato handles image patches.
As in the Vision Transformer (the non-hybrid variant), the input images must be split into 16x16 patches before being embedded via ResNet.
My understanding of the process is:
- For the image captioning task:
224x224x3 → 196x16x16x3 (patching) → 196x1x1x768 (using a single ResNet block) → 196x768 (reshaping)
- For the Atari task:
64x80x3 → 20x16x16x3 (patching) → 20x1x1x768 (using a single ResNet block) → 20x768 (reshaping)
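The shape arithmetic above can be double-checked with a few lines of plain Python. This is just a sanity check of the numbers in this comment, not an implementation of the ResNet block; the helper name `patch_embedding_shapes` is made up for illustration, and the embedding width of 768 is the value assumed in the shapes above.

```python
def patch_embedding_shapes(height, width, channels=3, patch_size=16, embed_dim=768):
    """Return the tensor shape after each stage of the patch embedding.

    Hypothetical helper: it only tracks shapes and does not run a ResNet block.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    num_patches = (height // patch_size) * (width // patch_size)
    return {
        "input": (height, width, channels),
        "patched": (num_patches, patch_size, patch_size, channels),
        # One ResNet block applied to each patch independently.
        "embedded": (num_patches, 1, 1, embed_dim),
        "reshaped": (num_patches, embed_dim),
    }


print(patch_embedding_shapes(224, 224)["reshaped"])  # (196, 768)
print(patch_embedding_shapes(64, 80)["reshaped"])    # (20, 768)
```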
I'm gonna update the ResNet code