Attention maps fail for resolutions other than 512x512
elhamAm opened this issue · 2 comments
Hello!
Thanks for the great work!
The attention maps do not work when the image resolution is not 512x512, for example 256x256.
The problem is in the process_attn part: the attention map is not of size 256, so its square root is 11, which does not exist as a key in up_attn. Is there a way to make this work for other resolutions, especially non-square ones?
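For context, the failure mode looks roughly like the following — a minimal, hypothetical sketch assuming process_attn infers the spatial side from the flattened token count and uses it to index per-resolution buckets (only process_attn and up_attn are names from the repo; the shapes and numbers here are illustrative):

```python
import math
import torch

# Illustrative: a cross-attention map flattened to (heads, H*W, text_tokens).
attn = torch.randn(8, 128, 77)             # 128 spatial tokens from a non-standard input size
side = int(math.sqrt(attn.shape[1]))       # int(sqrt(128)) == 11
up_attn = {64: [], 32: [], 16: [], 8: []}  # buckets built for the 512x512 case
up_attn[side].append(attn)                 # KeyError: 11 -- no such bucket
```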
Hi, thanks for your interest in our work!
I think it is possible to modify the model to support different resolutions. The attention layers do not actually require the feature map to be square. You would need to zero-pad the feature map before the down-sampling layers so that the H and W of the feature maps are never odd. You would also need to modify process_attn to retrieve the cross-attention maps you want to use in your task-specific decoder; a rough sketch of both changes follows below.
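A sketch under stated assumptions — the padding helper assumes the UNet has a known number of down-sampling stages (so padding H and W to a multiple of 2**stages keeps them even at every stage), and the grouping function replaces the square-root lookup with the true (h, w) recorded when each map is captured. pad_to_multiple, group_attn_by_shape, and the (attn, (h, w)) capture format are hypothetical names, not VPD's actual API:

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x, multiple=8):
    """Zero-pad an image batch (B, C, H, W) on the bottom/right so that
    H and W are divisible by `multiple` (e.g. 2**num_downsample_stages),
    keeping the feature maps even-sized at every down-sampling layer."""
    _, _, h, w = x.shape
    pad_h = (multiple - h % multiple) % multiple
    pad_w = (multiple - w % multiple) % multiple
    return F.pad(x, (0, pad_w, 0, pad_h))  # crop the outputs back later if needed

def group_attn_by_shape(attn_maps):
    """Group flattened cross-attention maps by their true spatial shape
    instead of int(sqrt(num_tokens)), so non-square maps remain valid.
    `attn_maps` is a list of (attn, (h, w)) pairs recorded in the
    attention hooks -- a hypothetical capture format."""
    grouped = {}
    for attn, (h, w) in attn_maps:
        # attn: (heads, h*w, text_tokens) -> (heads, h, w, text_tokens)
        grouped.setdefault((h, w), []).append(
            attn.reshape(attn.shape[0], h, w, attn.shape[-1])
        )
    return grouped
```

The key point is to record (h, w) alongside each map at capture time, since the token count h*w alone is ambiguous once H != W.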
@elhamAm Hello!
Have you resolved the issue of feeding arbitrary image resolutions into VPD? If so, I would be grateful if you could share that part of the code with me!