VicenteVivan/geo-clip

Ten Crop benchmark

Closed this issue · 3 comments

t0tl commented

@VicenteVivan It is mentioned in the paper that a ten-crop method is used for evaluation, where you average your predictions over the 10 cropped images. How do you perform the averaging?

For example, you could average the predicted GPS coordinates or you could average the embeddings before you evaluate. Both methods will give very different results. Thankful for any answer ^^

Hi @t0tl,

Thank you for reaching out. We don't average the GPS coordinates or the embeddings of the multiple crops, but the logits of the images – i.e. the similarity vectors of each image with all the GPS coordinates in the gallery. For example, given a batch of 10 images where each image is a different crop, you would get the prediction as follows:

import torch
from einops import rearrange

# ...

with torch.no_grad():
    # Similarity of every image (crop) with every GPS coordinate in the gallery
    logits_per_image = model(imgs, locations)

if tencrop:
    # Group the 10 crops belonging to each image, then average their logits
    logits_per_image = rearrange(logits_per_image, '(bs crop) classes -> bs crop classes', crop=10)
    logits_per_image = logits_per_image.mean(dim=1)

probs = logits_per_image.softmax(dim=-1)

Please let us know if you have any other questions.

Sincerely,
Vicente

t0tl commented

Thanks for the quick reply! That approach definitely makes the most sense. Would you also mind expanding on the reason for choosing Ten Crop as an evaluation method? Is it a standard within computer vision?

Primarily, we decided to use TenCrop during evaluation because the previous state-of-the-art models had also used it to improve performance (Pramanick et al. (2022) & Clark et al. (2023)). Thus, we included it for consistency and comparability.