Great work! I want to know: when using DINOv2 as the backbone, what are the training and testing image sizes, respectively? And what is the accuracy on the MSLS Challenge? Looking forward to your reply!
STSTERANDMOMO opened this issue
Hello @STSTERANDMOMO,
Thank you for your interest!
- When using DinoV2 as the backbone, we trained with images resized to 280x280 and tested with 322x322 (see the resizing sketch after this list).
- I haven't tested on MSLS Challenge yet, but I'll do it now and share the performance on the repo README.
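For illustration, here is a minimal sketch of how that train/test resizing could be set up with torchvision; the exact preprocessing pipeline used in the repo may differ:

```python
# Hypothetical sketch of the resizing described above (280x280 for training,
# 322x322 for evaluation); the repo's actual augmentation pipeline may differ.
# Both sizes are multiples of the DINOv2 patch size (14).
import torchvision.transforms as T

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_transform = T.Compose([
    T.Resize((280, 280)),   # training resolution with the DINOv2 backbone
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

test_transform = T.Compose([
    T.Resize((322, 322)),   # evaluation resolution
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```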
Hi again @STSTERANDMOMO
The performance on MSLS challenge is as follows:
| | R@1 | R@5 | R@10 | R@20 |
|---|---|---|---|---|
| MSLS-Challenge | 79.0 | 90.3 | 92.0 | 93.7 |
They will soon appear on the leaderboard here: https://codalab.lisn.upsaclay.fr/competitions/865#results
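For anyone reproducing these numbers, here is a hedged sketch of how Recall@N is typically computed in visual place recognition (a query counts as correct at N if any of its top-N retrieved database images is a ground-truth positive); function and variable names are illustrative, not taken from the repo:

```python
# Illustrative Recall@N computation; not the evaluation code from this repo.
import numpy as np

def recall_at_n(predictions, positives, n_values=(1, 5, 10, 20)):
    # predictions: (num_queries, k) database indices, sorted by similarity
    # positives:   list of arrays with ground-truth database indices per query
    correct = np.zeros(len(n_values))
    for q, preds in enumerate(predictions):
        for i, n in enumerate(n_values):
            if np.isin(preds[:n], positives[q]).any():
                correct[i:] += 1  # correct at N implies correct at all larger N
                break
    return {f"R@{n}": 100.0 * c / len(predictions) for n, c in zip(n_values, correct)}
```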
Great results! I would also like to know the performance on Tokyo 24/7. I want to compare your work with the recent CricaVPR and SALAD, focusing on their advantages and disadvantages.
@STSTERANDMOMO
I haven't tested BoQ on the Tokyo dataset yet, but I may do so in the future.
Note: CricaVPR leverages the relationships between images in the batch at test time, which gives it an edge on benchmarks involving sequences. This is not the case for BoQ or SALAD.
However, both CricaVPR and SALAD train their models at a resolution of 224×224, which leads to less training time. This should not be overlooked.
In my opinion, if you opt for a DinoV2 backbone instead of ResNet50, then training time should be the least of your concerns.
One epoch of training DinoV2-BoQ on an RTX 8000 (which is much slower than an RTX 3090) takes about 20 minutes (using 280×280 images), and convergence is achieved before epoch 20. This is because BoQ is composed of highly optimized PyTorch operations (mainly self-attention layers). The architecture of the aggregator plays a role in how much memory and time the entire network consumes.
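To illustrate the kind of attention-based aggregation described above, here is a minimal, hypothetical sketch of a learnable-query attention aggregator built from standard PyTorch layers; it is not the actual BoQ implementation:

```python
# Hypothetical "bag of learnable queries" aggregator built from standard
# PyTorch attention layers. NOT the actual BoQ code from this repo.
import torch
import torch.nn as nn

class LearnableQueryAggregator(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        # A fixed "bag" of learnable query vectors shared across all images.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, dim) features from the backbone (e.g. DINOv2 patch tokens)
        B = patch_tokens.shape[0]
        q = self.queries.expand(B, -1, -1)
        # Each learnable query attends to all patch tokens of the image.
        out, _ = self.cross_attn(q, patch_tokens, patch_tokens)
        out = self.norm(out)
        # Flatten the attended queries and L2-normalize to get a global descriptor.
        desc = out.flatten(1)
        return nn.functional.normalize(desc, p=2, dim=-1)

# Usage: feats = backbone(images)  # (B, N, 768), then desc = aggregator(feats)
```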
Note: BoQ was published at CVPR 2024 at the same time as SALAD and CricaVPR, which is why you don't see a comparison between the three methods in the same paper. Also, we tried training on 320×320 images, but there was no difference in performance. We didn't try 224×224; someone may eventually do so.
Thank you for your reply! I learned a lot from our exchange. In any case, your results are excellent, and I look forward to even more outstanding work from you in the future. Thank you very much!