
Resizing During Training and Eval

vivekvjk opened this issue

Hi! I noticed that the training pipeline for DG on GTA-->Cityscapes trains on 512x512 crops of a downsampled GTA image (1280, 720). However, during evaluation on Cityscapes, you evaluate on 512x512 crops of a downsampled Cityscapes image (1024, 512).

Was this intended? Evaluation should occur at the original Cityscapes image size (2048, 1024).

Thank you for your interest in our work! The question you raised was considered in our experimental setup. Since our model was trained on lower-resolution images, it has not been exposed to high-resolution ones, so testing on high-resolution images is unnecessary. While we did not conduct related experiments, we speculate that testing performance would improve with high-resolution images, as they contain more information than their lower-resolution counterparts. However, we believe this change would have a minor impact on performance, and it is not central to our core contributions. To make evaluation of the model easier, we chose to assess it on low-resolution images of 1024x512.
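For concreteness, here is a minimal sketch of what such a downsampled evaluation pipeline could look like in an MMSegmentation-style config (the framework this codebase builds on); the exact keys and values in the released configs may differ.

```python
# Hypothetical sketch: evaluate Cityscapes downsampled to 1024x512 (width x height),
# matching the low-resolution training distribution. Not the repo's exact config.
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='Resize', scale=(1024, 512), keep_ratio=True),
    dict(type='LoadAnnotations'),
    dict(type='PackSegInputs'),
]
```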

Thank you for the quick reply! When reporting other methods and comparing to yours, do you evaluate them under the same setting (at downsampled size)?

If you're still seeking answers or further clarification, we encourage you to explore our latest checkpoints. To enhance real-world applicability and showcase the capabilities of our approach, we carried out two series of experiments: synthetic-to-real and Cityscapes-to-ACDC. We have made the corresponding checkpoints available for Cityscapes and UrbanSyn+GTAV+Synthia, both of which have demonstrated remarkable results. To ensure peak performance, these configurations were trained and tested at their native resolutions. For usage instructions, refer to the discussion here.

In our paper, the performance metrics for other methods were sourced directly from their original publications. Since the PEFT and DGSS methods mentioned in Tables 2 and 3 were not adapted for VFMs, we replicated them under the configurations previously described. In other words, every metric presented in each table either comes from its original publication or is obtained under configurations that are strictly identical to those used for our method in the same table.

Hi! I have two questions regarding the question above:

  • I would like to ask more about the 512x512 cropping during evaluation. Normally it is not done in other DGSS methods.
  • I tried to run the code with EVA02 and evaluate on Cityscapes images of size 1024x2048, but I got an error that forces the images to be 512x512 during inference as well, which doesn't make sense.

Thank you for your help

  • Why do we use downsampled + cropped images during evaluation?
    Because the model is trained on downsampled + cropped images, we evaluate with the same preprocessing so that the image features seen at evaluation are consistent with those seen during training.

  • Do we apply this configuration to other DGSS methods?
    Performance metrics for all DGSS methods, except in Tables 2 and 3, are sourced from their original publications to highlight Rein's advancements. For Tables 2 and 3, we applied this configuration to other DGSS methods to fairly demonstrate Rein's effectiveness. We are confident that Rein's superiority and effectiveness are not due to this configuration difference.

  • Why do other DGSS methods not use this configuration, while we do?
    Most existing DGSS methods are based on convolutional networks, which have translational invariance and strong scale invariance, making them less sensitive to changes in image size. Our VFMs use a Transformer architecture, which theoretically does not possess these properties, so we adopted the current configuration.

  • Why can't EVA02 use full-size images for validation?
    EVA02's code restricts the input image size, which can be addressed by modifying the code. More fundamentally, the Transformer architecture has a fixed number of positional encodings, so when the image size changes, the image embeddings no longer match the original positional encodings. This can be solved by interpolating the positional encodings; see tools/convert_models (a sketch of this interpolation is shown after this list). Due to the excellent performance of DINOv2, our focus is on it, and we do not plan to support evaluating EVA02 on 1024x2048 resolution images. We would appreciate any support in this regard.

  • How can one validate on 1024x2048 resolution images?
    We have provided validation for DINOv2 on 1024x2048 resolution images. See #4 for reference.
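
Regarding the positional-encoding answer above, here is a minimal, self-contained sketch of the interpolation technique it refers to. The function name, shapes, and grid sizes are illustrative assumptions, not the exact code in tools/convert_models.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid, new_grid, num_extra_tokens=1):
    """Resize ViT positional embeddings from an old patch grid (h, w) to a new one.

    pos_embed: tensor of shape (1, num_extra_tokens + old_h * old_w, dim), where the
    first num_extra_tokens entries are special tokens such as [CLS].
    """
    extra = pos_embed[:, :num_extra_tokens]          # keep special tokens unchanged
    patch_pos = pos_embed[:, num_extra_tokens:]      # (1, old_h * old_w, dim)
    dim = patch_pos.shape[-1]
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    # Reshape to a 2D grid, interpolate bicubically, then flatten back.
    patch_pos = patch_pos.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_h, new_w),
                              mode='bicubic', align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)
    return torch.cat([extra, patch_pos], dim=1)

# Example (hypothetical numbers): adapt embeddings from 512x512 inputs with patch
# size 16 (a 32x32 grid) to 1024x2048 inputs (a 64x128 grid):
# new_pos = interpolate_pos_embed(state_dict['pos_embed'], (32, 32), (64, 128))
```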

Thank you very much for all your answers! This is really helpful.

You said that you provided validation for DINOv2 on 1024x2048 resolution images, but in the link you provided, it's for 1024x1024. Am I missing something, or is it just a typo?

Thank you !

During training, images are resized to 1024x2048 and then cropped to 1024x1024. For validation, a 1024x1024 sliding window is used over the 1024x2048 images. Hence, I refer to it as a 1024x2048 checkpoint.
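
As a rough illustration, that setup might be expressed in an MMSegmentation-style config as follows; the stride value and exact pipeline keys are assumptions rather than the released configuration.

```python
# Hypothetical sketch of the 1024x2048 training/validation setup described above.
crop_size = (1024, 1024)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', scale=(2048, 1024), keep_ratio=True),  # native Cityscapes size (w x h)
    dict(type='RandomCrop', crop_size=crop_size),
    dict(type='PackSegInputs'),
]

# Validation: slide a 1024x1024 window across the full-resolution image.
model = dict(
    test_cfg=dict(mode='slide', crop_size=crop_size, stride=(683, 683)),
)
```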

Thank you very much for your help !