Reproducing mmseg/ade20k results
collinmccarthy opened this issue · 16 comments
Hello,
Looks like the crop size is incorrect in the RADIO ADE20K config.
It's listed as (518, 518) here but it should be (512, 512) like it is in E-RADIO and the ADE20K base config. Using (518, 518) causes the patch projection to fail.
Thanks,
-Collin
Hello, for RADIO the crop size was set to 518 as this was the nearest multiple of the RADIOv1 patch size of 14.
However, now that we have released the patch-size-16 RADIOv2, you're right that we should update the MMSEG crop size to 512.
Yes, shortly after posting that I noticed the RADIOv1 model had a different patch size. Maybe the better approach here is to choose a fixed crop size for all models, e.g. (512, 512), and then use the pad_size_divisor option of the data pre-processor to pad to the nearest multiple of the patch size. I can propose something more specific towards the end of the week.
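Roughly what I have in mind is something like the following (an untested sketch; I'm assuming the pre-processor exposes the divisor as pad_size_divisor, and in some mmseg versions the argument is named size_divisor instead, so the exact name would need checking):

```python
# Sketch only: keep crop_size = (512, 512) in the train pipeline for every model,
# and let the data pre-processor pad each batch up to a multiple of the patch size.
data_preprocessor = dict(
    type='SegDataPreProcessor',
    pad_size_divisor=16,  # assumption: 16 for RADIOv2; would be 14 for patch-14 RADIOv1
    pad_val=0,
    seg_pad_val=255)
```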
Btw, have you seen that we just released RADIOv2.5 ViT-B and L models? If not, check out the tech report. They may very well work better for your use case.
@mranzinger Yes I did see that, very excited to try them out, thanks!
@mranzinger Is there a way to specify what version I want from HF? Specifically, what do I pass as the repo_id for AutoModel.from_pretrained(repo_id) to specify each of the following versions in radio.common.RESOURCE_MAP?

- Version "radio_v2.5-b" -> repo_id="nvidia/RADIO-B"?
- Version "radio_v2.5-l" -> repo_id="nvidia/RADIO-L"?
- Version "radio_v2.1" -> ??
- Version "e-radio_v2" -> ??
Or do I have to use torchhub to specify these versions (and therefore modify radio.mmseg.radio.py to use this API instead)?
Edit: Sounds like I might be able to specify the revision argument, but I'm not sure if that works with how things are set up, or what to specify for the older versions. Basically I'm just trying to set things up so that when updates are made to the HF repos I can keep using the previous models if I want to.
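For example, this is roughly what I'm imagining for pinning a version (untested; the revision value here is just a placeholder, not an actual tag or commit from the repo):

```python
from transformers import AutoModel

# Pin the model to a specific repo revision (branch, tag, or commit hash) so
# later pushes to the HF repo don't change what gets loaded.
model = AutoModel.from_pretrained(
    "nvidia/RADIO-L",
    trust_remote_code=True,
    revision="main",  # placeholder; would be a real commit hash or tag in practice
)
```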
@gheinrich is the expert on the HFHub code. But I believe you want https://huggingface.co/nvidia/RADIO-L, so yes, I think you're right with your repo_id map. For radio_v2.1 -> repo_id="nvidia/RADIO" and e-radio_v2 -> repo_id="nvidia/E-RADIO".
I may be ignorant about all of the features of HFHub, but I will say that it's been easier to deal with model versions and releases using TorchHub.
Okay great, thanks. I'll give this a shot and let you know if it works.
The mapping below appears to be correct (the same one discussed above). It would be nice if this were included alongside radio.common.RESOURCE_MAP for future versions, but it's not a huge deal.
```python
HF_RESOURCE_MAP = {  # Version key in RESOURCE_MAP to HuggingFace repo id
    # RADIOv2.5
    "radio_v2.5-b": "NVIDIA/RADIO-B",
    "radio_v2.5-l": "NVIDIA/RADIO-L",
    # RADIO
    "radio_v2.1": "NVIDIA/RADIO",
    "radio_v2": None,
    "radio_v1": None,
    # E-RADIO
    "e-radio_v2": "NVIDIA/E-RADIO",
}
```
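For reference, I'm using it roughly like this (a sketch of my helper, not something from the repo):

```python
from transformers import AutoModel

def load_radio_hf(version: str):
    """Load a RADIO model from HuggingFace Hub given a RESOURCE_MAP version key."""
    repo_id = HF_RESOURCE_MAP.get(version)
    if repo_id is None:
        raise ValueError(f"No HuggingFace repo known for version {version!r}")
    return AutoModel.from_pretrained(repo_id, trust_remote_code=True)

# e.g. model = load_radio_hf("radio_v2.5-b")
```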
However, it looks like the HF vs. torchhub models for RADIO v2.1 and E-RADIO v2.0 have some slight differences. Not sure if this is expected or not, or if it's a big deal for anyone and worth fixing, but it's failing some tests that I have (as well as test_hf.py) so I thought I'd share.
For RADIO v2.1, the HF model at NVIDIA/RADIO has summary_idxs=tensor([0,2]) while the torchhub model has summary_idxs=tensor([0,1,2,3]). For my application this is fine, but it fails the test_hf.py test because the summary tensor is different. The features output tensor and all other buffers/parameters/submodules are the same.
For E-RADIO v2.0, the HF model at NVIDIA/E-RADIO has summary_idxs=tensor([0,1,2]) while the torchhub model again has summary_idxs=tensor([0,1,2,3]). It also looks like the relative_bias at model.levels.N.blocks.M.attention_blocks.K.attn.pos_emb_funct.relative_bias is slightly different. The values are extremely close, e.g. 6.2875 vs. 6.2899, but I'm not sure why they're different or if it matters. I am struggling to reproduce the E-RADIO ADE20K results, so it could be related to this, but my guess is these are close enough that it doesn't matter too much. I'll try switching to the torchhub version and see if it fixes it.
I'm not sure if these things were this way before the v2.5 push either; I wasn't testing things in this way before.
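The check I'm running is nothing fancy, roughly this (a sketch; the code that loads the two state dicts is omitted):

```python
import torch

def compare_state_dicts(sd_hf: dict, sd_hub: dict, atol: float = 0.0) -> None:
    """Print any buffers/parameters that differ between the HF and torchhub models."""
    for key in sorted(set(sd_hf) | set(sd_hub)):
        if key not in sd_hf or key not in sd_hub:
            print(f"only in one model: {key}")
        elif sd_hf[key].shape != sd_hub[key].shape:
            print(f"shape mismatch: {key}")
        elif not torch.allclose(sd_hf[key].float(), sd_hub[key].float(), atol=atol):
            print(f"value mismatch: {key}")
```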
The summary_idxs tensor is actually correct for the HF model. I'll need to look into why you're getting [0, 1, 2, 3] for TorchHub. I assume you've run force_reload=True recently since you've already fetched the v2.5 models, right? I thought I fixed it, but I suppose I'll need to check again 😅
How are you using the HF_RESOURCE_MAP you're showing? Is it just for reference, or are you using it as part of an API that we should consider upstreaming into main?
For E-RADIO, how different are your ADE20k results versus our reported?
For the torchhub models I'm directly calling torch.hub.load_state_dict_from_url using the URL in the resource map. So any issue with the two torchhub models I mentioned is with the models at those URLs specifically (edit: I also deleted the checkpoint cache and double-checked, same thing). When I used test_hf.py just to see if that was any different (it wasn't, same discrepancies), I manually deleted my ~/.cache/torch/hub/<radio> directory because it wasn't overwriting everything from scratch when just specifying force_reload=True.
I'm using the HF_RESOURCE_MAP as part of a script I wrote to generate keyword arguments for a version of RADIO that doesn't rely on pretrained checkpoints to create the model. The script compares my version to the torchhub and HF versions, and checks for any mismatched buffers/parameters/submodules. Basically I needed (maybe 'wanted' is a better word) to build RADIO models from just a config file with plain data types, and then separately load the weights using just load_state_dict_from_url. Constructing the models this way allows me to create model wrappers or extend the model classes in the easiest and most readable way. This is mostly beneficial when fine-tuning RADIO, especially when trying to integrate it into things like mmdetection, where managing a ton of config files needs to be done carefully and using explicit keyword arguments helps a lot.
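The pattern is roughly this (illustrative sketch; build_radio_from_kwargs is my own helper, not something from the repo, and the exact checkpoint key handling depends on the checkpoint layout):

```python
import torch

def create_radio_from_config(model_kwargs: dict, checkpoint_url: str):
    """Build a RADIO model from plain keyword arguments, then load weights from
    the torchhub checkpoint URL in the resource map."""
    model = build_radio_from_kwargs(**model_kwargs)  # my own constructor wrapper
    ckpt = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print("missing:", missing, "unexpected:", unexpected)
    return model
```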
My best results on ADE20K for E-RADIO were 22.7 mIoU. I don't have 8 GPUs so I'm using 2 GPUs w/ BS=8. I tried forcing sync_bn=torch (best mIoU) and leaving the default sync_bn=None (best aAcc), using both fp32 (works) and AMP fp16 (didn't work at all). In comparison, I got ~51.1 mIoU for RADIO v2.1, and the results were almost identical for fp32 and AMP fp16, with or without sync_bn (which makes sense since I think only the head uses BN and it's already configured as SyncBN). I'm training RADIO v2.5-B and v2.5-L, and it looks like v2.5-L is definitely better than v2.1. Too early to tell on v2.5-B, but it looks very good, and I'm impressed with how close it is given it has fewer than 100M parameters.
I forgot, I made two other changes to the E-RADIO config. Using test_cfg=dict(mode='slide') was timing out in the last few val iterations on GPU 0, leading to some weird issues with broadcasting the tmpdir during eval (similar to this mmengine issue). Using mode='whole' fixed this, but I had to pad the image to multiples of 32. I've attached the config file I had to use (with a dummy .txt extension for GitHub); the only places I've updated are marked with # UPDATED.
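The relevant pieces of the change look roughly like this (a sketch, not the full attached config; I'm assuming the size_divisor option of SegDataPreProcessor handles the padding):

```python
# UPDATED: evaluate on the whole image instead of sliding windows, and pad
# inputs up to a multiple of 32 so the downsampling stages line up.
data_preprocessor = dict(
    type='SegDataPreProcessor',
    size_divisor=32,
    pad_val=0,
    seg_pad_val=255)

model = dict(
    data_preprocessor=data_preprocessor,
    test_cfg=dict(mode='whole'))
```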
Hello Collin, in some earlier experiments I ran, I had found "slide" to work better than "whole".
That alone should not explain the discrepancy you're seeing.
I understand you are instantiating the model and loading the weights manually; is this correct? Can I check with you that the E-RADIO model you are instantiating uses a window size of 16? The window size can have a dramatic impact on mIoU; I found 16 to be optimal. For example, I would see a boost of +9 mIoU going from window size 7 to 16 (without re-training).
If that doesn't help, I suggest looking at inference visualizations; you can enable them with something like this in your config:

```python
default_hooks = dict(
    visualization=dict(type='SegVisualizationHook', draw=True, interval=20))
```
Hi Greg, very interesting, thanks for sharing! I'll have to look into the "slide" implementation in more detail; I just assumed it would crop/stitch the output to give exactly the same results. I'm not sure why it was causing issues for me; after I figured out that "whole" fixed the issue I just moved on.
For all of the results I've discussed so far, I'm using the exact same mmseg setup as in this codebase. All I've changed is the config file I posted above to fix the issue, and I've switched to using 2x GPUs with a batch size of 8. I've attached the config I get from the following:

```python
import pprint
from transformers import AutoModel

hf_model = AutoModel.from_pretrained("NVIDIA/E-RADIO", trust_remote_code=True)
print(pprint.pformat(hf_model.config, sort_dicts=False))
```
It looks like the config has vitdet_window_size: None. On top of that, VitDetArgs is imported in radio.hf_model.py but never used. Also, the torchhub checkpoint at 'https://huggingface.co/nvidia/RADIO/resolve/main/eradio_v2.pth.tar?download=true' does not have vitdet_window_size in checkpoint['args'], so it looks like that needs to be manually passed into hubconf.radio_model() or added afterwards via apply_vitdet_arch().
I'll try training some models with my torchhub-based API while passing in vitdet_window_size=16 and see if that fixes it.
While I'm doing that, is it possible for you or Mike to try to fine-tune E-RADIO and see if it works with the current codebase? If you don't have time, no worries; I'll try to get it working for a bit longer and then use RADIO v2.5-B instead.
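For reference, the TorchHub call I'm planning is roughly this (untested; I'm assuming the radio_model entry point in hubconf forwards vitdet_window_size, per the above):

```python
import torch

# Load E-RADIO via TorchHub and request ViTDet-style windowing explicitly.
model = torch.hub.load(
    'NVlabs/RADIO',
    'radio_model',
    version='e-radio_v2',
    progress=True,
    vitdet_window_size=16,  # assumption: forwarded through to the model builder
)
```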
Sorry, I see what's happening here with vitdet_window_size. ViTDetArgs is only used for the normal ViT models, not E-RADIO, and the window size is already set to 16 during the timm model registration, e.g. with create_model(model='eradio').
One more hint as to what might be going on. When calling _, features = self.base_model(x) here, the features returned always have shape B, N, C and need to be reshaped. I used self.patch_size, which is 16 for E-RADIO, and this produced the same feature vector as directly calling ERADIO.forward_features(). But given that I had to add this code, maybe something else changed in that forward_features() implementation, like I shouldn't actually be taking the norm, or I shouldn't be using the high-res neck, or something.
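The reshape I added is roughly the following (paraphrased from my change; the function name and variable names are mine):

```python
import torch

def tokens_to_feature_map(features: torch.Tensor, x: torch.Tensor,
                          patch_size: int = 16) -> torch.Tensor:
    """Reshape (B, N, C) token features into a (B, C, H/p, W/p) feature map."""
    B, _, H, W = x.shape
    h, w = H // patch_size, W // patch_size
    return features.permute(0, 2, 1).reshape(B, -1, h, w)
```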
Edit: The issue only appears during inference/validation. The training loss and decode.acc_seg seem to be very close to RADIO v2.1 and v2.5.
So I'm not sure what the issue is with E-RADIO and mmseg; sorry I can't narrow it down more. I'll stick with RADIO v2.5-B as my "efficient" model for now, and if you want me to submit a PR just so you can see the changes I've made so far to try to get it working, I'm happy to do that. Thanks.
PRs are always welcome. The ViT-B model indeed seems to be really high quality, so I hope you can make some progress with it. I'm also generally interested in what you're up to, so if you're willing, shoot me and/or Greg an email with the details.
Sounds great, I'll email you and Greg to sync up on how I'm using this and we can discuss the PR then. Thanks!