fcdl94/WILSON

pretrained weights

Andylijinpan opened this issue · 8 comments

I have downloaded the [ResNet101v1, InPlace-ABN sync] weights, which are named resnet101.pth.tar.
But the file behind the 'Download' link in the README is named resnext101_ipabn_lr_512.pth.tar.
And the code needs resnet101_iabn_sync.pth.
So, which one is correct?

Hello, sorry for the error.
We used resnet101_iabn_sync.pth, there is an error in the link in the Download button.

Hello, I have a question about the experimental results.

I ran the 15-5 task. After learning the 5 new classes at step 1, the mIoU of classes 16-20 is 33.9%.
I want to know if there is a problem with my setup. Besides setting --weakly to True when step > 0, is there anything else that needs to be modified in your code?

Lastly, how can I evaluate the performance of the model? After running step 1, I found that the code did not evaluate all the classes (step 0 and step 1).

Looking forward to your reply, thanks.

fcdl94 commented

Dear @Liuhao-128,

It's been a while since I last ran the code, but I'm pretty sure it evaluates on all the classes after step 1.

It is strange that you don't get good performance. Can you please provide the script you're using for training?
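For anyone who wants to double-check the numbers independently, mIoU over the old classes, the new classes, and all classes can be derived from a single confusion matrix. The following is a minimal sketch in plain Python, not the evaluation code of this repository: the function names, the toy 4-class data, and the ignore label 255 (the usual VOC convention) are assumptions made for illustration.

```python
# Hypothetical helper functions, not taken from the WILSON code base.

def confusion_matrix(preds, targets, num_classes, ignore_index=255):
    """Count (target, prediction) pairs over flattened pixel labels."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(preds, targets):
        if t == ignore_index:  # assumed "void" label, skipped in the metric
            continue
        cm[t][p] += 1
    return cm

def per_class_iou(cm):
    """IoU_c = TP / (TP + FP + FN); None for classes that never appear."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else None)
    return ious

def mean_iou(ious, classes):
    """Average IoU over a subset of classes, ignoring absent ones."""
    vals = [ious[c] for c in classes if ious[c] is not None]
    return sum(vals) / len(vals) if vals else float("nan")

# Toy example: class 0 is background, classes 1-2 are "old", class 3 is "new".
targets = [0, 0, 1, 1, 2, 2, 3, 3, 255]
preds   = [0, 0, 1, 2, 2, 2, 3, 0, 1]

ious = per_class_iou(confusion_matrix(preds, targets, num_classes=4))
old_miou = mean_iou(ious, [1, 2])
new_miou = mean_iou(ious, [3])
all_miou = mean_iou(ious, range(4))
print(f"old: {old_miou:.3f}  new: {new_miou:.3f}  all: {all_miou:.3f}")
```

For the 15-5 setting discussed above, the same idea would use 21 classes, with classes 1-15 as the old group and 16-20 as the new group.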

I have a similar problem. When I ran the 10-10 task, the mIoU was 57% at step 0 and 55% at step 1. My scripts are the following:

python -u -m torch.distributed.launch --nproc_per_node=1 run.py --num_workers 1  --name Base --step 0 --lr 0.005 --bce --dataset voc --task 10-10 --batch_size 12 --epochs 30 --val_interval 2

and

python -u -m torch.distributed.launch --nproc_per_node=1 run.py --num_workers 2  --name 10-10_disjoint --step 1 --weakly --lr 0.0005 --alpha 0.5 --step_ckpt checkpoints/step/voc-10-10/Base_0.pth --loss_de 1 --lr_policy warmup --affinity --dataset voc --task 10-10 --batch_size 12 --epochs 40

I also tried increasing the lr, but the learning process at step 0 always seems to stop at 57% and is hard to improve.

fcdl94 commented

Hey @adaxidedakaonang !

My first thought is regarding the batch size. I was using 24, while in your commands you're using 12, without changing learning rate or number of epochs. If you have enough resources, you may try with BS=24, otherwise, you may use BS=12 doubling the number of epochs and halving the learning rate (however, I don't guarantee it'll replicate my results, that were run with a different config).
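The adjustment described above (halve the learning rate and double the epochs when the batch size is halved) amounts to scaling the learning rate linearly with the batch size and the epoch count inversely. A small sketch of that arithmetic follows; the helper name is hypothetical and the reference values are only illustrative, so they should be replaced with the actual reference configuration.

```python
def rescale_for_batch_size(ref_lr, ref_epochs, ref_bs, new_bs):
    """Scale lr linearly with batch size and epochs inversely.

    Hypothetical helper: a rule of thumb, not a guarantee of
    reproducing results obtained with the reference batch size.
    """
    ratio = new_bs / ref_bs
    return ref_lr * ratio, round(ref_epochs / ratio)

# Illustrative reference values (assumed, not confirmed by the paper config).
lr, epochs = rescale_for_batch_size(ref_lr=0.001, ref_epochs=40, ref_bs=24, new_bs=12)
print(f"--lr {lr} --epochs {epochs} --batch_size 12")
```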

Hope it helps.

For now I have only halved the lr; next I will try doubling the number of epochs.

Sorry to bother you again, but do you have the trained model for step 0 of the 10-10 task? I used BS=24, and the offline model gave me an mIoU of 76.6%, which is good enough, but for the 10-10 task I got 57% and 55% respectively. Could anyone upload a trained step 0 model? Thank you very much.

fcdl94 commented

Looking at the previous scripts, it seems that you are running the job with lr=0.0005, while I used 0.001, as you can see here.

I'm really sorry, but I don't have the checkpoints: I changed positions and no longer have access to the workstations used for this paper.