qinenergy/cotta

about the choice of p_th

Closed this issue · 9 comments

Awesome work, but a little question:

Therefore, we use a threshold $p_{th}$ to filter the images, and do not apply augmentations on those with high confidence. More specifically, we design $p_{th} = conf^S-\delta$, where $conf^S$ is the 5% quantile for the softmax predictions’ confidence on the source images from the source model $f_\theta$

In your code, 0.74 is the 5% quantile for Cityscapes. So, how is this value calculated?

Kind regards.

Hi, as mentioned in the supplementary material, it is computed from the source pre-trained (no adaptation) model's confidence on the source dataset. You feed in the source images and get a set of confidence values. You sort them and take the 5% quantile $s$. Finally, $p_{th}$ is defined as $s - 0.05$ to provide a further buffer and avoid augmenting predictions that are already confident.

You will find that the number can differ quite a lot depending on the dataset and the way the source network is trained. You can find the numbers we pre-calculated for different datasets in the config files under OPTIM: AP. For example, in the Cityscapes example above, we use $p_{th} = 0.74 - 0.05 = 0.69$.
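For concreteness, here is a rough sketch of that computation (not code from this repository; source_model and source_loader are placeholder names for the pre-trained source model and a data loader over source images):

import torch
import torch.nn.functional as F

confidences = []
with torch.no_grad():
    for images, _ in source_loader:                    # source images, model not adapted
        logits = source_model(images.cuda())           # e.g. (N, C, H, W) for segmentation
        conf = F.softmax(logits, dim=1).max(dim=1)[0]  # max softmax probability per pixel
        confidences.append(conf.flatten().cpu())

confidences, _ = torch.sort(torch.cat(confidences))
conf_s = confidences[int(0.05 * len(confidences))]     # 5% quantile, e.g. ~0.74 on Cityscapes
p_th = conf_s - 0.05                                   # subtract the buffer, e.g. 0.69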

Very clear answer.
Best.

And a little question about the Stochastic Restoration in the code:

# Stochastic restoration: with probability 0.01, restore each weight/bias
# element to its source (anchor) value.
for nm, m in model.named_modules():
    for npp, p in m.named_parameters():
        if npp in ['weight', 'bias'] and p.requires_grad:
            mask = (torch.rand(p.shape) < 0.01).float().cuda()
            with torch.no_grad():
                p.data = anchor[f"{nm}.{npp}"] * mask + p * (1. - mask)

The file path is mmseg/apis/test.py.

So, can I write it as:

for npp, p in model.named_parameters():
    if ('bias' in npp or 'weight' in npp) and p.requires_grad:
        mask = (torch.rand(p.shape) < 0.01).float().cuda()
        with torch.no_grad():
            p.data = anchor[npp] * mask + p * (1. - mask)

This seems to avoid a lot of unnecessary looping (and please excuse the messy indentation).

Yes, you can. I think they are equivalent.
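If you want to double-check the equivalence yourself, here is a hypothetical sanity check (assuming every parameter's leaf name is exactly 'weight' or 'bias' and none live directly on the root module) comparing the parameter names the two loops select:

# names touched by the original nested loop
nested = {f"{nm}.{npp}" for nm, m in model.named_modules()
          for npp, p in m.named_parameters()
          if npp in ['weight', 'bias'] and p.requires_grad}

# names touched by the flattened loop
flat = {npp for npp, p in model.named_parameters()
        if ('bias' in npp or 'weight' in npp) and p.requires_grad}

assert nested == flat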

Thanks for your reply.

Sorry, here I go again.
When I extended your method, I found that the prediction used for calculating mIoU comes from the teacher model (which is not updated by gradients), not from the updated student model. Have you tried computing the final mIoU with the updated student model?

The teacher model can be seen as a slowly-updated average of the student model. Therefore, with the same hyper-parameters, using the student model instead could give lower performance in the long term than using the teacher model. However, you may be able to overcome this by adjusting the learning rate or the forgetting rate.
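For reference, the teacher in a mean-teacher style setup is an exponential moving average of the student, roughly as sketched below (the momentum value is illustrative, not necessarily the one used in this repository):

import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.999):
    # teacher <- momentum * teacher + (1 - momentum) * student
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.data.mul_(momentum).add_(s_p.data, alpha=1.0 - momentum)

Because the teacher averages over many student updates, its predictions fluctuate less from step to step.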

But in the standard mean-teacher framework and its variants, the network used to evaluate test performance is the student network. This seems like an interesting difference.

Yes, this design choice keeps the model changing more smoothly in the case of (1) sudden domain changes and (2) long-term adaptation.