qinenergy/cotta

about the choice of p_th

Closed this issue · 9 comments

Awesome work, but a little question:

Therefore, we use a threshold $p_{th}$ to filter the images, and do not apply augmentations on those with high confidence. More specifically, we design $p_{th} = conf^S-\delta$, where $conf^S$ is the 5% quantile for the softmax predictions’ confidence on the source images from the source model $f_\theta$

In your code, 0.74 is the 5% quantile for Cityscapes. So, how is this value calculated?

Kind regards.

Hi, as mentioned in the supplementary material, it is computed from the source pre-trained (no adaptation) model's confidence on the source dataset. You feed in the source images and get a set of confidence values. You sort them and take the 5% quantile $s$. Finally, $p_{th}$ is defined as $s - 0.05$ to provide a further buffer and avoid augmenting predictions that are already confident.

You will find that the number can differ quite a lot depending on the dataset and the way the source network is trained. You can find the numbers we pre-calculated for different datasets in the config files under OPTIM: AP. For example, in the Cityscapes example above, we use $p_{th} = 0.74 - 0.05 = 0.69$.
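For concreteness, here is a rough sketch of that computation (not code from this repository; source_model and source_loader are placeholder names for the pre-trained source model and a data loader over source images):

import torch
import torch.nn.functional as F

confidences = []
with torch.no_grad():
    for images, _ in source_loader:                    # source images, model not adapted
        logits = source_model(images.cuda())           # e.g. (N, C, H, W) for segmentation
        conf = F.softmax(logits, dim=1).max(dim=1)[0]  # max softmax probability per pixel
        confidences.append(conf.flatten().cpu())

confidences, _ = torch.sort(torch.cat(confidences))
conf_s = confidences[int(0.05 * len(confidences))]     # 5% quantile, e.g. ~0.74 on Cityscapes
p_th = conf_s - 0.05                                   # subtract the buffer, e.g. 0.69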

Very clear answer.
Best.

And a little question about the Stochastic Restoration in the code:

# Stochastic restoration: with probability 0.01, restore each weight/bias
# element to its source (anchor) value.
for nm, m in model.named_modules():
    for npp, p in m.named_parameters():
        if npp in ['weight', 'bias'] and p.requires_grad:
            mask = (torch.rand(p.shape) < 0.01).float().cuda()
            with torch.no_grad():
                p.data = anchor[f"{nm}.{npp}"] * mask + p * (1. - mask)

The file path is mmseg/apis/test.py.

So, can I write it as:

for npp, p in model.named_parameters():
    if ('bias' in npp or 'weight' in npp) and p.requires_grad:
        mask = (torch.rand(p.shape) < 0.01).float().cuda()
        with torch.no_grad():
            p.data = anchor[npp] * mask + p * (1. - mask)

This seems to avoid a lot of unnecessary looping (and please excuse the messy indentation).

Yes, you can. I think they are equivalent.
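If you want to double-check the equivalence yourself, here is a hypothetical sanity check (assuming every parameter's leaf name is exactly 'weight' or 'bias' and none live directly on the root module) comparing the parameter names the two loops select:

# names touched by the original nested loop
nested = {f"{nm}.{npp}" for nm, m in model.named_modules()
          for npp, p in m.named_parameters()
          if npp in ['weight', 'bias'] and p.requires_grad}

# names touched by the flattened loop
flat = {npp for npp, p in model.named_parameters()
        if ('bias' in npp or 'weight' in npp) and p.requires_grad}

assert nested == flat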

Thanks for your reply.

Sorry, here I go again.
When I extended your method, I found that the prediction used for calculating mIoU comes from the teacher model (which is not updated by gradients), not from the updated student model. Have you tried computing the final mIoU with the updated student model?

The teacher model can be seen as a slowly-updated average of the student model. Therefore, with the same hyper-parameters, using the student model instead could give lower performance in the long term than using the teacher model. However, you may be able to overcome this by adjusting the learning rate or the forgetting rate.
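For reference, the teacher in a mean-teacher style setup is an exponential moving average of the student, roughly as sketched below (the momentum value is illustrative, not necessarily the one used in this repository):

import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.999):
    # teacher <- momentum * teacher + (1 - momentum) * student
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.data.mul_(momentum).add_(s_p.data, alpha=1.0 - momentum)

Because the teacher averages over many student updates, its predictions fluctuate less from step to step.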

But in the standard mean-teacher framework and its variants, the network used to evaluate test performance is the student network. This seems like an interesting difference.

Yes, this design choice keeps the model changing more smoothly in the case of (1) sudden domain changes and (2) long-term adaptation.