In DST, why are the confidence thresholds different among baselines?
machengcheng2016 opened this issue · 4 comments
Greetings!
I've been studying your wonderful work DST recently. I noticed that you set a different confidence threshold for each baseline method, which seems like an unfair experimental setting. I wonder why?
Thanks!
Hello, this is because we find that most baseline methods are sensitive to the choice of confidence threshold. For example, with the threshold set to 0.7, FixMatch fails to improve over labeled-only training on several datasets due to the error accumulation of pseudo-labeling. So for each baseline method on each dataset, we search for the optimal threshold to enable a fair comparison. In my opinion, this is like comparing ViT with ResNet fairly: the learning rates can differ because the optimal choice differs between the two.
Besides, we find that DST is less sensitive to this choice. Hope this answers your question.
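For context, the threshold enters FixMatch-style training only through the masking of the unsupervised loss, which is why its value matters so much. A minimal PyTorch sketch of that step (variable names are illustrative, not taken from this repo):

```python
import torch
import torch.nn.functional as F

def unsupervised_loss(logits_weak, logits_strong, threshold=0.95):
    """FixMatch-style consistency loss with confidence masking (illustrative sketch)."""
    with torch.no_grad():
        probs = torch.softmax(logits_weak, dim=-1)      # predictions on the weakly augmented view
        max_probs, pseudo_labels = probs.max(dim=-1)    # confidence and hard pseudo-label per sample
        mask = (max_probs >= threshold).float()         # keep only confident samples

    # Cross-entropy on the strongly augmented view, masked by confidence.
    # A lower threshold keeps more (possibly wrong) pseudo-labels; a higher one
    # keeps fewer but cleaner ones -- hence the sensitivity discussed above.
    loss = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
    return (loss * mask).mean()

# Example: batch of 8 unlabeled samples, 10 classes
loss = unsupervised_loss(torch.randn(8, 10), torch.randn(8, 10), threshold=0.8)
```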
Thanks for your reply!
On CIFAR-10, with a supervised pre-trained model, I got the following performance for FixMatch:
| Threshold | Accuracy (%) |
| --- | --- |
| 0.7 | 74.7 |
| 0.8 | 84.7 |
| 0.9 | 74.7 |
| 0.95 | 66.7 |
I am sure that I only changed the threshold across these four runs. I wonder why 0.8 leads to about 10% higher accuracy.
I understand your reply that DST is less sensitive to the threshold setting, since the backbone weights are identically initialized. But do you think the initialization of the last few layers (the classification layer) can make such a difference in performance?
Sorry for the late reply. The confidence threshold might be the most important hyperparameter for SSL methods. Taking your example here, selecting the optimal value (0.8, 84.7%) versus a suboptimal one (0.95, 66.7%) can have a great impact on the final performance, especially when only a few labeled samples are available (40 for CIFAR-10). As for your last question, this phenomenon is in line with our finding that the classifier head (the last few layers) is likely to accumulate pseudo-labeling errors. Through the backward pass, those errors can propagate to the backbone parameters, which gradually leads to a large difference in performance.
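To make the last point concrete, here is a toy sketch (again not code from this repo) showing that the gradient from a confident-but-wrong pseudo-label updates both the classifier head and, through backpropagation, the shared backbone, which is how the error can accumulate over many steps:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy model with purely illustrative dimensions: shared backbone + classifier head.
backbone = nn.Linear(32, 16)
head = nn.Linear(16, 10)

x_unlabeled = torch.randn(4, 32)
logits = head(backbone(x_unlabeled))

# Suppose these samples passed the confidence threshold but their pseudo-labels are wrong.
wrong_pseudo_labels = torch.randint(0, 10, (4,))
loss = F.cross_entropy(logits, wrong_pseudo_labels)
loss.backward()

# The gradient of the erroneous pseudo-label loss reaches BOTH parameter groups,
# so the error is not confined to the classifier head.
print(head.weight.grad.abs().mean(), backbone.weight.grad.abs().mean())
```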