shengliu66/SOP

About the hyperparameters u and v


Dear Liu,

In Algorithm 1, it is written that u is optimized with CE as its loss function and v with MSE. But in the code, u is optimized not only with CE but also with MSE. Is there any conflict between the two?
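To make my question concrete, here is my simplified reading of Algorithm 1 as code (a sketch only, not your actual code; `f_x`, `y_onehot`, `u`, `v` stand for the model's softmax output, the one-hot labels, and the per-sample parameters):

```python
import torch
import torch.nn.functional as F

# My simplified reading of Algorithm 1 (a sketch, not the repo's actual code):
# u is updated only through a CE term, v only through an MSE term.
def u_loss(f_x, y_onehot, u, v):
    pred = (f_x + u**2 * y_onehot - (v**2 * (1 - y_onehot)).detach()).clamp(min=1e-6)
    return -(y_onehot * torch.log(pred)).sum(dim=1).mean()  # CE: gradient reaches u (and theta)

def v_loss(f_x, y_onehot, u, v):
    pred = f_x.detach() + (u**2 * y_onehot).detach() - v**2 * (1 - y_onehot)
    return F.mse_loss(pred, y_onehot)  # MSE: gradient reaches v only
```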

In addition, the appendix says that because the gradient with respect to v does not depend on the model output f(x, θ), v cannot correctly learn the label noise. Similarly, the gradient of u does not depend on the model output either, so why can u be optimized with CE? What is the difference between u and v?

Looking forward to your reply!

Hi,

There is no conflict between the two, because they optimize the same objective, just with different losses. In practice, we later found that optimizing u and v with CE and MSE together also works well, so we updated the code, which is now slightly different from Algorithm 1.
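For illustration, here is a minimal sketch of the combined version (not the exact repository code; the shapes, the clamping value `eps`, and the normalization details are assumptions):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the combined objective (illustrative; not the exact repo code).
# logits: (B, K) model outputs; y_onehot: (B, K) noisy one-hot labels;
# u, v: (B, K) per-sample noise parameters.
def combined_loss(logits, y_onehot, u, v, eps=1e-6):
    u_sq = u**2 * y_onehot          # non-zero only at the labeled class
    v_sq = v**2 * (1 - y_onehot)    # non-zero only at the other classes

    pred = F.softmax(logits, dim=1) + u_sq - v_sq
    pred = F.normalize(pred.clamp(min=eps), p=1, dim=1).clamp(min=eps, max=1.0)

    ce = -(y_onehot * torch.log(pred)).sum(dim=1).mean()
    mse = F.mse_loss(F.softmax(logits, dim=1) + u_sq - v_sq, y_onehot)
    return ce + mse  # both u and v now receive gradients from both terms
```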

The original reason for using MSE to optimize v is that, as we mention in the paper, the gradient with respect to v under CE does not depend on the output of the model, so CE gives v no useful signal. More specifically, the roles of u and v are quite different: u has a non-zero entry at the (possibly wrong) labeled class, and v should have a non-zero entry at the true class. This is related to your second question: why can u use CE?
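As a toy example with made-up numbers: suppose a sample's true class is 2 but it was labeled as class 0. Then u should grow at entry 0 and v at entry 2, so that f(x) + s reproduces the noisy label while f(x) itself is free to fit the true class:

```python
import torch

y = torch.tensor([1., 0., 0., 0.])   # given (wrong) one-hot label: class 0
u = torch.tensor([0.9, 0., 0., 0.])  # u^2 * y peaks at the wrong class 0
v = torch.tensor([0., 0., 0.8, 0.])  # v^2 * (1 - y) peaks at the true class 2
s = u**2 * y - v**2 * (1 - y)        # learned noise term
print(s)  # tensor([ 0.8100,  0.0000, -0.6400,  0.0000])
```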

This is because, if you calculate the gradient of u, it is masked by the one-hot label y, which has only one non-zero entry, so there is no need to know the output of the model to simply increase the value at that entry. But for v, the gradient, as shown in eq. A.2, is masked by 1 - y, which has many non-zero entries, so the output of the model is needed to decide which of them to adjust.
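You can check this masking with autograd on a simplified objective (no renormalization of the prediction, so this only illustrates the idea behind eq. A.2, it is not the equation itself):

```python
import torch

K = 4
y = torch.tensor([0., 1., 0., 0.])                               # one-hot noisy label
f_x = torch.softmax(torch.tensor([0.5, -0.2, 1.0, 0.1]), dim=0)  # stand-in model output
u = torch.full((K,), 0.3, requires_grad=True)
v = torch.full((K,), 0.3, requires_grad=True)

s = u**2 * y - v**2 * (1 - y)
p = (f_x + s).clamp(min=1e-6)

ce = -(y * torch.log(p)).sum()                   # CE reads only the labeled entry
gu_ce, gv_ce = torch.autograd.grad(ce, (u, v), retain_graph=True)
print(gu_ce)  # non-zero only at the labeled entry (mask y)
print(gv_ce)  # all zeros: plain CE gives v no signal

mse = ((f_x + s - y)**2).sum()                   # MSE reads every entry
gu_mse, gv_mse = torch.autograd.grad(mse, (u, v))
print(gv_mse)  # non-zero on the 1 - y entries, and its value depends on f_x
```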

Hope this answers your question.

Sheng