Vibashan/irg-sfda

The Teacher Network Update

kinredon opened this issue · 11 comments

I noticed that the weights of the teacher network are updated every epoch. Usually, the teacher model is updated every iteration. Why did the authors choose this strategy?

new_teacher_dict = update_teacher_model(model_student, model_teacher, keep_rate=0.9)

Also, the paper states that the keep_rate for the teacher update is set to 0.99, but the code here sets it to 0.9.

Hi @kinredon, as this is a source-free setup, we have no access to source data; we only have access to a source-trained model. To learn target-specific representations, we need to train the model on the unlabelled target domain using pseudo-labels. However, due to domain shift, the generated pseudo-labels are noisy, and self-training on noisy pseudo-labels leads to catastrophic forgetting. Hence we opted for the student-teacher framework. In our initial experiments, we updated the teacher model every iteration; however, because the pseudo-labels are so noisy, the student model easily overfits to the noise.
Further, due to the EMA update, the noise from the student network is transferred to the teacher network at every iteration. Moreover, there is no supervision for the teacher network, as we have no access to any labelled data. Thus, after a few iterations, more noise is transferred to the teacher network and performance drops below source-only performance on some datasets. To avoid this, we observed experimentally that updating the teacher once per epoch works best. This is because no noise is transferred from the student network during an epoch, and in the meantime the student network learns a robust target representation.
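For readers following along, the difference between the two update schedules can be sketched in plain Python. This is only an illustration under stated assumptions: `ema_update` and `simulate` are hypothetical names, the "model" is a single scalar weight, and the `+= 1.0` step stands in for a gradient update. It is not the repository's `update_teacher_model`.

```python
def ema_update(student, teacher, keep_rate):
    """Blend weights: keep_rate * teacher + (1 - keep_rate) * student."""
    return {
        name: keep_rate * teacher[name] + (1.0 - keep_rate) * student[name]
        for name in teacher
    }


def simulate(num_epochs, iters_per_epoch, keep_rate, per_iteration):
    """Toy training loop with a one-parameter student and teacher (hypothetical)."""
    student = {"w": 0.0}
    teacher = {"w": 0.0}
    for _ in range(num_epochs):
        for _ in range(iters_per_epoch):
            student["w"] += 1.0  # stand-in for one (possibly noisy) gradient step
            if per_iteration:
                # Every student step leaks into the teacher immediately.
                teacher = ema_update(student, teacher, keep_rate)
        if not per_iteration:
            # The teacher is frozen during the epoch and updated once at its end.
            teacher = ema_update(student, teacher, keep_rate)
    return teacher["w"]


per_iter = simulate(num_epochs=1, iters_per_epoch=3, keep_rate=0.9, per_iteration=True)
per_epoch = simulate(num_epochs=1, iters_per_epoch=3, keep_rate=0.9, per_iteration=False)
print(per_iter, per_epoch)
```

With the per-epoch schedule the teacher that generates pseudo-labels stays fixed for the whole epoch, which is the behaviour described in the reply above.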

Thanks for pointing out the typo; I will update it.

@Vibashan Thanks for your quick response. All my questions have been addressed. Thanks again.

Thanks @kinredon, if you have any more concerns, please feel free to contact me.

Thanks.

@Vibashan I have another question about contrastive loss, which is implemented here:

ss_loss = - (self.temperature / self.base_temperature) * ss_mean_log_prob_pos

I carefully read the code and the statement in the paper, and I find that the implementation differs from the paper. Eq. (8) in the paper shows that the denominator is a sum over A(i), which excludes i.

Yes, it is A(i), including proposal i.
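To make the two readings of the denominator concrete, here is a minimal sketch in plain Python. It is an assumption-laden illustration, not the repository's loss: `log_prob_pos` is a hypothetical helper, and `logits` is a made-up list of temperature-scaled similarities between one anchor and all proposals (the anchor itself is at index 0).

```python
import math


def log_prob_pos(logits, anchor, positive, include_anchor):
    """Log-probability of one positive, with the anchor kept in or dropped from the denominator."""
    denom = sum(
        math.exp(value)
        for j, value in enumerate(logits)
        if include_anchor or j != anchor
    )
    return logits[positive] - math.log(denom)


# Hypothetical similarities of anchor 0 to proposals 0 (itself), 1 and 2.
logits = [1.0, 0.5, -0.5]

excluding_i = log_prob_pos(logits, anchor=0, positive=1, include_anchor=False)
including_i = log_prob_pos(logits, anchor=0, positive=1, include_anchor=True)
print(excluding_i, including_i)
```

Including the anchor adds a positive term to the denominator, so the log-probability of every positive is lower than in the variant that excludes it.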

Thanks a lot.

@Vibashan I was also confused by the construction of the graph.

adj = F.normalize(dot_mat.square(), p=1, dim=-1)

Why is adj the L1-normalized dot(kv).square? Does it have some advantage over Eq. (5) in the paper?

Our motivation is to use the graph network to understand the relationships between proposals: for a given proposal, we need to find its positive/similar proposals. Therefore, we need to model the relationship between positive pairs. In order to achieve this, we use the L1 norm, which provides sparsity while constructing the graph. In other words, the sparsifying property of the L1 norm prunes out the relationships of non-correlated/negative proposals and focuses more on positive proposals as training proceeds.

@Vibashan Yes, the L1 norm provides sparsity, but dot_mat.square() will destroy the relationship. For example, if proposals i and j have similarity -2 and proposals k and v have similarity 2, their adj values become the same, i.e., 4, after squaring.

Hi @kinredon , I am not able to understand your question. Can you please explain it a bit more?

@Vibashan The dot_mat is the matmul of the qx and kx.

dot_mat = qx.matmul(kx.transpose(-1, -2))

The values of dot_mat represent the relationship (similarity) between proposals, but dot_mat.square() destroys this relationship. Suppose there are vectors v1=[-1, -1], v2=[1, 1], v3=[1, 1]; the dot_mat values for v1 & v3 and v2 & v3 are -2 and 2, respectively. Obviously, v2 and v3 are very similar and have a strong relationship, while v1 and v3 are totally different. However, dot_mat.square() makes the similarity of both v1 & v3 and v2 & v3 equal to 4, which destroys the relationship. I hope this clarifies my question.
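The arithmetic in this example can be checked with a short plain-Python sketch. It makes simplifying assumptions: the raw vectors are used directly instead of the projected qx and kx, and `dot` and `l1_normalize` are hypothetical helpers that mimic a dot product and L1 normalization of one row (dividing by the sum of absolute values).

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l1_normalize(row):
    """Divide a row by the sum of its absolute values (left unchanged if that sum is 0)."""
    total = sum(abs(x) for x in row)
    return [x / total for x in row] if total > 0 else list(row)


v1, v2, v3 = [-1.0, -1.0], [1.0, 1.0], [1.0, 1.0]

# Similarities of v3 to v1 and to v2.
sims = [dot(v3, v1), dot(v3, v2)]

# Squaring discards the sign before normalization.
squared = [s * s for s in sims]
adj_row = l1_normalize(squared)

# For comparison: normalizing the raw similarities keeps the sign.
plain_row = l1_normalize(sims)

print(sims, squared, adj_row, plain_row)
```

After squaring, the opposite pair (v1, v3) and the identical pair (v2, v3) receive the same adjacency weight, whereas normalizing the raw similarities keeps them apart by sign.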

Hi @kinredon, I am sorry for the delayed response; I had an exam and got caught up in some work. For a given anchor/proposal, we use the IRG model to mine the corresponding positive proposals for contrastive learning. By performing dot_mat.square(), the model is constrained to learn a better correlation to differentiate positive proposals from negative ones. Thus the negative proposal similarity scores are pushed toward zero and the positive proposal similarity scores toward one.