aliyun/Self-Evolving-Keypoint-Demo

Training Methodology

Closed this issue · 4 comments

Hi,

First of all, thanks for the code.

I have a small question about the training process: during training, do you call .backward(retain_graph=True) on eq. (6) and then .backward() again on eq. (12)? As I understand it, you optimize the descriptor and then use the regularization loss to make sure the detection results stay the same. Am I understanding this correctly, or is the process somewhat different?

"As I understand it, you optimize the descriptor and then use the regularization loss to make sure the detection results stay the same." -- That's right. However, in the implementation we do not call backward() twice; we first compute the final loss in eq. (12) and then call backward() only once.

Oh okay, but then how do you make sure the descriptor is optimized? I mean, wouldn't one need to apply the gradients from eq. (6) first, so that by the time the second pass (the one that keeps the detections the same) happens, we already have an optimized descriptor? Do you have different optimizers for the descriptor and the detector?

You should regard this as a multi-task learning process.

The first task is to optimize the descriptor, which updates the backbone network N_b and the descriptor branch N_des. This task is performed by minimizing the loss L_des, which is eq. (6).

The second task is to keep the detector unchanged. Because the detector branch N_det relies on the backbone N_b, the detection output would change unexpectedly if we did nothing. This task is performed by minimizing the loss L'_det, which is eq. (10).

Under the multi-task learning framework, we can minimize both losses together as in eq. (12). For our case, it is easy to perform this multi-task learning with SGD, since $\frac{\partial L_1}{\partial w} = \frac{\partial L_{des}}{\partial w} + \alpha \frac{\partial L'_{det}}{\partial w}$, where $w$ is a network weight.

In the implementation, a modern deep-learning framework handles this automatically. For example, we use PyTorch: you can write `loss1 = loss_des + alpha * loss_p_det`, then `loss1.backward()` and `optimizer.step()`. Actually, since PyTorch accumulates gradients in `.grad` when backward() is called more than once, the code can equivalently be `loss_des.backward(retain_graph=True)`, then `(alpha * loss_p_det).backward()`, then `optimizer.step()` (retain_graph=True is needed on the first call because both losses share the same graph).
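The equivalence of the two variants described above can be sketched with a minimal PyTorch toy example. The network and both losses here are hypothetical stand-ins (a single linear layer, not the actual N_b/N_des/N_det architecture or the losses of eqs. (6) and (10)); only the backward/accumulation mechanics are the point.

```python
import torch

# Toy stand-ins: a single linear layer instead of the real backbone plus
# branches, and two simple scalar losses instead of L_des and L'_det.
torch.manual_seed(0)
net = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)
alpha = 0.5  # weighting factor, as in eq. (12)

def losses(model):
    out = model(x)
    loss_des = out.pow(2).mean()    # stand-in for L_des, eq. (6)
    loss_p_det = out.abs().mean()   # stand-in for L'_det, eq. (10)
    return loss_des, loss_p_det

# Variant 1: combine the losses first, then a single backward() -- eq. (12).
net.zero_grad()
loss_des, loss_p_det = losses(net)
loss1 = loss_des + alpha * loss_p_det
loss1.backward()
grad_combined = net.weight.grad.clone()

# Variant 2: two backward() calls; PyTorch accumulates gradients in .grad.
# retain_graph=True keeps the shared graph alive for the second pass.
net.zero_grad()
loss_des, loss_p_det = losses(net)
loss_des.backward(retain_graph=True)
(alpha * loss_p_det).backward()
grad_accumulated = net.weight.grad.clone()

# Both variants leave identical gradients in .grad, so a single
# optimizer.step() performs the same multi-task update either way.
print(torch.allclose(grad_combined, grad_accumulated))
```

Either way, only one `optimizer.step()` is taken per iteration, which is why a single optimizer over all parameters suffices.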

Thanks a lot for your quick and detailed response, this solves my problem.