How to calculate the gradient of log p(H; w) with respect to the parameter w?
Closed this issue · 4 comments
By reading the source code of ngransac, I found that the gradient of log p(H; w) with respect to the network output is computed as the number of times each correspondence is picked at random. Can you explain the theory behind this calculation in detail? I find it hard to reach this conclusion myself. Thank you in advance! @ebrach
Hi,
sorry for the delay! Always happy to answer questions about the theory :)
The network predicts log probabilities, and the gradients that we calculate are with respect to the output of the network. That means in the equation above, we do not differentiate wrt w (the network parameters) but wrt log p. The derivative of log p wrt log p is 1. That is why we have a one for every correspondence that has been picked, and if a correspondence was picked multiple times, we sum those 1s up.
The remaining differentiation (log p wrt w) is handled by the deep learning framework.
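A minimal PyTorch sketch of that hand-off (my own illustration, not the actual ngransac code; all names here are made up): the "gradient" we assemble is just a per-correspondence sample count, which we pass to `backward()` so the framework can carry on with d log p / d w.

```python
import torch

n = 5                                        # number of correspondences
log_p = torch.randn(n, requires_grad=True)   # stand-in for the network output

# suppose RANSAC picked these correspondence indices across all minimal sets
sampled = torch.tensor([0, 2, 2, 4])

# d log p(H) / d log p_i = number of times correspondence i was picked
counts = torch.zeros(n)
counts.index_add_(0, sampled, torch.ones(len(sampled)))

# hand the counts to autograd; it then handles d log p / d w for us
log_p.backward(gradient=counts)
print(log_p.grad)  # tensor([1., 0., 2., 0., 1.])
```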
Best,
Eric
Thank you very much for your kind reply. I think I now understand some parts, as follows. According to the chain rule, we can derive an equation that splits the gradient into the derivative of log p(H) wrt the network output and the derivative of the network output wrt w.
In the source code, we calculate the value of the first factor, the derivative of log p(H) wrt log p(y_i).
For this value, is there a detailed derivation, or how can it be explained in theory? Intuitively, for a correspondence that is picked the value is one, and those ones are summed up if the same correspondence is sampled multiple times. But I feel I cannot get it thoroughly.
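To make my question concrete, I believe the chain rule in question is the following (my own notation, with c_i denoting the number of times correspondence y_i is sampled):

```latex
\frac{\partial \log p(\mathcal{H}; w)}{\partial w}
  = \sum_i
    \frac{\partial \log p(\mathcal{H})}{\partial \log p(y_i)}
    \cdot
    \frac{\partial \log p(y_i; w)}{\partial w},
\qquad
\frac{\partial \log p(\mathcal{H})}{\partial \log p(y_i)} = c_i
```

It is the identity on the right, that this derivative equals the sample count c_i, that I would like to understand in detail.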
Thank you again!
I hope my scribbles are useful ;( According to Eq. 2 of the paper, we can decompose the log probability of a hypothesis pool, log p(H), into the sum of the log probabilities log p(y_i) of picking the individual data points that constitute the model's minimal set. This is what the network predicts. For F/E matrix fitting, one data point is one image correspondence, and the network predicts one sampling weight per correspondence.
Then we go to Eq. 5, and instead of differentiating it wrt the network parameters w, we differentiate it wrt log p(y_i), i.e. wrt the log probability of each individual data point. We substitute log p(H) by the sum of the log p(y_i) according to Eq. 2.
What happens now is the following:
a) log p(y_i) differentiated wrt log p(y_i) is one,
b) but only if the data point y_i is actually part of a hypothesis in the pool.
That is why we can assemble the gradients by counting how often each correspondence was sampled according to the prediction of the network.
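As a numerical sanity check (my own sketch, not code from the repository), one can verify with finite differences that the derivative of the pool's log probability wrt each log p(y_i) is exactly the sample count:

```python
import numpy as np

# log-probabilities for 5 correspondences (stand-in for network outputs)
log_p = np.log(np.array([0.1, 0.3, 0.2, 0.25, 0.15]))

# minimal sets sampled for two hypotheses (indices of picked correspondences)
hypotheses = [[0, 2, 4], [2, 3, 4]]

def log_p_pool(lp):
    # Eq. 2: log p of the pool = sum over hypotheses of the sum of log p(y_i)
    return sum(lp[i] for h in hypotheses for i in h)

# analytic gradient: count how often each correspondence was picked
counts = np.bincount([i for h in hypotheses for i in h], minlength=len(log_p))

# numeric gradient check via central differences
eps = 1e-6
for i in range(len(log_p)):
    d = np.zeros_like(log_p)
    d[i] = eps
    num = (log_p_pool(log_p + d) - log_p_pool(log_p - d)) / (2 * eps)
    assert np.isclose(num, counts[i])

print(counts)  # [1 0 2 1 2]
```

Since log p(H) is linear in the log p(y_i), the derivative is exact: each term log p(y_i) contributes a 1 for every time y_i appears in a minimal set, and 0 otherwise.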
Thank you very much for your kind and detailed explanation. I have now understood it thoroughly. Thank you again!