How do you calculate the NLL loss?
fengxin619 opened this issue · 44 comments
Could you tell me where the formula comes from?
We are calculating the -log-likelihood of the ground truth under the bivariate normal distribution as given by the model outputs. The model outputs the means, standard deviations and the correlation coefficient of the bivariate normal distribution
I found your calculation
out = -(torch.pow(ohr, 2) * (torch.pow(sigX, 2) * torch.pow(x - muX, 2) + torch.pow(sigY, 2) * torch.pow(y - muY,2) - 2 * rho * torch.pow(sigX, 1) * torch.pow(sigY, 1) * (x - muX) * (y - muY)) - torch.log(sigX * sigY * ohr))
does not match the formula on Wikipedia:
Could you answer my doubts?
This is the correct computation:
eps_rho = 1e-6
ohr = 1/(np.maximum(1 - rho * rho, eps_rho)) #avoid infinite values
out = (0.5*ohr * (diff_x * diff_x / (sigX * sigX) + diff_y * diff_y / (sigY * sigY)
                  - 2 * rho * diff_x * diff_y / (sigX * sigY))
       + np.log(sigX * sigY) - 0.5*np.log(ohr) + np.log(np.pi*2))
The sigma values were inverted, which may be changed afterward by replacing the output activations for sigma from exp(x) to exp(-x), but this does not affect the results if it is done consistently.
There is an error though: the 0.5 factor and the constant value np.log(np.pi*2) were forgotten.
The constant value does not affect the gradients so it is fine for the learning phase. However it should be fixed for the evaluation.
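Assuming sigX and sigY here denote the actual standard deviations (not their reciprocals), the corrected expression can be checked numerically against scipy's bivariate normal log-density. The parameter values below are arbitrary, for illustration only:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary test values (assumed, for illustration only).
muX, muY = 1.0, -2.0
sigX, sigY, rho = 0.8, 1.5, 0.3   # sigX, sigY are the actual std devs here
x, y = 1.5, -1.0
diff_x, diff_y = x - muX, y - muY

eps_rho = 1e-6
ohr = 1 / np.maximum(1 - rho * rho, eps_rho)  # avoid infinite values
nll = (0.5 * ohr * (diff_x**2 / sigX**2 + diff_y**2 / sigY**2
                    - 2 * rho * diff_x * diff_y / (sigX * sigY))
       + np.log(sigX * sigY) - 0.5 * np.log(ohr) + np.log(2 * np.pi))

# Reference: NLL of the same point under the same bivariate Gaussian.
cov = [[sigX**2, rho * sigX * sigY], [rho * sigX * sigY, sigY**2]]
ref = -multivariate_normal(mean=[muX, muY], cov=cov).logpdf([x, y])
assert np.isclose(nll, ref)
```

The 0.5 factor, the -0.5*log(ohr) term and the log(2*pi) constant are all needed for the two values to agree.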
Actually the formula should be corrected by adding (+log(2pi)) for the NLL loss. It may not affect much in the learning phase but results in a deviation of log(2pi) in evaluation.
Of course, it's fair for comparing different methods. But if we have a good enough prediction, the NLL metric may end up below zero, which seems wrong given the name "negative log-likelihood".
NLL values can be negative, as the likelihood is not a probability and can take values greater than 1. Moreover, the missing 0.5 factor is another error; it is not only the addition of log(2pi) that is needed.
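As a quick illustration that a likelihood (a density) can exceed 1, so that its NLL is negative, consider a sharp 1-D Gaussian:

```python
import numpy as np
from scipy.stats import norm

# A sharp 1-D Gaussian: the density at the mean exceeds 1,
# so the NLL of a point at the mean is negative.
sigma = 0.1                                    # assumed small spread
density = norm(loc=0.0, scale=sigma).pdf(0.0)  # ≈ 3.99
nll = -np.log(density)                         # ≈ -1.38 < 0
assert density > 1 and nll < 0
```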
Hi, do you get the same RMSE results as described in the paper (e.g. 4.37 m at 5 s)?
Hi Zhanghm1819, what RMSE did you get? My evaluation MSE loss is about 57, whose square root is about 7.5, after one epoch.
Yes. The results are quite close to what is reported in the paper if you use the author's implementation. BTW, the NLL loss should be corrected in training and evaluation as @jmercat suggested. (The correction of the loss function does not affect the final RMSE much, though.)
There is a unit problem too: the NLL is not unitless and should be computed in meters, as is done with the RMSE, not in feet. But indeed there is not much impact on the RMSE (which, in my view, is not a good error measure anyway).
Thanks @jmercat for pointing out the bug. Yes, the NLL expression needs two updates, a constant term added and a factor of 2. The RMSE values do not change significantly, nor do the trends in the NLL values. However, the actual values of NLL in the results table need an update.
The units issue seems trickier in my opinion.
The likelihood in this case can be assigned a unit (say, meters^(-1) or feet^(-1)) depending on what we're using to represent det(Sigma)^(-0.5).
However assigning a unit to log-likelihood wouldn't really make sense, as log(1 meter) or log (1 foot) makes no physical sense.
Depending on the unit we use to represent det(Sigma)^(-0.5), all NLL values will get offset by some constant. I can't see a clean way to get around this other than being consistent for all models being compared. It's also another reason why I wouldn't attach too much meaning to the actual value of the negative log likelihoods, and just use the metric for comparison.
I guess you are right that the unit is not really interpretable and it is only a constant offset but for comparisons with other datasets that are mostly in meters and for consistency with the RMSE I think that everything should be computed in meters.
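The constant offset from the choice of unit can be checked numerically. In this sketch with made-up values, converting a 2-D Gaussian prediction from feet to meters shifts the NLL by exactly 2*log(0.3048) ≈ -2.38 nats, regardless of the prediction:

```python
import numpy as np
from scipy.stats import multivariate_normal

FT_TO_M = 0.3048
mu_ft = np.array([10.0, 5.0])        # assumed example prediction, in feet
cov_ft = np.diag([4.0, 9.0])         # assumed example covariance, in feet^2
gt_ft = np.array([11.0, 4.0])        # assumed ground truth, in feet

nll_ft = -multivariate_normal(mu_ft, cov_ft).logpdf(gt_ft)
nll_m = -multivariate_normal(mu_ft * FT_TO_M,
                             cov_ft * FT_TO_M**2).logpdf(gt_ft * FT_TO_M)

# The 2-D change of variables offsets the NLL by the constant 2*log(FT_TO_M).
assert np.isclose(nll_m - nll_ft, 2 * np.log(FT_TO_M))
```

The quadratic term is invariant under the rescaling; only the log-determinant of the covariance changes, which is where the constant 2*log(0.3048) comes from.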
utils.py has been updated with the two changes. Thanks again @jmercat
Could you please link the source for this equation? Thank you!
It is using the same equation. Note that we're taking the negative of the log of f(x,y) shown above. Also sigX and sigY output by the model are the reciprocals of the standard deviations.
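The reciprocal convention only swaps divisions for multiplications and flips the sign of the log term, as this tiny numeric check illustrates (values are arbitrary):

```python
import numpy as np

sigma = 0.8          # an arbitrary standard deviation
s = 1 / sigma        # what the model outputs instead
diff = 0.5           # an arbitrary residual x - mu

# Division by sigma equals multiplication by its reciprocal:
assert np.isclose(diff**2 / sigma**2, diff**2 * s**2)
# log(sigma) equals -log(1/sigma):
assert np.isclose(np.log(sigma), -np.log(s))
```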
Thanks for the reply. I think I have most of this figured out, but I have two questions (both may stem from ignorance/a fundamental misunderstanding of the basic concepts).
- In the output activation, why is standard_devX,Y = e^(sigX,Y)? Reading Graves, I'm not sure I understand why to do this.
- Is the standard deviation used in the maximum likelihood supposed to come from the "real" variables (the actual future trajectories) rather than the standard deviation of the predicted variables? If so, then why use a model output for standard deviation? Wouldn't that just be based on the prediction rather than the "real" variables?
Thanks again.
Following up on this interesting topic, I have another concern:
Does the output of the decoder ensure that the predicted covariance matrix of the future trajectory time steps is positive semi-definite?
Given the predicted parameters, is this ensured?
Shouldn't rho be clipped to [-1, 1]?
Shouldn't sigX be > 0?
Shouldn't sigY be > 0?
Even if we ensure that the covariance matrix is positive semi-definite, how do we deal with singular covariances? In such cases, won't the NLL be NaN?
Good questions
Sigma > 0 is ensured by exponential activations, rho in [-1, 1] by tanh activations
As for the stability issues when |rho| is close to 1, I have questioned this and found a simple solution in my PhD that will be released soon. I will post a link here then.
Thanks for your interest.
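For reference, here is a minimal sketch of such an output activation. The 5-channel layout, shapes and function name are assumptions for illustration, not the repository's exact code:

```python
import torch

def output_activation(raw):
    """Map raw decoder outputs to valid bivariate-Gaussian parameters.

    raw: (..., 5) tensor of [muX, muY, sigX, sigY, rho] before activation.
    (Sketch; the channel layout is an assumption.)
    """
    muX, muY, sigX, sigY, rho = raw.unbind(dim=-1)
    sigX = torch.exp(sigX)   # exponential => sigX > 0
    sigY = torch.exp(sigY)   # exponential => sigY > 0
    rho = torch.tanh(rho)    # tanh => rho in (-1, 1)
    return torch.stack([muX, muY, sigX, sigY, rho], dim=-1)

out = output_activation(torch.randn(4, 25, 5))
assert (out[..., 2:4] > 0).all() and out[..., 4].abs().max() < 1
```

Because tanh is bounded by the open interval (-1, 1), |rho| never reaches 1 exactly, but it can get numerically close, which is where the stability concern above comes from.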
It is modeling both. This is also a result that I show. With a trained model that predicts only one mode as a Gaussian, the covariance of the error computed on many sequences is almost equal to the average predicted covariance that the model estimated for its own error.
@jmercat very interesting, I have to process this information though... I'm working with a multimodal model (CVAE), so I will find your work very suitable for me
Given the deep understanding you have of the field, I'd like to drop another question: at inference time, when no ground truth is available, is the likelihood of the predictions evaluated with the vector of means, so that the exponential part of the likelihood cancels out?
Thanks so much for your time. I really appreciate not only the formal contribution a PhD researcher delivers in the form of papers and theses, but also taking the time to share knowledge through informal channels like this :)
Thanks, my pleasure.
I am not sure I understand your question: You do not estimate the likelihood of the prediction. It cannot be measured. You have to reverse your perspective: You predict a distribution such that the truth should be likely (for that predicted distribution). So you estimate the likelihood of the truth for the given distribution.
In the case of a Gaussian, this is done by computing the Gaussian expression, x, y being the ground truth and rho, sigma_x, sigma_y, mu_x, mu_y being your prediction:
Yes, this is true, and I think it is a typical issue that misleads people about the concept of likelihood... I mean, given a multimodal Gaussian decoder, which predictions should be considered better? Those with the smaller mean covariance matrix along the T future time steps?
Oh OK, I get what you mean. No, actually: a prediction is not more or less likely because it is more or less scattered. So you also need to predict a probability score for each mode.
Then, should I need to add another output to the network with softmax activation and add the cross-entropy loss to the loss of the model?
Then I will be very concerned about the calibration of the model... which leads us to the fascinating field of Bayesian neural networks, hahaha :)
This is a fantastic piece of information! :) In this equation, is k for the mode? What is (i)? And is n_mix the number of future time steps?
I was thinking about using a cross-entropy loss, generating hard labels by assigning a probability of 1 to the prediction with the lowest MAE/MSE during training (evaluated with the means of the estimated Gaussians). Do you think this could work too?
The k and (i) here can be forgotten. k is the time step. (i) stands for the i^th sequence in the dataset. n_mix is the number of modes (number of mixture components in the Gaussian mixture).
Your idea could also work, and I think this is what is done in the code of Deo here. It might lose its meaning as the maximization of the likelihood... but it allows you to train only the most probable mode, which improves the mode diversity in some cases.
You're right, the Deo work is multimodal in the sense of the maneuver, and he's masking with the most probable predicted maneuver... it is not the same multimodality as a generative model, but it helps to understand the formula for the NLL of the mixture of Gaussians. It is worth mentioning that my CVAE is not very diverse... I'm training it as a deterministic decoder with an MAE loss...
@jmercat thanks so much, I won't take more of your time at the moment... I'll process all of this information and look forward to you publishing your work! Thank you, it has been very elucidating! Congrats! Have you already published any papers?
Hi again,
Just for clarification, the likelihood that is being maximized in the training of the DNN is:
with the ground truth and the output vector of the DNN.
From your comment it seems to be the first one...
It is complex, yes... I was just wondering... :)
And lastly, until you release your PhD results...
Ahah, yes for the new output and the softmax, but you can use the log-sum-exp of the NLL (this is the NLL of a Gaussian mixture) instead of the cross-entropy.
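A minimal sketch of that log-sum-exp mixture NLL (shapes and names are assumptions for illustration):

```python
import torch

def mixture_nll(log_w, nll_per_mode):
    """NLL of a Gaussian mixture from per-mode NLLs and log mixture weights.

    log_w: (batch, n_mix) log-softmax mode scores.
    nll_per_mode: (batch, n_mix) per-mode Gaussian NLLs of the ground truth.
    (Sketch; the shapes are assumptions.)
    """
    # log p(x) = logsumexp_k( log w_k + log p_k(x) ) = logsumexp_k( log w_k - NLL_k )
    return -torch.logsumexp(log_w - nll_per_mode, dim=-1)

# Sanity check: with a single mode of weight 1, the mixture NLL equals the mode NLL.
nll = torch.tensor([[1.7]])
assert torch.isclose(mixture_nll(torch.zeros(1, 1), nll), nll).all()
```

Using logsumexp rather than exponentiating and summing manually keeps the computation numerically stable for very negative log-likelihoods.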
In order to compute the batch-wise loss, should I sum or average along the time axis? For the batch axis, obviously average.
I chose to average on my part but this should not have any impact if you change the learning rate accordingly.
Thanks a lot! I'm just looking at your work on Google Scholar, which is indeed very, very interesting. I'll dig into your papers :) Thanks for your time again!
Hello! Thank you for offering the correct computation, but I still have the following questions:
- In the outputActivation function in https://github.com/nachiket92/conv-social-pooling/blob/master/utils.py#L152 (line 152 at d1abe19), why is sigX converted to the reciprocal of the standard sigma (1/sigX)?
- Because the sigX and sigY output by the model are the reciprocals of the standard deviations, the divisions by sigma in out should be replaced with multiplications:
out = 0.5*ohr * (diff_x * diff_x * sigX * sigX + diff_y * diff_y * sigY * sigY - 2 * rho * diff_x * diff_y * sigX * sigY) - np.log(sigX * sigY) - 0.5*np.log(ohr) + np.log(np.pi*2)
- I modified out based on your formula, and I can get RMSE results close to the paper's, but the NLL value is quite different from the original paper. After I run the program, the NLL result is 5.3911 (5 s), while the paper reports 4.22 (5 s). The NLL values at 1 s, 2 s, 3 s and 4 s are also quite different from the original paper, so I want to ask what the reason for this is. Can you get the NLL values the paper shows?
The question about the NLL has troubled me for a long time; I hope you can help me answer it. Thanks very much!
Hi, it might be late but here is a link to my PhD manuscript where I write about this: https://jean-mercat.netlify.app/media/PhD.pdf
Hi, I think this line of code (utils.py, line 199 at d1abe19) is incorrect:
0.5*torch.pow(sigY, 2)*torch.pow(y-muY, 2) - rho*torch.pow(sigX, 1)*torch.pow(sigY, 1)*(x-muX)*(y-muY)
This part should be changed to
torch.pow(sigY, 2)*torch.pow(y-muY, 2) - 2*rho*torch.pow(sigX, 1)*torch.pow(sigY, 1)*(x-muX)*(y-muY)
Same as line 170.
@nachiket92 @jmercat
Hi, guys
Could you point out how the -0.5160 in the lines below (utils.py, lines 227 to 228 at d1abe19) is calculated?
For question 1, the reason for e^(sigX,Y) is to ensure sigX,Y > 0.
What's even more helpful (especially to someone like me who just entered this field) than this informative issue discussion is your dissertation. Thank you very much for linking it here @jmercat and impressive work.