udacity/machine-learning

R2 definition in the Boston_housing project


Hello!

I have some issues with the R2 definition in this project. The notebook states:

"The values for R2 range from 0 to 1, which captures the percentage of squared correlation between the predicted and actual values of the target variable."
This is, I believe, contradicted a few lines later:
"A model can be given a negative R2 as well, which indicates that the model is arbitrarily worse than one that always predicts the mean of the target variable."

I think some clarification is needed. R2 captures the squared correlation between the predicted and actual values (I don't see why percentages are involved here), or the squared multiple correlation if there is more than one predictor, under one assumption: that cov(prediction, error) = 0. That condition means the model has, in a sense, done its job: the value of the prediction cannot help to guess the error. If it could, the model could be adjusted to fold the guessed error back into the prediction.
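For concreteness, here is a minimal numerical sketch of that claim (assuming numpy and scikit-learn, with made-up data; none of this comes from the notebook itself). For an ordinary least-squares fit, the residuals are uncorrelated with the predictions, and R2 matches the squared correlation between predictions and targets:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up illustration data: one noisy linear relationship.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=200)

# Ordinary least squares: residuals are orthogonal to the predictions,
# so cov(prediction, error) = 0 and R2 equals the squared correlation.
pred = LinearRegression().fit(X, y).predict(X)

print(r2_score(y, pred))                # R2 of the fitted model
print(np.corrcoef(y, pred)[0, 1] ** 2)  # squared correlation: same number
```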

Obviously this condition holds for a well-calibrated linear model, so the definition is true in that case. But it is misleading to believe that it holds for all prediction models. As the notebook suggests, you can always find a worse model (with cov(prediction, error) far from zero) and get a negative R2, but then R2 is not a squared correlation anymore.
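To illustrate that second point, here is the same kind of sketch (again with assumed, made-up data) for a deliberately miscalibrated predictor: the squared correlation stays close to 1, yet R2 goes strongly negative because the errors are large and correlated with the predictions, so R2 is no longer a squared correlation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same made-up data as in the previous sketch.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=200)

# Take the well-calibrated OLS predictions and deliberately miscalibrate them:
# rescaling and shifting makes the errors large and correlated with the predictions.
pred = LinearRegression().fit(X, y).predict(X)
bad_pred = 10.0 * pred + 50.0

print(r2_score(y, bad_pred))                # strongly negative R2
print(np.corrcoef(y, bad_pred)[0, 1] ** 2)  # squared correlation still close to 1
```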

Hope that helps explain this contradiction!
Have a good day,
Pierre