rsquare not robust
It looks like rsquare is calculated as "SSR / SST". For a more robust solution (explained below), I suggest using "1 - SSE / SST" instead. To do this, the function could be rewritten as:
rsquare <- function(model, data) {
  1 - stats::var(residuals(model, data)) / stats::var(response(model, data), na.rm = TRUE)
}
Happy to submit a PR later, but I've got one live now, so don't want to tangle them up.
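Explanation
"SSR / SST" has a few caveats. The two formulas agree only when SST = SSR + SSE, which holds for a least-squares fit with an intercept evaluated on the data it was fitted to. Most relevant here (I think), "SSR / SST" doesn't apply when a model fitted on one data set (training data) is used to predict new data (a test set). Here's a worked example where we should get a negative R-squared (the predicted values are worse than the mean), but we get Inf.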
library(modelr)
set.seed(12)
# Training data
train <- data.frame(
  x = 1:10,
  y = 1:10 + rnorm(10, sd = .1)
)
# Test data with constant `y`
test <- data.frame(
  x = 1:10,
  y = 5
)
mod <- lm(y ~ x, train)
rsquare(mod, train)
#> [1] 0.9989631
rsquare(mod, test)
#> [1] Inf
# Safer to calculate R-squared using residuals (easy to see in plot)
plot(test, ylim = c(1, 10))
abline(coefficients(mod))
I agree. I was about to open the same issue, since this is quite a severe methodological problem. Using "SSR / SST" is a simplification that only holds under certain conditions, and it definitely fails for test sets, which I consider a major use case in model validation. The more general and robust way to calculate R^2, as already pointed out, is to use the original definition: "1 - SSE / SST".
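For reference, here is a minimal sketch of the "1 - SSE / SST" definition applied to the example from this issue. It uses only base R; the helper name rsquare_sse, its response argument, and the extra test_rev data set are assumptions made for illustration and are not part of modelr.

```r
# Hypothetical helper (not part of modelr): R-squared as 1 - SSE / SST,
# computed on whatever data set is passed in.
# `response` is an assumed argument naming the response column.
rsquare_sse <- function(model, data, response = "y") {
  obs <- data[[response]]
  pred <- stats::predict(model, newdata = data)
  sse <- sum((obs - pred)^2, na.rm = TRUE)                     # residual sum of squares
  sst <- sum((obs - mean(obs, na.rm = TRUE))^2, na.rm = TRUE)  # total sum of squares
  1 - sse / sst
}

set.seed(12)
train <- data.frame(x = 1:10, y = 1:10 + rnorm(10, sd = 0.1))
test <- data.frame(x = 1:10, y = 5)           # constant y, as in the issue
test_rev <- data.frame(x = 1:10, y = 10:1)    # illustrative: y runs against the fit

mod <- lm(y ~ x, train)

r2_train <- rsquare_sse(mod, train)    # agrees with summary(mod)$r.squared
r2_test <- rsquare_sse(mod, test)      # SST is 0 here, so the result is -Inf
r2_rev <- rsquare_sse(mod, test_rev)   # finite and negative
```

On the training data the two definitions coincide, so nothing changes there. On the constant test set the result is -Inf rather than Inf, and on a test set with some variance in y it is a finite negative number, which is the behaviour argued for above.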