tidyverse/modelr

rsquare not robust

Closed this issue · 1 comment

Looks like the rsquare calculation is "SSR / SST". For a more robust solution (explanation below), I suggest using "1 - SSE / SST" instead. For this, the function could be rewritten as:

# residuals() and response() here refer to modelr's internal helpers, not stats::residuals()
rsquare <- function(model, data) {
  1 - stats::var(residuals(model, data)) / stats::var(response(model, data), na.rm = TRUE)
}

Happy to submit a PR later, but I've already got one open, so I don't want to tangle them up.

Explanation

"SSR / SST" has few caveats. Very relevant (I think), it doesn't apply when a model fitted on one data set (training data) is used to predict new data (test set). Here's a worked example where we should get a negative R-Squared (predicted values are worse than the mean), but we get Inf.

library(modelr)

set.seed(12)
# Training data
train <- data.frame(
  x = 1:10,
  y = 1:10 + rnorm(10, sd = .1)
)
# Test data with constant `y`
test <- data.frame(
  x = 1:10,
  y = 5
)

mod <- lm(y ~ x, train)
rsquare(mod, train)
#> [1] 0.9989631

rsquare(mod, test)
#> [1] Inf

# Safer to calculate R-squared using residuals (easy to see in plot)
plot(test, ylim = c(1, 10))
abline(coefficients(mod))

(Plot: the constant test response y = 5 with the regression line fitted on the training data overlaid.)
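
A quick numeric check makes the same point without the plot (a sketch using the objects defined above; sse and sst are just illustrative names):

# Sum of squared errors of the model's predictions on the test set...
sse <- sum((test$y - predict(mod, test))^2)
# ...versus the total sum of squares around the test-set mean
sst <- sum((test$y - mean(test$y))^2)
sse  # large, because the fitted line is far from y = 5 over most of the range
sst  # exactly 0, because the test response is constant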

I agree; I just wanted to open the same issue since it's quite a severe methodological problem. Using "SSR / SST" is a simplification that only holds under certain conditions, and it definitely fails for test sets, which I consider a major use case for model validation. The more general and robust way to calculate R^2, as pointed out already, is to use the original definition: "1 - SSE / SST".
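
For reference, here's a minimal sketch of that definition computed directly from predictions (rsq_general is a hypothetical name, not part of modelr, and it assumes the response column is literally called y):

rsq_general <- function(model, data) {
  y    <- data$y                                             # assumes the response is stored in column `y`
  yhat <- stats::predict(model, data)
  sse  <- sum((y - yhat)^2, na.rm = TRUE)                    # residual sum of squares
  sst  <- sum((y - mean(y, na.rm = TRUE))^2, na.rm = TRUE)   # total sum of squares
  1 - sse / sst
}

Applied to mod and train from the example above it should match the current rsquare(); applied to mod and test it comes out negative (here -Inf, since the constant test response makes SST zero) rather than the spurious Inf.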