RSGInc/SOABM

Verify R-squared calculation in ABM Viz tool

Closed this issue · 12 comments

In working to improve the calibration for SOABM, I am noticing that both Visum's internal calculation and Excel (when I bring the count comparison into a spreadsheet) produce an R-squared for the PM peak of about 0.78. The viz tool is reporting ~0.87 for the same data. The other measures, like the Gap statistics, suggest that 0.78 is more correct than 0.87.

The team needs to go into the R code in the viz tool and verify that the calculation used to determine the R-squared in the count comparison makes sense. This is loosely connected to issue #65, which also deals with the comparison of network data, so whichever issue is addressed / reviewed first, the other should be looked at as well.

Looking into this more - we might also need to verify that I am running the assignment correctly outside of the ABM. This relates to issue #58, where the user needs to be able to understand and work with the capacities and the assignment (and re-assignment) that the ABM conducts.

To clarify - it's possible that the R-squared difference arises because I don't know how to properly duplicate the ABM assignment process outside of the ABM run, which would be an issue with the user guide.

No - I think I just proved that the viz tool R-squared is off. I went to an ABM result without altering (re-running) the assignment, plotted the counts versus assigned volumes in Excel, and got ~0.77 for the PM count. So I think the viz tool is not reporting the R-squared correctly.

@bettinardi : I checked the Visualizer setup and verified that it plots the value computed by R's linear regression function, lm(). lm() computes the R-squared differently when the intercept is constrained to zero: the total sum of squares is taken about zero rather than about the mean of y, which results in a higher R-squared value. This is done to ensure that the R-squared is never negative. In Excel and other software, however, the R-squared can be negative when the intercept is constrained. More documentation on this can be found here:
https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-does-summary_0028_0029-report-strange-results-for-the-R_005e2-estimate-when-I-fit-a-linear-model-with-no-intercept_003f
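A minimal sketch of the difference, using made-up data (not the SOABM counts); the "centered" line below is an assumption about the conventional formula Excel/Visum appear to be using, based on the discussion above:

set.seed(1)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 20)           # made-up example data
m0 <- lm(y ~ x - 1)                       # zero-intercept fit, as in the Visualizer
r2_uncentered <- summary(m0)$r.squared    # R's value: 1 - sum(resid^2) / sum(y^2)
r2_centered   <- 1 - sum(resid(m0)^2) / sum((y - mean(y))^2)  # conventional (Excel-style) formula
c(R = r2_uncentered, Excel_style = r2_centered)  # the uncentered value is always the larger of the two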

An alternative solution would be to report the linear equation and R-squared with an intercept.

Excel allows the user to set the intercept to zero. Either way, I am not able to duplicate the linear trends that the visualizer shows in any other software. Can you point me to the section of code that processes the count / volume comparison so that I can try to duplicate and understand what is occurring?

Yes, Excel allows the user to set the intercept to zero, and so do R and other software. The difference is in how Excel computes R-squared with a zero intercept versus how R computes it.
The link I shared explains how R computes it, which is different from how Excel does it.
This is not a data processing issue. Both Excel and R use the same data but apply different formulas for computing R-squared in the zero-intercept case.
This is what the link I shared says about this:

As described in ?summary.lm, when the intercept is zero (e.g., from y ~ x - 1 or y ~ x + 0), summary.lm() uses the formula R^2 = 1 - Sum(R[i]^2) / Sum((y[i])^2) which is different from the usual R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - mean(y))^2). There are several reasons for this:

- Otherwise the R^2 could be negative (because the model with zero intercept can fit worse than the constant-mean model it is implicitly compared to).
- If you set the slope to zero in the model with a line through the origin you get fitted values y*=0.
- The model with constant, non-zero mean is not nested in the model with a line through the origin.
All these come down to saying that if you know a priori that E[Y]=0 when x=0 then the ‘null’ model that you should compare to the fitted line, the model where x doesn’t explain any of the variance, is the model where E[Y]=0 everywhere. (If you don’t know a priori that E[Y]=0 when x=0, then you probably shouldn’t be fitting a line through the origin.)
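To make the first bullet concrete, here is a small sketch (made-up data, not SOABM output) where a forced-origin fit is worse than the constant-mean model: the conventional centered R-squared goes negative, while R's uncentered version stays positive:

set.seed(2)
x <- 1:50
y <- 100 + rnorm(50, sd = 5)                 # roughly constant y: a line through the origin fits badly
m0 <- lm(y ~ x - 1)
summary(m0)$r.squared                        # uncentered: non-negative by construction
1 - sum(resid(m0)^2) / sum((y - mean(y))^2)  # centered: strongly negative for this data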

The Visualizer can be updated to report Excel's R-squared value. If this is desired, then I will add it to the contingency scope.

I guess I'm trying to understand which is correct.
The visualizer has an R-squared close to 0.9; Excel and Visum are closer to 0.8. It doesn't seem like both methods could be correct and still be called the same thing - R-squared. So I'm trying to understand the difference, and currently I don't. Can you point me to the lines of code in the visualizer that do the lm work?

The lines of code are in the visualizer\templates\template.rmd file.

The lm_eqn function is defined on lines 353-357:

lm_eqn <- function(df){
  m <- lm(y ~ x - 1, df)
  eq <- paste("Y = ", format(coef(m)[1], digits = 2), " * X, ",
              "r2=", format(summary(m)$r.squared, digits = 3), sep = "")
  return(eq)
}

Check lines 2689-2840 to see how that function is applied to the data to produce the scatter plots.
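For reference, a hypothetical usage (not copied from template.rmd): the function expects a data frame with an x column (counts) and a y column (assigned volumes) and returns the annotation string placed on the scatter plot.

set.seed(3)
counts  <- runif(100, 100, 2000)                 # made-up example counts
volumes <- 0.95 * counts + rnorm(100, sd = 200)  # made-up example assigned volumes
lm_eqn(data.frame(x = counts, y = volumes))
# returns a string of the form "Y = <slope> * X, r2=<R-squared>"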

Trying to dig in now. As I am, I'm noticing a file that looks like it should be removed from the repo - is it correct that this file should not be tracked: https://github.com/RSGInc/SOABM/blob/master/template/visualizer/templates/.Rhistory

I sent an email with my work. I think the conclusion is that we need to change the lm to a simple lm(y ~ x), so that the equation states a non-zero intercept and we get the same result as Excel.

Yes, a non-zero intercept would result in the same R-squared value. I will update the Visualizer.
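A minimal sketch of what the updated function might look like (an assumption about the eventual change, not the committed code), fitting with an intercept so that summary()$r.squared uses the conventional centered formula:

lm_eqn <- function(df){
  m <- lm(y ~ x, df)   # intercept is now estimated rather than forced to zero
  eq <- paste("Y = ", format(coef(m)[2], digits = 2), " * X + ",
              format(coef(m)[1], digits = 2), ", ",
              "r2=", format(summary(m)$r.squared, digits = 3), sep = "")
  return(eq)
}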

All scatter plots will now have an intercept. Updated under contingency work.