RSGInc/SOABM

Verify R-squared calculation in ABM Viz tool

Closed this issue · 12 comments

In working to improve the calibration for SOABM, I am noticing that both Visum's internal calculation and Excel (when I bring the count comparison into a spreadsheet) produce an R-squared for the PM peak of about 0.78. The viz tool is reporting ~0.87 for the same data. The other measures, like the Gap statistics, suggest that 0.78 is more correct than 0.87.

The team needs to go into the R code in the viz tool and verify that the calculation used to determine the R-squared in the count comparison makes sense. This is loosely connected to issue #65, which also deals with the comparison of network data, so whichever issue is addressed / reviewed first, the other should be looked at as well.

Looking into this more - we might also need to verify that I am running the assignment correctly outside of the ABM. This relates to issue #58, where the user needs to be able to understand and work with the capacities and the assignment (and re-assignment) that the ABM conducts.

To clarify - it's possible that the R-squared difference arises because I don't know how to properly duplicate the ABM assignment process outside of the ABM run, which would be an issue with the user guide.

No - I think I just proved that the viz tool R-squared is off. I went to an ABM result without altering (re-running) the assignment, plotted the counts versus assigned volumes in Excel, and got ~0.77 for the PM count. So I think the viz tool is not reporting the R-squared correctly.

@bettinardi : I checked the Visualizer setup and verified that it plots the value computed by R's linear regression function, lm(). lm() computes the R-squared differently when the intercept is constrained to zero: the total sum of squares is taken about zero rather than about the mean of y, which results in a higher R-squared value. This is done to ensure that the R-squared is never negative. In Excel and other software, however, the R-squared can be negative when the intercept is constrained. More documentation on this can be found here:
https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-does-summary_0028_0029-report-strange-results-for-the-R_005e2-estimate-when-I-fit-a-linear-model-with-no-intercept_003f
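A minimal sketch of the difference, using made-up data (not the SOABM counts); the "centered" line below is an assumption about the conventional formula Excel/Visum appear to be using, based on the discussion above:

set.seed(1)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 20)           # made-up example data
m0 <- lm(y ~ x - 1)                       # zero-intercept fit, as in the Visualizer
r2_uncentered <- summary(m0)$r.squared    # R's value: 1 - sum(resid^2) / sum(y^2)
r2_centered   <- 1 - sum(resid(m0)^2) / sum((y - mean(y))^2)  # conventional (Excel-style) formula
c(R = r2_uncentered, Excel_style = r2_centered)  # the uncentered value is always the larger of the two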

An alternative solution would be to report the linear equation and R-squared with an intercept.

Excel allows the user to set the intercept to zero. Either way, I am not able to duplicate the linear trends that the visualizer shows in any other software. Can you point me to the section of code that processes the count / volume comparison so that I can try to duplicate and understand what is occurring?

Yes, Excel allows the user to set the intercept to zero, and so do R and other software. The difference is in how Excel computes R-squared with a zero intercept versus how R computes it.
The link I shared explains how R computes it, which is different from how Excel does it.
This is not a data processing issue. Both Excel and R use the same data but apply different formulas for computing R-squared in the zero-intercept case.
This is what the link I shared says about this:

As described in ?summary.lm, when the intercept is zero (e.g., from y ~ x - 1 or y ~ x + 0), summary.lm() uses the formula R^2 = 1 - Sum(R[i]^2) / Sum((y[i])^2) which is different from the usual R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - mean(y))^2). There are several reasons for this:

- Otherwise the R^2 could be negative (because the model with zero intercept can fit worse than the constant-mean model it is implicitly compared to).
- If you set the slope to zero in the model with a line through the origin you get fitted values y*=0.
- The model with constant, non-zero mean is not nested in the model with a line through the origin.
All these come down to saying that if you know a priori that E[Y]=0 when x=0 then the ‘null’ model that you should compare to the fitted line, the model where x doesn’t explain any of the variance, is the model where E[Y]=0 everywhere. (If you don’t know a priori that E[Y]=0 when x=0, then you probably shouldn’t be fitting a line through the origin.)
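To make the first bullet concrete, here is a small sketch (made-up data, not SOABM output) where a forced-origin fit is worse than the constant-mean model: the conventional centered R-squared goes negative, while R's uncentered version stays positive:

set.seed(2)
x <- 1:50
y <- 100 + rnorm(50, sd = 5)                 # roughly constant y: a line through the origin fits badly
m0 <- lm(y ~ x - 1)
summary(m0)$r.squared                        # uncentered: non-negative by construction
1 - sum(resid(m0)^2) / sum((y - mean(y))^2)  # centered: strongly negative for this data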

The Visualizer can be updated to report Excel's R-squared value. If this is desired, then I will add it to the contingency scope.

I guess I'm trying to understand which is correct.
The visualizer has an R-squared close to 0.9; Excel and Visum are closer to 0.8. It doesn't seem like both methods could be correct and still be called the same thing - R-squared. So I'm trying to understand the difference, and currently I don't. Can you point me to the lines of code in the visualizer that do the lm work?

The lines of code are in the visualizer\templates\template.rmd file.

The lm_eqn function is defined on lines 353-357:

lm_eqn <- function(df){
  m <- lm(y ~ x - 1, df)
  eq <- paste("Y = ", format(coef(m)[1], digits = 2), " * X, ",
              "r2=", format(summary(m)$r.squared, digits = 3), sep = "")
  return(eq)
}

Check lines 2689-2840 to see how that function is applied to the data to produce the scatter plots.
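For reference, a hypothetical usage (not copied from template.rmd): the function expects a data frame with an x column (counts) and a y column (assigned volumes) and returns the annotation string placed on the scatter plot.

set.seed(3)
counts  <- runif(100, 100, 2000)                 # made-up example counts
volumes <- 0.95 * counts + rnorm(100, sd = 200)  # made-up example assigned volumes
lm_eqn(data.frame(x = counts, y = volumes))
# returns a string of the form "Y = <slope> * X, r2=<R-squared>"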

Trying to dig in now. As I am, I'm noticing a file that looks like it should be removed from the repo - is it correct that this file should not be tracked: https://github.com/RSGInc/SOABM/blob/master/template/visualizer/templates/.Rhistory

I sent an email with my work. I think the conclusion is that we need to change the lm to a simple lm(y ~ x), so that the equation states a non-zero intercept and we get the same result as Excel.

Yes, a non-zero intercept would result in the same R-squared value. I will update the Visualizer.
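A minimal sketch of what the updated function might look like (an assumption about the eventual change, not the committed code), fitting with an intercept so that summary()$r.squared uses the conventional centered formula:

lm_eqn <- function(df){
  m <- lm(y ~ x, df)   # intercept is now estimated rather than forced to zero
  eq <- paste("Y = ", format(coef(m)[2], digits = 2), " * X + ",
              format(coef(m)[1], digits = 2), ", ",
              "r2=", format(summary(m)$r.squared, digits = 3), sep = "")
  return(eq)
}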

All scatter plots will now have an intercept. Updated under contingency work.