rgiordan/zaminfluence

`ComputeModelInfluence` is slow with a large number of fixed effects

Closed this issue · 6 comments

When a regression model has a large number (~1000) of fixed effects, `ComputeModelInfluence` is prohibitively slow.

The main culprit is the for loop in `GetRegressionSEDerivs`, which calls `solve` with a QR decomposition of `X'X` once per regressor.
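To make the bottleneck concrete, here is a schematic sketch of the pattern (not the actual `GetRegressionSEDerivs` source; `x` stands in for the design matrix):

```r
# Schematic of the expensive loop: one solve against a QR decomposition
# of X'X for each of the p regressors.
xtx_qr <- qr(t(x) %*% x)           # p x p system; p ~ 1000 with many FEs
se_derivs <- vector("list", ncol(x))
for (k in seq_len(ncol(x))) {
  e_k <- numeric(ncol(x))
  e_k[k] <- 1
  # Each solve is O(p^2) given the QR, so the whole loop is O(p^3).
  se_derivs[[k]] <- solve(xtx_qr, e_k)
}
```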

Probably the user doesn't need the standard error sensitivity for every fixed effect, so the easiest solution is to compute the sensitivity only for a small, user-specified set of regressors.

Even better would be to somehow use sparse design matrices, but that is probably not easily supported as a method for `lm` objects. One possibility would be to convert `X` to a sparse matrix at the beginning of `GetRegressionSEDerivs`.
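As a hedged sketch of that conversion idea, assuming the Matrix package (`x` and `y` are stand-ins for the design matrix and response):

```r
library(Matrix)
x_sparse <- Matrix(x, sparse = TRUE)  # FE dummy columns are mostly zeros
xtx <- crossprod(x_sparse)            # sparse X'X
betahat <- solve(xtx, crossprod(x_sparse, y))  # sparse Cholesky solve
```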

I don't think the problem is solved by only computing the sensitivity for specific coefficients. The reason is that the (I believe) problematic for loop here actually needs to run over all the components, since it is part of the chain rule applied to the whole `betahat` vector here. I just forgot about this when I wrote up the issue.

I think it might help to express subsets of the covariance matrix using Schur complements, but differentiating that explicitly will be tedious. It makes me want to revisit autodiff solutions that will work with R.
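For reference, the identity I have in mind (my notation, not anything in the package): partition `X'X` into a block `A` for the regressors whose standard errors you want, a block `D` for everything else, and cross terms `B`. Then

$$
X^\top X = \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix},
\qquad
\left[(X^\top X)^{-1}\right]_{11} = \left(A - B D^{-1} B^\top\right)^{-1},
$$

so the block of the covariance matrix you actually need never requires forming the full inverse. The catch is that differentiating the right-hand side by hand still runs through $D^{-1}$, which is where the tedium comes in.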

In the meantime, perhaps an option to skip computing the sensitivity of the standard errors could serve as a patch.

Converting all the variables to deviations from the group mean within the fixed effects, and then running `ComputeModelInfluence` on the transformed `X` matrix, should really help, no?
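A sketch of the within-transformation being suggested (`df`, `y`, `x`, and `fe` are illustrative names, not anything from the package):

```r
df$y_dm <- df$y - ave(df$y, df$fe)    # demean y within each FE group
df$x_dm <- df$x - ave(df$x, df$fe)    # demean x within each FE group
fit_dm <- lm(y_dm ~ x_dm, data = df)  # FEs absorbed, far fewer columns
```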

@akarlinsky I think the problem is differentiating through that operation by hand, which will be tedious. And, of course, your particular formulation only works with binary-valued fixed effect regressors, and we'd like to support the general case of high-dimensional regression where you only care about sensitivity of a few regressors. The best way to do this is probably to express the needed sub-matrix of the standard error covariance matrix using a Schur complement, but, again, differentiating this by hand will be tedious.

I think that replacing all my hand-coded derivatives with R's `torch` package will solve this problem and even improve speed in the low-dimensional case. The key is that you can get derivatives for a subset of the parameters, and torch deals with the rest.
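A minimal sketch of what I mean, using the torch package's `autograd_grad` (this is experimental scratch code, not zaminfluence internals; all names are mine). Only the observation weights are marked `requires_grad`, so autograd tracks just the quantities we need:

```r
library(torch)

n <- 50; p <- 3
x <- torch_tensor(matrix(rnorm(n * p), n, p))
y <- torch_tensor(rnorm(n))
w <- torch_tensor(rep(1, n), requires_grad = TRUE)  # observation weights

# Weighted least squares: betahat = (X' W X)^{-1} X' W y.
xtwx <- torch_matmul(x$t(), x * w$unsqueeze(2))
xtwy <- torch_matmul(x$t(), (y * w)$unsqueeze(2))
betahat <- linalg_solve(xtwx, xtwy)

# Gradient of a single coefficient with respect to all n weights,
# without hand-coding any of the chain rule.
dbeta1_dw <- autograd_grad(betahat[1, 1], w)[[1]]
```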

Work in progress:

https://github.com/rgiordan/zaminfluence/blob/fix_15/examples/rtorch_experiments.R

The `fix_15` branch seems to have fixed this problem by allowing users to specify a `keep_pars` argument to `ComputeModelInfluence`. I want to test a little more before merging.
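A hypothetical usage sketch (the exact call signature is a guess, not confirmed API; `df`, `x1`, `x2`, and `fe` are made-up names):

```r
library(zaminfluence)
fit <- lm(y ~ x1 + x2 + factor(fe), data = df)
infl <- ComputeModelInfluence(fit, keep_pars = c("x1", "x2"))
```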

See #32