Benchmark linear models in higher dimensions
ogrisel opened this issue · 4 comments
The current benchmarks only use 50 features for 1e6 samples. I would argue that this is not a case where one would use a linear model, as it would under-fit, and the same test accuracy could probably be reached much faster with 1e3 data points instead of 1e6, yielding a speed-up on the order of 1000x.
It would therefore be more interesting to benchmark linear regression, ridge regression and logistic regression in regimes with on the order of 1e3 to 1e5 features.
In particular, ridge regression is likely to be most useful in cases where n_features >> n_samples; otherwise, linear regression (no penalty) is likely to give the same result.
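As a rough illustration of the suggested regime, one could time the estimators on synthetic data as in the sketch below. This snippet is not part of the benchmark suite; the dataset shape is an assumption chosen so that n_samples / n_features sits at the low end of the proposed range.

```python
# Hypothetical sketch of a higher-dimensional benchmark run; the shapes below
# are illustrative, not the ones used by the existing benchmarks.
from time import perf_counter

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# n_samples / n_features = 10, i.e. the high-dimensional end of the range.
X, y = make_regression(n_samples=10_000, n_features=1_000,
                       n_informative=500, noise=1.0, random_state=0)

for name, model in [("LinearRegression", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0))]:
    tic = perf_counter()
    model.fit(X, y)
    print(f"{name}: {perf_counter() - tic:.2f}s")
```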
@ogrisel So your suggestion is to keep the ratio n/p in the range 10 to 1000 for the purposes of benchmarking?
I do not think DAAL allows num_features > n_samples for Ridge regression but I see your point.
Also, could you compare against Ridge(alpha=1e-9, solver="cholesky", copy_X=False) instead of LinearRegression, which should give the same result but much faster?
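A minimal sketch of that comparison might look like the following; the dataset is synthetic and its shape is an assumption, only the `Ridge` parameters come from the suggestion above.

```python
# Sketch comparing LinearRegression to a near-unpenalized Ridge with the
# Cholesky solver; dataset shape is illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=20_000, n_features=1_000, noise=1.0,
                       random_state=0)

lr = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1e-9, solver="cholesky", copy_X=False).fit(X, y)

# With a negligible penalty, both should recover essentially the same
# coefficients, so the interesting difference is the fit time.
print("max |coef diff|:", np.max(np.abs(lr.coef_ - ridge.coef_)))
```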
> I do not think DAAL allows num_features > n_samples for Ridge regression but I see your point.
OK, let's close this issue then, as there is nothing to do on DAAL's side. One may argue that users still want to run linear regression in that regime, though.
For reference a user also reported a related performance problem in scikit-learn: scikit-learn/scikit-learn#13923
I opened scikit-learn/scikit-learn#14268 and scikit-learn/scikit-learn#14269 on scikit-learn's side.