mathurinm/celer

Warm starts and CPUs in use for the GroupLasso

georgestod opened this issue · 6 comments

Hi!
I was wondering how you are leveraging warm starts in GroupLassoCV?
Perhaps between folds for a given alpha?
By default, it seems many cores are being used by GroupLasso and GroupLassoCV. In the case of GroupLassoCV, performance seems to decrease when n_jobs is set to -1.

Thanks for your work in any case!

Hi @georgestod ,
Thanks for your feedback.
Warm start is leveraged inside each fold, from one alpha to the next.
Using warm start across folds could cause data leakage (if the problems are not solved up to high precision) in my opinion.
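
To illustrate the idea outside of celer: here is a minimal sketch of warm starting along a decreasing alpha path, using scikit-learn's Lasso (which exposes warm_start explicitly). This is only an illustration of the principle, not celer's internal code.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(50, 20)
y = rng.randn(50)

# decreasing alphas: each fit starts from the coefficients of the previous one,
# which is much cheaper than solving every alpha from scratch
clf = Lasso(warm_start=True)
for alpha in np.geomspace(1, 1e-3, num=10):
    clf.set_params(alpha=alpha)
    clf.fit(X, y)  # reuses clf.coef_ from the previous alpha as initialization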

The default parameter is n_jobs=None, so a single core should be used. In my experience, using a higher n_jobs helps (at least for the Lasso). Can you post a script with timings where using n_jobs=-1 hurts performance?

Thanks for your quick reply; a minimal example would be this one:

from celer import GroupLassoCV
import numpy as np
from time import time
from scipy.linalg import block_diag

### my data consists of 3d arrays
X = np.array([[[3, 1], [1, 0], [1, 0]],[[0, 2], [2, 3], [2, 3]]], dtype=float)
y = X.sum(axis=2) + 2


### I put them in block format
thetas = []
dts    = []
for i in range(2):
    thetas.append(X[i,:,:])
    dts.append(y[i,:])
X_BD = block_diag(*[theta for theta in thetas])
y_BD = np.concatenate(dts, axis=0)

### And I create some groups
n_terms = X.shape[2]
n_datasets = X.shape[0]
gps = [(coeff_idx + n_terms * np.arange(n_datasets)).tolist() for coeff_idx in np.arange(n_terms)]
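# e.g. with n_terms = 2 and n_datasets = 2 here, gps == [[0, 2], [1, 3]]:
# group i collects column i of every dataset block in the block-diagonal design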


### Compare this
ts = time()
glr = GroupLassoCV(groups=gps, tol=1e-5).fit(X_BD, y_BD)
glr.coef_

print(time()-ts)
### will result in ~0.43s

### versus this
ts = time()
glr = GroupLassoCV(groups=gps, tol=1e-5, n_jobs=-1).fit(X_BD, y_BD)
glr.coef_

print(time()-ts)
### will result in ~0.51s

With my actual data (not the toy data here), where celer converges for a small alpha and I know the results are good, the difference in speed is much more significant.

The timings will not be significant with such a small X, as there are many overheads. Can you share a link to your real data in .npy format (saved with np.save)?
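
For reference, the .npy round trip is just the following (the array shape is a placeholder; the filename matches the one used later in this thread):

import numpy as np

theta = np.random.randn(4, 1000, 12)   # placeholder array, just to illustrate the format
np.save('kdv_theta_GT.npy', theta)     # write to disk; load back with np.load('kdv_theta_GT.npy')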

You can format your code in GitHub issues by surrounding it with triple backticks (three ` characters), like this:

some
formatted
code

OK, you should just need to copy-paste now, and you can download the data from here:
https://drive.google.com/drive/folders/18kk3VUvXO_BCJJRzrGlMQbN8K1BPgwhn?usp=sharing

Thanks a lot!

import numpy as np
from time import time
from scipy.linalg import block_diag
from sklearn.utils import shuffle
from celer import GroupLassoCV


# my data consists of 3d arrays
theta = np.load('kdv_theta_GT.npy')
dt = np.load('kdv_dt_GT.npy')

# normalizing dt and library time wise
normed_dt = dt / np.linalg.norm(dt, axis=1, keepdims=True)
normed_theta = theta / np.linalg.norm(theta, axis=1, keepdims=True)


# preparing groups
n_terms = theta.shape[2]
n_datasets = theta.shape[0]
gps = [(coeff_idx + n_terms * np.arange(n_datasets)).tolist() for coeff_idx in np.arange(n_terms)]

# preparing block diagonal matrix
thetas = []
dts    = []
for i in range(4):
    thetas.append(normed_theta[i,:,:])
    dts.append(normed_dt[i,:,0])
thetaBD = block_diag(*[theta for theta in thetas])
dtBD = np.concatenate(dts, axis=0)

# shuffling data
idx_data = np.arange(thetaBD.shape[0])
idx_ = shuffle(idx_data, random_state=0)
X_BD = thetaBD[idx_, :]
y_BD = dtBD[idx_]



# Compare this
ts = time()
glr = GroupLassoCV(groups=gps, cv=10, tol=1e-5).fit(X_BD, y_BD)
glr.coef_

print(time()-ts) # takes ~1s


# Versus this
ts = time()
glr = GroupLassoCV(groups=gps, cv=10, tol=1e-5, n_jobs=-1).fit(X_BD, y_BD)
glr.coef_

print(time()-ts) # takes ~3s

OK, I do not get the same timings as you, but it is still problematic: ~30 s both for n_jobs=1 and for n_jobs=-1. On the first run, I see all cores being used on my machine.

@agramfort running GroupLassoCV(n_jobs=1) still uses all cores on my machine, though with verbose=1 I can check that path computations happen one after the other. Weird, no?

The data is in the comment above if you want to have a look (the script takes about 2 min to run).

import numpy as np
from time import time
from scipy.linalg import block_diag
from sklearn.utils import shuffle
from celer import GroupLassoCV

# my data consists of 3d arrays
theta = np.load('kdv_theta_GT.npy')
dt = np.load('kdv_dt_GT.npy')

# normalizing dt and library time wise
normed_dt = dt/np.linalg.norm(dt, axis=1, keepdims=True)
normed_theta = theta/np.linalg.norm(theta, axis=1, keepdims=True)


# preparing groups
n_terms = theta.shape[2]
n_datasets = theta.shape[0]
gps = [(coeff_idx + n_terms * np.arange(n_datasets)).tolist()
       for coeff_idx in np.arange(n_terms)]

# preparing block diagonal matrix
thetas = []
dts = []
for i in range(4):
    thetas.append(normed_theta[i, :, :])
    dts.append(normed_dt[i, :, 0])
thetaBD = block_diag(*[theta for theta in thetas])
dtBD = np.concatenate(dts, axis=0)

# shuffling data
idx_data = np.arange(thetaBD.shape[0])
idx_ = shuffle(idx_data, random_state=0)
X_BD = thetaBD[idx_, :]
y_BD = dtBD[idx_, ]


cv = 4
tol = 1e-5
verbose = 0   # put 1 to check that computation is run in parallel

# Compare this
ts = time()
glr = GroupLassoCV(groups=gps, verbose=verbose, n_jobs=1, cv=cv, tol=tol).fit(X_BD, y_BD)
glr.coef_
t_jobs1 = time()-ts
print(f"time n_jobs=1 {time()-ts} s")  # takes ~30s


# Versus this
ts = time()
glr = GroupLassoCV(groups=gps, cv=cv, tol=tol,
                   n_jobs=-1, verbose=1).fit(X_BD, y_BD)
glr.coef_
t_jobsall = time()-ts
print(f"time n_jobs=-1 {time()-ts} s")  # takes ~30s too

I believe this is due to BLAS using multiple threads by default in operations such as dot products.

This can be disabled with the fix mentioned in scikit-learn-contrib/skglm#48.

This needs to be done before numpy is loaded, so it is usually best done in the .bashrc.
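
For reference, a sketch of the usual workaround from Python (the exact fix is described in the linked issue; the equivalent export VAR=1 lines can go in your .bashrc instead):

import os

# cap BLAS/OpenMP thread pools *before* numpy is imported
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # imported only after the thread caps are set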