mgcv vs pygam results
ilkot opened this issue · 5 comments
It is more than a question than an issue.
I'm trying to get same results with mgcv's and pygam's gam models but results are quite different.
here is the data with 500 observations
https://file.io/0gSYwb0hinBI
R:
example_r = read_csv("path/to/data/example_gam.csv")
gam_ex <- gam(y ~ x1+ s(x2,x3,x4,x5,x6),
family = gaussian,
method = "GCV.Cp",
data=example_r)
x1_s = 0
x2_s = -0.46232985
x3_s = 1.5689428
x4_s = -3.852407
x5_s = 8.378359
x6_s = 92.91196
sample <- data.frame(c(x1_s),c(x2_s),c(x3_s),c(x4_s),c(x5_s),c(x6_s))
names(sample) <- c('x1','x2','x3','x4','x5','x6')
predict(gam_ex, sample)
Output: -0.005903112
Python:
example_py = pd.read_csv("path/to/data/example_gam.csv")
X = example_py.drop(['y'],axis=1)
y = example_py['y']
gam = LinearGAM()
gam.fit(X,y)
x1_s = 0
x2_s = -0.46232985
x3_s = 1.5689428
x4_s = -3.852407
x5_s = 8.378359
x6_s = 92.91196
sample = pd.DataFrame(data = [{'x1':x1_s,'x2':x2_s,'x3':x3_s,'x4':x4_s,'x5':x5_s,'x6':x6_s}])
gam.predict(sample)
Output: 0.08549611
I'm suspicious about the formula in R because it is written as x1 + s(x2,...) but in pygam do we have a chance to write like that?
Any comment would be helpful!
Thanks in advance!
Hi @ilkot
Indeed, i also suspect that differences in the model specification could be the main cause of the discrepancies.
I am not very familiar with the MGCV syntax but i will try to answer.
I think the differences could come from:
- model specification
-
- your mgcv model
x1+ s(x2,x3,x4,x5,x6)
appears to have a linear term added to a nonlinear function (interaction?) with several terms
- your mgcv model
-
- in contrast your pyGAM model is fitting a smoothing function to each dimension of your X data.
-
- can you elaborate on what the
s(x2, x3, ...)
is doing?
- can you elaborate on what the
-
- if it is indeed an interaction term, you can achieve this with:
from pygam import l, te
LinearGAM(terms=l(0) + te(1, 2, 3, 4, 5))
- mgcv appears to be doing a model-selection/smoothing-selection step (method="GCV.cp")
-
- but your pyGAM fit is only fitting a single model with the default smoothing. You could try instead to search for a more optimal smoothing parameter:
import numpy as np
# set up a search-space
lam = np.logspace(-3, 5, 50)
lams = [lam] * 2 # here you are specifyng a l2 reg. for your linear term, and a shared smoothing for all dims. of your tensor term
gam.gridsearch(X, y, lam=lams)
gam.summary()
- mgcv might use a different default number of smoothing splines.
-
- pyGAM uses 20 splines for each nonlinear dimension by default
Let me know if this helps!
Thanks for the detailed answer, appreciate it!
s(x2,x3,..) is the geodimensional s function as shown below for 2 variables and yes it is an interaction term
I tried to fit as you suggest but it throws memory error which is quite interesting because dataset only contains only 500 rows and 7 columns in total
pygam\terms.py", line 1318, in build_penalties
P = sp.sparse.csc_matrix(np.zeros((self.n_coefs, self.n_coefs)))
MemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type float64
I restart the kernel several times but it didn't change.
pygam: 0.8.0
python. 3.7.6