dswah/pyGAM

Estimated function is not consistent with data

robsonucl opened this issue · 1 comments

Hi,
I found this weird bug when plotting the estimated curve against the observed data. Any idea why this is happening? I only adapted your code to include my data (matrix X). Thank you!

`for i, term in enumerate(gam.terms):
    if term.isintercept:
        continue

    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)

    plt.figure()
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')
    plt.scatter(X.iloc[:, i], y, facecolor='gray', edgecolors='none')

    plt.title(repr(term))
    plt.show()`

image

dswah commented

Ah wow this is a weird case indeed!
But at first glance it seems to me that the model has produced a good representation of the data. [1]

Can you tell me what you were expecting?

Also, could you share some details about the model setup?

  • distribution (normal, gamma, poisson...)
  • link function (identity, log, etc.)
  • lam smoothing value and any other penalties

And then share a histogram of your Y-data / dependent variable?

[1] my interpretation:

  • It is hard to tell from the plot (due to alpha/transparency), but it seems like there is a lot of density close to S2=0 at least for X2<40.
  • It seems like the model has correctly identified this pattern (although with an offset) and has set the response of S2 to a pretty constant and low number.
  • At X2 > 40 there is little data, and the few samples there have a large response, so the model uses its flexibility to assign larger S2 values in that region.

I see some undesirable qualities of this model:

  1. the "ringing" towards the right side: in order to assign a high response to the single point at X2=45, the model has to compensate by giving the neighboring splines an overly negative response.
  2. the obviously different model response between X2 < 40 (mostly 0) and X2 > 40 (large and positive).
  • To deal with point 1, I think we could add more flexibility to the model (more splines) so that the neighboring splines have more freedom to react independently, without needing to correct for the extreme values of their neighbors. You could also increase the lam smoothness penalty for this dimension in order to enforce more similar values (so that one isn't large and positive while its neighbor is large and negative).
  • To deal with point 2, you could simplify the model by changing the response of X2 from spline to linear, or alternatively keep spline functions for this dimension while increasing the smoothness penalty.

Let me know what you think!