glm-tools/pyglmnet

pyglmnet returns only nan as coefficients (beta_ and beta0_)

erik-jan-climate opened this issue ยท 23 comments

Python 3.7.3 (default, Mar 27 2019, 22:11:17)

Erik Phone +49one51twotwo887293

All code is run in ipython

I followed this:
https://glm-tools.github.io/pyglmnet/auto_examples/plot_group_lasso.html#sphx-glr-auto-examples-plot-group-lasso-py

bash:
./pip install pyglmnet

#in ipython
from pyglmnet import GLMCV

#Input data X and y are stored in Pandas Dataframes which I then convert (see below):
X=
T850__mean T_2M__mean TD_2M__mean
0 275.155975 277.692291 276.955231
1 278.189697 284.367920 281.938110
2 270.829437 276.190979 272.350861
3 270.744324 277.318665 273.140411
4 273.041565 277.693665 274.393768
5 268.639587 274.262177 272.898468
6 269.230347 274.948334 271.464417
7 268.686707 276.265198 274.291565
8 267.068115 275.395508 273.623657
9 268.885742 277.743134 276.271698
10 277.509430 277.701263 275.212555
11 275.660004 279.134888 278.243286
12 266.246857 275.520111 273.073242
13 266.267181 274.186127 272.268036
14 265.423065 272.023621 269.310791
15 267.223267 271.640320 267.889038

y=
obs
0 1.0
1 1.0
2 1.0
3 1.0
4 2.5
5 1.8
6 1.0
7 1.1
8 1.0
9 1.0
10 1.0
11 1.0
12 1.3
13 1.0
14 1.0
15 1.0

glm = GLMCV(distr="binomial", tol=1e-3,
score_metric="pseudo_R2",
alpha=1.0, learning_rate=3, max_iter=100, cv=3, verbose=True)

X.to_numpy()=
array([[275.15598, 277.6923 , 276.95523],
[278.1897 , 284.36792, 281.9381 ],
[270.82944, 276.19098, 272.35086],
[270.74432, 277.31866, 273.1404 ],
[273.04156, 277.69366, 274.39377],
[268.6396 , 274.26218, 272.89847],
[269.23035, 274.94833, 271.46442],
[268.6867 , 276.2652 , 274.29156],
[267.0681 , 275.3955 , 273.62366],
[268.88574, 277.74313, 276.2717 ],
[277.50943, 277.70126, 275.21255],
[275.66 , 279.1349 , 278.2433 ],
[266.24686, 275.5201 , 273.07324],
[266.26718, 274.18613, 272.26804],
[265.42307, 272.02362, 269.3108 ],
[267.22327, 271.64032, 267.88904]], dtype=float32)

y.to_numpy().flatten()=
array([1. , 1. , 1. , 1. , 2.5, 1.8, 1. , 1.1, 1. , 1. , 1. , 1. , 1.3,
1. , 1. , 1. ])

g=glm.fit(X.to_numpy(),y.to_numpy().flatten())

ipython:
In [35]: g.beta_
Out[35]: array([nan, nan, nan])

Problem:
nan everywhere. What is my mistake?

So far, I have increased input data up to 7000 x 10 values for X. and 7000 for y. Nothing changed. Still nan everywhere.

It's a bit hard to help based on this. Could you perhaps provide a full script that reproduces the problem and share a small part of your data along with it? thanks

To jasmainak:
Can I call you? I have shared a small part of the data already. It is stored in MB-to-GB-large binary (netcdfs).

Here is a more minimal version of what causes the nans:
X and y are pandas Dataframes, X.shape= (9,2), y.shape()=(9,1)

X=   
  T850__mean  T_2M__mean
0  275.155975  277.692291
1  278.189697  284.367920
2  270.829437  276.190979
3  270.744324  277.318665
4  273.041565  277.693665
5  268.639587  274.262177
6  269.230347  274.948334
7  268.686707  276.265198
8  267.068115  275.395508

y=
   obs
0  0.0
1  0.0
2  0.0
3  0.0
4  1.5
5  0.8
6  0.0
7  0.1
8  0.0

I copied the glm setup (next line) from: https://glm-tools.github.io/pyglmnet/auto_examples/plot_group_lasso.html#sphx-glr-auto-examples-plot-group-lasso-py

glm = GLMCV(distr="gamma", tol=1e-3, 
            score_metric="pseudo_R2", alpha=1.0,  
            learning_rate=3, max_iter=100, cv=3, verbose=True)

g = glm.fit(X.to_numpy(),y.to_numpy().flatten())

g.beta_ 
returns: 
array([nan, nan])

In fact, I am rather sure that one can simply make up any 10 numbers and store them in X. And any random numbers for y. After fitting, the coefficients will be nans.

As I said above, please provide a full reproducible script that I can copy-paste. Happy to take it from there :) Thanks!

Here you are:

import numpy as np 
import pandas as pd 
from pyglmnet import GLMCV
glm = GLMCV(distr="gamma", tol=1e-3, 
            score_metric="pseudo_R2",
            alpha=1.0, learning_rate=3, max_iter=100, cv=3, verbose=True)

X = pd.DataFrame(np.random.randint(0, 100, size=(6, 2)), columns=list('AB'))
y = pd.DataFrame(np.random.randint(0, 10, size=(6, 1)), columns=list('y'))
dum2 = glm.fit(X.to_numpy(), y.to_numpy().flatten())
dum2.beta_

To jasmainak:
Is this script now something you can work with?

You have set the learning rate too high, set it to 1e-8 and it should work

I did the same with my real data (4 sample predictors) and got:

print(g.beta_) 
[0. 0. 0. 0.]
print(g.beta0_)                                                                      
0.06673483228915175

My ytrain is always small and equal (90%) or larger(10%) zero: eg.[ 0., 0., 0., 0., 0., 0., 0.01, 0., 0.3, 0, 0, 0.15, 0, 0, 0, ...]

The result beta_ = [0. 0. 0. 0.] seems to be wrong. I expect them to be very small but different from zero, for example [0.0000247, 0.004, 0.001, 0.006]

I played a bit by providing more predictors or more values per predictor and still got these return values.

Before preparing my next minimal example with real data: is it possible that the beta_ are truncated when printing them? Once, I also got [-0. -0. -0. -0.] which is why I wonder how I can display all (possibly cut off?) digits of these coefficients?

Download this file first and extract it in your python working dir:
py.zip

import sys
import numpy as np
import pickle
import pandas as pd
import datetime # Python standard library datetime module
import random 
import copy
from pyglmnet import GLMCV
from pyglmnet.datasets import fetch_group_lasso_datasets
from IPython.core.interactiveshell import InteractiveShell


pd.set_option('display.max_rows', 1000) #_pandas: Print 1000 rows, do not truncate
np.set_printoptions(threshold=np.inf)
np.set_printoptions(precision=16)
np.set_printoptions(suppress=False)
#np.set_printoptions(edgeitems=3,infstr='inf', linewidth=75, nanstr='nan', precision=8, suppress=False, threshold=1000, formatter=None)

InteractiveShell.ast_node_interactivity = "all"  #_output all called vars, not only the one that was called as the last


X = pickle.load( open( "X.pickle", "rb" ) )
y = pickle.load( open( "y.pickle", "rb" ) )


glm = GLMCV(distr="gamma", tol=1e-3, score_metric="pseudo_R2", alpha=1.0, learning_rate=1e-8, max_iter=100, cv=3, verbose=True)

g = glm.fit(X.to_numpy(),y.to_numpy().flatten())
g.beta_
g.beta0_

The shape of beta depends on the shape of your input data. There is no truncation per se, but if you have too many values in an array, it may not display them. This does not appear to be the case for you.

I edited your post above so it's formatted as a python script. However, I can't run it by copy-pasting. If you can do that, I can help you further :)

I can run it.
Up to which line can you run it by copy-pasting?
You need to download the zip and extract it to then pickle.load the input values.
As this input data is large (and will be larger soon), a pickle file is the best way I know to get the data to you. If there is a better way to automatically save the data contained in a pandas dataframe and send it to you, let me know.
I edited above py-script to make it even more clear that the download+extraction is not part of the script but need to be done first. I thought this was obvious.
:)

great, I can replicate now :) I edited your script to remove the netcdf part as it's not needed.

I need to ponder on this a bit. In the meanwhile, do you have an intuition as to why you'd expect the betas to be different from 0.?

Because all 0. means the y data can be best predicted by beta0_ alone, which is a simple constant. But the y values vary between many different values. So it cannot be a simple constant. Also, when providing 10times the amount of X and y values, all betas are still 0. I ran the model 10times, and always used different dimensions (=number of records/samples) for X and y. glmcv always returned betas = 0. As the input data X and y result from independent measurements and predictions, the data cannot be by chance so "perfect, strange" that all betas can only be zero. This would be like a 1 in 1 trillion chance to just meet this beta configuration by chance with this input data.

@erik-jan-climate i had a brief look at your data and code. a few comments to help you model this the right way.

  • 15/19 samples in y are zeros so i wouldn't be too surprised if a well tuned model tries to capture all the variance in the intercept term.
  • the columns in X are of vastly different scales. i would recommend z-scaling the data. you can do this with sklearn.preprocessing.StandardScaler before fitting a model.
  • you are using alpha=1.0 which is equivalent to Lasso regression (L1 penalty). i would not advise this with only 4 predictors in X. Lasso should be used when you have a large number of predictors and you expect many of them not to have any predictive power.

i did a sanity check that if i model your data with linear regression, GLM and sklearn.linear_model.LinearRegression gives equivalent results.

try this:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from pyglmnet import GLM

# Scale the data
sc = StandardScaler()
Xsc = sc.fit_transform(X.to_numpy())
ysc = y.to_numpy().flatten()

# Fit a Gaussian GLM without regularization
glm = GLM(distr="gaussian",
                    tol=1e-5,
                    score_metric="pseudo_R2",
                    alpha=0.,
                    reg_lambda=0.,
                    learning_rate=2e-1,
                    max_iter=1000, verbose=True)
glm.fit(Xsc, ysc)

# Fit a linear regression model (should produce equivalent results as above)
model = LinearRegression()
model.fit(Xsc, ysc)

# Print the model coefficients
print(glm.beta0_, glm.beta_)
print(model.intercept_, model.coef_)

if you'd like to model your data with a Gamma distribution, feel free to try but make sure you play around with reg_lambda, alpha, max_iter, tol, learning_rate. thanks for your question and happy data science!

@jasmainak one learning for us from here is that i had to increase max_iter and decrease tol wrt the defaults in order to get equivalent results to sklearn's LinearRegression. perhaps we can revisit the defaults and incorporate those into a test.

you are using alpha=1.0 which is equivalent to Lasso regression (L1 penalty). i would not advise this with only 4 predictors in X. Lasso should be used when you have a large number of predictors and you expect many of them not to have any predictive power.

This is an important point. When I tried with alpha=0.0, I did not get betas which are all 0. @pavanramkumar I think we should document clearly what alpha is. I always have to look at the code to know if alpha=0.0 is L1 or L2 ...

First of all: Thanks to both of you.

Concerning pavanramkumar answer's point 1) and 3) :
It was just an extremely minimal sample. In fact, I like to run it on 150predictors and about 100MB+ of input data.

I will consider your point 2).

Just to be on the same page, I like to give you now more data because it seems relevant to not be minimal anymore:

Download this file first and extract it in your python working dir:
glmnetPy.zip

It is still just 10KB. (10predictors, 218steps, still very minimal). If your answers depend on the amount of input data, I can give you even more input data.
I still get nans for this (or zeros).

Can you reproduce? Any ideas? (apart from pavanramkumar-point (2), which I will still try)

@pavanramkumar @jasmainak
All 3 points are now incorporated in the script: More data, standardize, less zeros in y. However, it still returns nans oder 0.

sc = StandardScaler()
Xsc = sc.fit_transform(X.to_numpy())
ysc = y.to_numpy().flatten()

Any idea is appreciated. If you think this can be tested/improved better with more input data, please let me know.

@titipata do you have time to investigate what is going on?

@jasmainak I will try to investigate by the end of this week.

hi everyone! I'm trying out your library and having similar issues trying to fit a regularized Gamma GLM. My predictions are all nan most of the time, I'm trying out different settings for the learning_rate, tol and iterations, but it seems tough to get something that makes sense. Any tips on this one? I'm going to check whether similar libraries in R also suffer from this issue with my dataset. Thanks for any advice!

ummm ... do you mind sharing your data and a small snippet of your code?