pyglmnet returns only nan as coefficients (beta_ and beta0_) #367
It's a bit hard to help based on this. Could you perhaps provide a full script that reproduces the problem and share a small part of your data along with it? Thanks! |
To jasmainak: |
Here is a more minimal version of what causes the nans: X=
T850__mean T_2M__mean
0 275.155975 277.692291
1 278.189697 284.367920
2 270.829437 276.190979
3 270.744324 277.318665
4 273.041565 277.693665
5 268.639587 274.262177
6 269.230347 274.948334
7 268.686707 276.265198
8 267.068115 275.395508
y=
obs
0 0.0
1 0.0
2 0.0
3 0.0
4 1.5
5 0.8
6 0.0
7 0.1
8 0.0
I copied the glm setup (next lines) from: https://glm-tools.github.io/pyglmnet/auto_examples/plot_group_lasso.html#sphx-glr-auto-examples-plot-group-lasso-py
glm = GLMCV(distr="gamma", tol=1e-3,
            score_metric="pseudo_R2", alpha=1.0,
            learning_rate=3, max_iter=100, cv=3, verbose=True)
g = glm.fit(X.to_numpy(),y.to_numpy().flatten())
g.beta_
returns:
array([nan, nan]) |
In fact, I am rather sure that one can simply make up any 10 numbers and store them in X. And any random numbers for y. After fitting, the coefficients will be nans. |
As I said above, please provide a full reproducible script that I can copy-paste. Happy to take it from there :) Thanks! |
Here you are:
import numpy as np
import pandas as pd
from pyglmnet import GLMCV
glm = GLMCV(distr="gamma", tol=1e-3,
score_metric="pseudo_R2",
alpha=1.0, learning_rate=3, max_iter=100, cv=3, verbose=True)
X = pd.DataFrame(np.random.randint(0, 100, size=(6, 2)), columns=list('AB'))
y = pd.DataFrame(np.random.randint(0, 10, size=(6, 1)), columns=list('y'))
dum2 = glm.fit(X.to_numpy(), y.to_numpy().flatten())
dum2.beta_ |
To jasmainak: |
You have set the learning rate too high; set it to 1e-8 and it should work. |
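For reference, a minimal sketch of that fix applied to the script above (only the learning_rate is changed from 3 to 1e-8; everything else is kept exactly as posted):
import numpy as np
import pandas as pd
from pyglmnet import GLMCV

# same random toy data as in the script above
X = pd.DataFrame(np.random.randint(0, 100, size=(6, 2)), columns=list('AB'))
y = pd.DataFrame(np.random.randint(0, 10, size=(6, 1)), columns=list('y'))

# identical setup, but with the suggested learning_rate of 1e-8 instead of 3
glm = GLMCV(distr="gamma", tol=1e-3, score_metric="pseudo_R2",
            alpha=1.0, learning_rate=1e-8, max_iter=100, cv=3, verbose=True)
g = glm.fit(X.to_numpy(), y.to_numpy().flatten())
print(g.beta0_, g.beta_)  # expected to be finite numbers rather than nan, per the suggestion above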
I did the same with my real data (4 sample predictors) and got: print(g.beta_)
[0. 0. 0. 0.]
print(g.beta0_)
0.06673483228915175
I played a bit by providing more predictors or more values per predictor and still got these return values. Before preparing my next minimal example with real data: is it possible that the beta_ array is being truncated when it is displayed? |
Download this file first and extract it in your python working dir:
import sys
import numpy as np
import pickle
import pandas as pd
import datetime # Python standard library datetime module
import random
import copy
from pyglmnet import GLMCV
from pyglmnet.datasets import fetch_group_lasso_datasets
from IPython.core.interactiveshell import InteractiveShell
pd.set_option('display.max_rows', 1000) #_pandas: Print 1000 rows, do not truncate
np.set_printoptions(threshold=np.inf)
np.set_printoptions(precision=16)
np.set_printoptions(suppress=False)
#np.set_printoptions(edgeitems=3,infstr='inf', linewidth=75, nanstr='nan', precision=8, suppress=False, threshold=1000, formatter=None)
InteractiveShell.ast_node_interactivity = "all" #_output all called vars, not only the one that was called as the last
X = pickle.load( open( "X.pickle", "rb" ) )
y = pickle.load( open( "y.pickle", "rb" ) )
glm = GLMCV(distr="gamma", tol=1e-3, score_metric="pseudo_R2", alpha=1.0, learning_rate=1e-8, max_iter=100, cv=3, verbose=True)
g = glm.fit(X.to_numpy(),y.to_numpy().flatten())
g.beta_
g.beta0_ |
The shape of beta depends on the shape of your input data. There is no truncation per se, but if you have too many values in an array, it may not display them. This does not appear to be the case for you. I edited your post above so it's formatted as a python script. However, I can't run it by copy-pasting. If you can do that, I can help you further :) |
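To illustrate the display point from the reply above (a sketch, not part of the thread): beta_ has one entry per column of X, and numpy only abbreviates the printout of long arrays, it never drops values from the array itself:
import numpy as np

beta = np.arange(2000.0)                # stand-in for a long coefficient vector
print(beta.shape)                       # (2000,) -- nothing is removed from the array
print(beta)                             # display is abbreviated with '...' beyond numpy's threshold
np.set_printoptions(threshold=np.inf)   # as in the script above: print every element
print(beta)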
I can run it. |
Great, I can replicate now :) I edited your script to remove the netcdf part as it's not needed. I need to ponder on this a bit. In the meantime, do you have an intuition as to why you'd expect the betas to be different from 0.? |
Because all 0. means the y data can be best predicted by beta0_ alone, which is a simple constant. But the y values vary between many different values, so it cannot be a simple constant. Also, when providing 10 times the amount of X and y values, all betas are still 0. I ran the model 10 times and always used different dimensions (= number of records/samples) for X and y; glmcv always returned betas = 0. As the input data X and y result from independent measurements and predictions, the data cannot, just by chance, be so "perfectly strange" that all betas can only be zero. That would be like a 1-in-1-trillion chance of hitting this beta configuration with this input data. |
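One thing worth keeping in mind (an illustration with made-up data, not a statement about your dataset): with alpha=1.0 the penalty is pure L1, and a large enough reg_lambda drives every coefficient to exactly zero no matter how informative X is, so all-zero betas do not by themselves mean y is a constant. A small sketch:
import numpy as np
from pyglmnet import GLM

rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = 2.0 + 3.0 * X[:, 0] + 0.1 * rng.rand(100)   # y clearly depends on X

# coefficients shrink toward (and eventually to) zero as reg_lambda grows
for reg_lambda in [1e-4, 1e-2, 1.0]:
    glm = GLM(distr="gaussian", alpha=1.0, reg_lambda=reg_lambda,
              learning_rate=2e-1, max_iter=1000, tol=1e-5)
    glm.fit(X, y)
    print(reg_lambda, glm.beta_)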
@erik-jan-climate i had a brief look at your data and code. a few comments to help you model this the right way.
i did a sanity check: if i model your data with an unregularized Gaussian GLM, it agrees with ordinary linear regression. try this:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from pyglmnet import GLM
# Scale the data
sc = StandardScaler()
Xsc = sc.fit_transform(X.to_numpy())
ysc = y.to_numpy().flatten()
# Fit a Gaussian GLM without regularization
glm = GLM(distr="gaussian",
tol=1e-5,
score_metric="pseudo_R2",
alpha=0.,
reg_lambda=0.,
learning_rate=2e-1,
max_iter=1000, verbose=True)
glm.fit(Xsc, ysc)
# Fit a linear regression model (should produce equivalent results as above)
model = LinearRegression()
model.fit(Xsc, ysc)
# Print the model coefficients
print(glm.beta0_, glm.beta_)
print(model.intercept_, model.coef_)
if you'd like to model your data with a Gamma distribution, feel free to try, but make sure you play around with the learning_rate and the other hyperparameters. |
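A follow-on sketch of that suggestion (assuming the Xsc and ysc from the snippet above; the learning-rate grid here is only a guess to sweep over, not a recommended value):
from pyglmnet import GLM

# Gamma GLM without regularization; the Gamma distribution assumes strictly
# positive responses, so check that ysc > 0 before fitting
for lr in [1e-4, 1e-3, 1e-2, 1e-1]:
    glm_gamma = GLM(distr="gamma", alpha=0., reg_lambda=0.,
                    learning_rate=lr, max_iter=1000, tol=1e-5,
                    score_metric="pseudo_R2")
    glm_gamma.fit(Xsc, ysc)
    print(lr, glm_gamma.beta0_, glm_gamma.beta_)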
@jasmainak one learning for us from here is that i had to increase the learning_rate substantially (2e-1 rather than 1e-8) and turn off the regularization (alpha=0., reg_lambda=0.) to get sensible coefficients. |
This is an important point. When I tried with alpha=0.0, I did not get betas which are all 0. @pavanramkumar I think we should document clearly what alpha is. I always have to look at the code to know if alpha=0.0 is L1 or L2 ... |
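For context, the usual elastic-net convention (which pyglmnet appears to follow; worth confirming against its docs) is that reg_lambda sets the overall penalty strength while alpha mixes the two norms, so alpha=1.0 is pure L1 (lasso, can zero coefficients out) and alpha=0.0 is pure L2 (shrinks but rarely zeros). A tiny sketch of that penalty:
import numpy as np

def elastic_net_penalty(beta, reg_lambda, alpha):
    # reg_lambda * (alpha * ||beta||_1 + (1 - alpha) * 0.5 * ||beta||_2^2)
    return reg_lambda * (alpha * np.sum(np.abs(beta))
                         + (1.0 - alpha) * 0.5 * np.sum(beta ** 2))

beta = np.array([0.5, -2.0])
print(elastic_net_penalty(beta, reg_lambda=0.1, alpha=1.0))  # pure L1 term
print(elastic_net_penalty(beta, reg_lambda=0.1, alpha=0.0))  # pure L2 term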
First of all: thanks to both of you. Concerning points 1) and 3) of pavanramkumar's answer: I will consider your point 2). Just to be on the same page, I would like to give you more data now, because it seems relevant not to be minimal anymore. Download this file first and extract it in your python working dir. It is still just 10 KB (10 predictors, 218 steps, still very minimal). If your answers depend on the amount of input data, I can give you even more. Can you reproduce? Any ideas? (Apart from pavanramkumar's point (2), which I will still try.) |
@pavanramkumar @jasmainak
sc = StandardScaler()
Xsc = sc.fit_transform(X.to_numpy())
ysc = y.to_numpy().flatten()
Any idea is appreciated. If you think this can be tested/improved better with more input data, please let me know. |
@titipata do you have time to investigate what is going on? |
@jasmainak I will try to investigate by the end of this week. |
hi everyone! I'm trying out your library and having similar issues trying to fit a regularized Gamma GLM. My predictions are all nan most of the time, I'm trying out different settings for the learning_rate, tol and iterations, but it seems tough to get something that makes sense. Any tips on this one? I'm going to check whether similar libraries in R also suffer from this issue with my dataset. Thanks for any advice! |
ummm ... do you mind sharing your data and a small snippet of your code? |
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
All code is run in ipython
I followed this:
https://glm-tools.github.io/pyglmnet/auto_examples/plot_group_lasso.html#sphx-glr-auto-examples-plot-group-lasso-py
bash:
./pip install pyglmnet
#in ipython
from pyglmnet import GLMCV
#Input data X and y are stored in Pandas Dataframes which I then convert (see below):
X=
T850__mean T_2M__mean TD_2M__mean
0 275.155975 277.692291 276.955231
1 278.189697 284.367920 281.938110
2 270.829437 276.190979 272.350861
3 270.744324 277.318665 273.140411
4 273.041565 277.693665 274.393768
5 268.639587 274.262177 272.898468
6 269.230347 274.948334 271.464417
7 268.686707 276.265198 274.291565
8 267.068115 275.395508 273.623657
9 268.885742 277.743134 276.271698
10 277.509430 277.701263 275.212555
11 275.660004 279.134888 278.243286
12 266.246857 275.520111 273.073242
13 266.267181 274.186127 272.268036
14 265.423065 272.023621 269.310791
15 267.223267 271.640320 267.889038
y=
obs
0 1.0
1 1.0
2 1.0
3 1.0
4 2.5
5 1.8
6 1.0
7 1.1
8 1.0
9 1.0
10 1.0
11 1.0
12 1.3
13 1.0
14 1.0
15 1.0
glm = GLMCV(distr="binomial", tol=1e-3,
score_metric="pseudo_R2",
alpha=1.0, learning_rate=3, max_iter=100, cv=3, verbose=True)
X.to_numpy()=
array([[275.15598, 277.6923 , 276.95523],
[278.1897 , 284.36792, 281.9381 ],
[270.82944, 276.19098, 272.35086],
[270.74432, 277.31866, 273.1404 ],
[273.04156, 277.69366, 274.39377],
[268.6396 , 274.26218, 272.89847],
[269.23035, 274.94833, 271.46442],
[268.6867 , 276.2652 , 274.29156],
[267.0681 , 275.3955 , 273.62366],
[268.88574, 277.74313, 276.2717 ],
[277.50943, 277.70126, 275.21255],
[275.66 , 279.1349 , 278.2433 ],
[266.24686, 275.5201 , 273.07324],
[266.26718, 274.18613, 272.26804],
[265.42307, 272.02362, 269.3108 ],
[267.22327, 271.64032, 267.88904]], dtype=float32)
y.to_numpy().flatten()=
array([1. , 1. , 1. , 1. , 2.5, 1.8, 1. , 1.1, 1. , 1. , 1. , 1. , 1.3,
1. , 1. , 1. ])
g=glm.fit(X.to_numpy(),y.to_numpy().flatten())
ipython:
In [35]: g.beta_
Out[35]: array([nan, nan, nan])
Problem:
nan everywhere. What is my mistake?
So far, I have increased the input data up to 7000 x 10 values for X and 7000 values for y. Nothing changed. Still nan everywhere.