-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added warning , if column is multi categorical in h2o.cor() #12903 #15674
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Mohit1345. It will need a bit more changes before merging. Please don't be discouraged by the requested changes, h2o's code takes some time to get used to.
if y is None: | ||
y = self | ||
if use is None: use = "complete.obs" if na_rm else "everything" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be deleted.
if y is None: | ||
y = self | ||
if use is None: use = "complete.obs" if na_rm else "everything" | ||
y_categorical = any(self.types[col_name] == "enum" for col_name in y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems incorrect - y
H2OFrame not a list of columns. Also you can use y.isfactor()
to check if it's categorical - the output is a list of boolean values in the same order as are the columns.
y_categorical = any(self.types[col_name] == "enum" for col_name in y) | ||
|
||
if y_categorical: | ||
num_unique_levels = {col: len(self[col].levels()) for col in y} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of len(self[col].levels())
use self[col].nlevels()[0]
. nlevels
returns the just a list of cardinalities so there is less communication with the backend and lower memory use in the python client. Also since the y
is an H2OFrame (see the assert on the line 3182) you can use something like dict(zip(y.columns, y.nlevels()))
to get the same thing.
|
||
if multi_categorical: | ||
import warnings | ||
warnings.warn("NA") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make the warning more informative.
For example:
for col, card in num_unique_levels.items():
if card > 2:
warnings.warn("Column {} contains {} levels. Only numerical and binary columns are supported.".format(col, card))
|
||
if ((x_categorical && length(unique(h2o.levels(x))) > 2) || (y_categorical && length(unique(h2o.levels(y))) > 2)) { | ||
warning("NA") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make the warning more informative.
Sure, will try to make changes |
Hi @tomasfryda @wendycwong , I have raised a PR for this issue, please take a look at it here |
Hi @tomasfryda @wendycwong please review this PR #16070 |
fixes #12903
It will return "NA" if multi categorical columns are passed in it.
Modified in both R and python h2o.cor() method.