Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SLEP018 Pandas output for transformers with set_output #68
SLEP018 Pandas output for transformers with set_output #68
Changes from 10 commits
f140b8a
e708c2a
7d1528e
1186280
ac21785
326fcbb
d91fb3a
218b76a
2d23470
eac9f9f
add7a8a
0580391
4370736
1d57415
68ada33
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is sparse data treated? If I do
set_output(transform="pandas")
on aOneHotEncoder(sparse=True)
, will it error or will it produce a dense pandas array? How about estimators that don't have an explicit dense support, like CountVectorizer?For
CountVectorizer
, erroring seems natural, for OHE it seems strange to require the user to set the output intention in two places.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error in both cases.
I think it's strange not to error. If
sparse=True
and a dense pandas array is returned, then it is not consistent withsparse=True
. As long as pandas + sparse is not supported, I prefer the explicitness of configuring in two places.From a workflow point of view, the OHE will be in a pipeline, so it will end up to be one extra configuration (
sparse=True
). This is becauseset_output
will end up being called on pipeline which configures everything:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, in a pipeline it's not so bad. It might be worth calling out in the OHE documentation or in the error message? I agree it's better to be explicit, but it might also be surprising to users who don't know that the default is
sparse=True
, in particular because it might be invisible if used in a column transformer.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do they have to opt-in to respect the global flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no backward compatibility concern as long as we don't change the default of
sklearn.set_config(transform_output=XXX)
, correct?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't a backward compatibility concern, but there is an issue around if a third party transformer should respect the global flag. Concretely:
I'm leading toward letting the library decide if it wants to respect the global configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly that was the case that seemed underspecified to me. I'm ok to leave it up to the library, which means we won't add it to the common tests that this errors out if it's not supported. Not sure if it's worth adding that to the doc as a sentence/half-sentence? Otherwise the doc doesn't really tell the third party estimator authors what they should be doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sparse data is now in future work, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.