SLEP018 Pandas output for transformers with set_output #68
@@ -21,6 +21,7 @@

   slep012/proposal
   slep013/proposal
   slep018/proposal

.. toctree::
   :maxdepth: 1

@@ -0,0 +1,136 @@

.. _slep_018:

=======================================================
SLEP018: Pandas Output for Transformers with set_output
=======================================================

:Author: Thomas J. Fan
:Status: Draft
:Type: Standards Track
:Created: 2022-06-22

Abstract
--------

This SLEP proposes a ``set_output`` method to configure the output data
container of scikit-learn transformers.

Detailed description
--------------------

Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse
matrices. This SLEP proposes adding a ``set_output`` method to configure a
transformer to output pandas DataFrames::

    scaler = StandardScaler().set_output(transform="pandas")
    scaler.fit(X_df)

    # X_trans_df is a pandas DataFrame
    X_trans_df = scaler.transform(X_df)

The index of the output DataFrame must match the index of the input. If the
transformer does not support ``transform="pandas"``, then it must raise a
``ValueError`` stating that it does not support the feature.

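For illustration, a minimal sketch of the proposed behaviour, assuming the
``set_output`` API is available on ``StandardScaler`` (the data and index
values below are made up)::

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Input with a non-default index.
    X_df = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[10, 20, 30])

    scaler = StandardScaler().set_output(transform="pandas")
    X_trans_df = scaler.fit_transform(X_df)

    # The output DataFrame carries over the input's index.
    assert list(X_trans_df.index) == [10, 20, 30]
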
This SLEP's only focus is dense data for ``set_output``. If a transformer
returns sparse data, e.g. ``OneHotEncoder(sparse=True)``, then ``transform``
will raise a ``ValueError`` if ``set_output(transform="pandas")`` is
configured. Dealing with sparse output might be the scope of another future
SLEP.

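A sketch of that error path, assuming the proposed API; the ``sparse``
parameter spelling follows the SLEP text, and the exact error message is left
to the implementation::

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    X_df = pd.DataFrame({"color": ["red", "green", "red"]})

    # Sparse output and pandas output are incompatible under this SLEP.
    ohe = OneHotEncoder(sparse=True).set_output(transform="pandas")
    ohe.fit(X_df)

    try:
        ohe.transform(X_df)
    except ValueError as exc:
        print(exc)
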
For a pipeline, calling ``set_output`` on the pipeline will configure all steps
in the pipeline::

    num_preprocessor = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
    num_preprocessor.set_output(transform="pandas")

    # X_trans_df is a pandas DataFrame
    X_trans_df = num_preprocessor.fit_transform(X_df)

    # X_trans_df is again a pandas DataFrame
    X_trans_df = num_preprocessor[0].transform(X_df)

Meta-estimators that support ``set_output`` are required to configure all inner
transformers by calling ``set_output``. Specifically, all fitted and non-fitted
inner transformers must be configured with ``set_output``. This enables
``transform``'s output to be a DataFrame before and after the meta-estimator is
fitted. If an inner transformer does not define ``set_output``, then an error is
raised.

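For example, a hedged sketch with ``ColumnTransformer`` acting as the
meta-estimator, assuming it adopts this requirement and that the ``sparse``
spelling from the SLEP is accepted by the encoder (the data is made up)::

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X_df = pd.DataFrame({"age": [10.0, 20.0, 30.0], "color": ["r", "g", "b"]})

    # ``sparse=False`` keeps the encoder's output dense so that pandas
    # output is possible.
    ct = ColumnTransformer(
        [
            ("num", StandardScaler(), ["age"]),
            ("cat", OneHotEncoder(sparse=False), ["color"]),
        ]
    )

    # Configuring the meta-estimator configures both inner transformers.
    ct.set_output(transform="pandas")
    X_trans_df = ct.fit_transform(X_df)  # a pandas DataFrame
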
Global Configuration
....................

For ease of use, this SLEP proposes a global configuration flag that sets the
output for all transformers::

    import sklearn
    sklearn.set_config(transform_output="pandas")

The global default configuration is ``"default"``, where the transformer
determines the output container.

The configuration can also be set locally using the ``config_context`` context
manager::

    from sklearn import config_context

    with config_context(transform_output="pandas"):
        num_preprocessor = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
        num_preprocessor.fit_transform(X_df)

The following specifies the precedence levels for the three ways to configure
the output container (illustrated by the sketch after the list):

1. Locally configure a transformer: ``transformer.set_output``
2. Context manager: ``config_context``
3. Global configuration: ``set_config``

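A minimal sketch of this precedence, assuming that an explicit
``set_output(transform="default")`` pins a transformer to its default
container regardless of the surrounding configuration::

    import pandas as pd
    from sklearn import config_context
    from sklearn.preprocessing import StandardScaler

    X_df = pd.DataFrame({"a": [1.0, 2.0, 3.0]})

    with config_context(transform_output="pandas"):
        # The local ``set_output`` call takes precedence over the
        # surrounding context manager.
        scaler = StandardScaler().set_output(transform="default")
        X_trans = scaler.fit_transform(X_df)

    # X_trans is a NumPy ndarray, not a DataFrame.
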
Implementation
--------------

A possible implementation of this SLEP is worked out in :pr:`23734`.

Backward compatibility
----------------------

There are no backward compatibility concerns, because the ``set_output`` method
is a new API. Third party transformers can opt-in to the API by defining
``set_output``.

Review discussion:

Comment: Do they have to opt-in to respect the global flag?

Reply: There is no backward compatibility concern as long as we don't change
the default of ``transform_output``.

Reply: There isn't a backward compatibility concern, but there is an issue
around whether a third party transformer should respect the global flag.
Concretely::

    sklearn.set_config(transform_output="pandas")

    # Should we require this to be a dataframe?
    third_party_transformer.transform(X_df)

I'm leaning toward letting the library decide if it wants to respect the
global configuration.

Reply: Exactly, that was the case that seemed underspecified to me. I'm OK
with leaving it up to the library, which means we won't add a common test that
this errors out if it's not supported. Not sure if it's worth adding that to
the doc as a sentence or half-sentence? Otherwise the doc doesn't really tell
third party estimator authors what they should be doing.

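As a purely hypothetical illustration of that opt-in path (the class and
attribute names below are made up and not part of the proposal), a minimal
third-party transformer defining ``set_output`` might look like this::

    import pandas as pd
    from sklearn.base import BaseEstimator

    class ThirdPartyCenterer(BaseEstimator):
        """Hypothetical transformer that opts in to ``set_output``."""

        def set_output(self, *, transform=None):
            # Remember the requested container; return self for chaining.
            self._output_container = transform
            return self

        def fit(self, X, y=None):
            self.mean_ = X.mean(axis=0)
            return self

        def transform(self, X):
            Xt = X - self.mean_
            if getattr(self, "_output_container", "default") == "pandas":
                return pd.DataFrame(Xt, index=getattr(X, "index", None))
            return Xt
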
Alternatives
------------

Alternatives to this SLEP include:

1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__
   proposes that if the input is a DataFrame, then the output is a DataFrame.
2. Prototype `#20100
   <https://github.com/scikit-learn/scikit-learn/pull/20100>`__ showcases
   ``array_out="pandas"`` in ``transform``. This API is limited because it does
   not directly support fitting a pipeline where the steps require DataFrame
   input.

Discussion
----------

Issues discussing pandas output include: `#14315
<https://github.com/scikit-learn/scikit-learn/pull/14315>`__, `#20100
<https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and `#23001
<https://github.com/scikit-learn/scikit-learn/issues/23001>`__. This SLEP
proposes configuring the output to be pandas because it is the DataFrame
library that is most widely used and requested by users. The ``set_output``
API can be extended to support additional DataFrame libraries in the future.

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/

Copyright
---------

This document has been placed in the public domain. [1]_

Review discussion on sparse output:

Comment: How is sparse data treated? If I do ``set_output(transform="pandas")``
on a ``OneHotEncoder(sparse=True)``, will it error or will it produce a dense
pandas array? How about estimators that don't have explicit dense support, like
``CountVectorizer``? For ``CountVectorizer``, erroring seems natural; for OHE
it seems strange to require the user to set the output intention in two places.

Reply: Error in both cases. I think it's strange not to error. If
``sparse=True`` and a dense pandas array is returned, then it is not consistent
with ``sparse=True``. As long as pandas + sparse is not supported, I prefer the
explicitness of configuring in two places. From a workflow point of view, the
OHE will be in a pipeline, so it will end up being one extra configuration
(``sparse=True``). This is because ``set_output`` will end up being called on
the pipeline, which configures everything (a sketch follows).

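For illustration, a hedged sketch of that workflow under the proposed API,
with the ``sparse`` spelling used in the SLEP and made-up data::

    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    X_df = pd.DataFrame({"color": ["red", "green", "red"]})

    # Dense output is requested once, on the encoder itself; the pandas
    # container is requested once, on the pipeline, which configures
    # every step.
    prep = make_pipeline(OneHotEncoder(sparse=False))
    prep.set_output(transform="pandas")

    X_trans_df = prep.fit_transform(X_df)  # a pandas DataFrame
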
Reply: You're right, in a pipeline it's not so bad. It might be worth calling
this out in the OHE documentation or in the error message? I agree it's better
to be explicit, but it might also be surprising to users who don't know that
the default is ``sparse=True``, in particular because it might be invisible if
used in a column transformer.