SLEP018 Pandas output for transformers with set_output #68

Merged: 15 commits, Jul 17, 2022
1 change: 1 addition & 0 deletions index.rst
@@ -21,6 +21,7 @@

slep012/proposal
slep013/proposal
slep018/proposal

.. toctree::
:maxdepth: 1
136 changes: 136 additions & 0 deletions slep018/proposal.rst
@@ -0,0 +1,136 @@
.. _slep_018:

=======================================================
SLEP018: Pandas Output for Transformers with set_output
=======================================================

:Author: Thomas J. Fan
:Status: Draft
:Type: Standards Track
:Created: 2022-06-22

Abstract
--------

This SLEP proposes a ``set_output`` method to configure the output data container of
scikit-learn transformers.

Detailed description
--------------------

Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse
matrices. This SLEP proposes adding a ``set_output`` method to configure a
transformer to output pandas DataFrames::

   from sklearn.preprocessing import StandardScaler

   scaler = StandardScaler().set_output(transform="pandas")
   scaler.fit(X_df)

   # X_trans_df is a pandas DataFrame
   X_trans_df = scaler.transform(X_df)

The index of the output DataFrame must match the index of the input. If the

Review comment (Member):

How is sparse data treated? If I do set_output(transform="pandas") on a OneHotEncoder(sparse=True), will it error or will it produce a dense pandas array? How about estimators that don't have an explicit dense support, like CountVectorizer?
For CountVectorizer, erroring seems natural, for OHE it seems strange to require the user to set the output intention in two places.

Review comment (Member Author):

> If I do set_output(transform="pandas") on a OneHotEncoder(sparse=True), will it error or will it produce a dense pandas array? How about estimators that don't have an explicit dense support, like CountVectorizer?

Error in both cases.

> for OHE it seems strange to require the user to set the output intention in two places.

I think it's strange not to error. If sparse=True and a dense pandas array is returned, then it is not consistent with sparse=True. As long as pandas + sparse is not supported, I prefer the explicitness of configuring in two places.

From a workflow point of view, the OHE will be in a pipeline, so it ends up being one extra configuration (sparse=True). This is because set_output will end up being called on the pipeline, which configures everything:

    preprocessor = ColumnTransformer([("cat", OneHotEncoder(sparse=True), ...)])
    pipeline = make_pipeline(preprocessor, ...)
    pipeline.set_output(transform="pandas")

Review comment (Member):

You're right, in a pipeline it's not so bad. It might be worth calling out in the OHE documentation or in the error message? I agree it's better to be explicit, but it might also be surprising to users who don't know that the default is sparse=True, in particular because it might be invisible if used in a column transformer.

transformer does not support ``transform="pandas"``, then it must raise a
``ValueError`` stating that it does not support the feature.

This SLEP focuses only on dense data for ``set_output``. If a transformer returns
sparse data, e.g. ``OneHotEncoder(sparse=True)``, then ``transform`` will raise a
``ValueError`` when ``set_output(transform="pandas")`` has been configured.
Handling sparse output may be addressed in a future SLEP.
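
The two error cases described above (an unsupported container, and pandas
output requested for sparse data) can be sketched as a small validation helper.
This is only an illustration: ``check_output_container`` and its ``is_sparse``
flag are hypothetical names, not part of scikit-learn, and the error messages
are assumptions.

```python
def check_output_container(transform_output, is_sparse):
    """Validate a requested output container (hypothetical helper).

    transform_output mirrors the value passed to set_output(transform=...).
    is_sparse is an assumed flag saying whether the transformer produced
    sparse data.
    """
    if transform_output not in ("default", "pandas"):
        raise ValueError(
            f"transform={transform_output!r} is not a supported output container."
        )
    if transform_output == "pandas" and is_sparse:
        # Naming the conflicting setting helps users who did not realize
        # that the transformer produces sparse output.
        raise ValueError(
            "Pandas output does not support sparse data. Configure the "
            "transformer to produce dense output instead."
        )
    return transform_output
```

An explicit message of this kind would also address the concern that users may
not know a transformer is producing sparse data.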

For a pipeline, calling ``set_output`` on the pipeline will configure all steps
in the pipeline::

   from sklearn.decomposition import PCA
   from sklearn.impute import SimpleImputer
   from sklearn.pipeline import make_pipeline
   from sklearn.preprocessing import StandardScaler

   num_preprocessor = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
   num_preprocessor.set_output(transform="pandas")

   # X_trans_df is a pandas DataFrame
   X_trans_df = num_preprocessor.fit_transform(X_df)

   # X_trans_df is again a pandas DataFrame
   X_trans_df = num_preprocessor[0].transform(X_df)

Meta-estimators that support ``set_output`` are required to configure all inner
transformers by calling ``set_output``. Specifically, all fitted and non-fitted
inner transformers must be configured with ``set_output``. This enables
``transform``'s output to be a DataFrame before and after the meta-estimator is
fitted. If an inner transformer does not define ``set_output``, then an error is
raised.
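
A minimal sketch of this propagation rule, using toy classes rather than
scikit-learn's own, might look as follows. ``ToyScaler``,
``LegacyTransformer``, ``ToyPipeline`` and the ``fitted_steps_`` attribute are
hypothetical, and the choice of ``ValueError`` is an assumption, since the text
above only says that an error is raised.

```python
class ToyScaler:
    """Hypothetical inner transformer that supports set_output."""

    def __init__(self):
        self.output_ = "default"

    def set_output(self, transform=None):
        # None leaves the current configuration unchanged.
        if transform is not None:
            self.output_ = transform
        return self


class LegacyTransformer:
    """Hypothetical inner transformer that does not define set_output."""


class ToyPipeline:
    """Hypothetical meta-estimator holding a list of (name, step) pairs."""

    def __init__(self, steps):
        self.steps = steps

    def _all_inner(self):
        # Unfitted steps plus fitted copies, if this object has any.
        return list(self.steps) + list(getattr(self, "fitted_steps_", []))

    def set_output(self, transform=None):
        inner = self._all_inner()
        # Check every step first so that a failure leaves no step
        # partially configured.
        for name, step in inner:
            if not hasattr(step, "set_output"):
                raise ValueError(
                    f"Step {name!r} does not define set_output and cannot "
                    "be configured."
                )
        for _, step in inner:
            step.set_output(transform=transform)
        return self
```

Validating all steps before configuring any of them is a design choice of this
sketch, not something the SLEP requires.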


Global Configuration
....................

For ease of use, this SLEP proposes a global configuration flag that sets the output for all
transformers::

   import sklearn
   sklearn.set_config(transform_output="pandas")

The global default configuration is ``"default"`` where the transformer
determines the output container.

The configuration can also be set locally using the ``config_context`` context
manager::

   from sklearn import config_context
   with config_context(transform_output="pandas"):
       num_preprocessor = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
       num_preprocessor.fit_transform(X_df)

The three ways to configure the output container are listed below in order of
precedence, from highest to lowest:

1. Locally configure a transformer: ``transformer.set_output``
2. Context manager: ``config_context``
3. Global configuration: ``set_config``
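
The precedence rules above can be illustrated with a small, self-contained
sketch. Everything here is a toy stand-in: ``_global_config``,
``ToyTransformer`` and ``resolved_output`` are hypothetical, and the
``set_config`` and ``config_context`` functions below only mimic the
scikit-learn functions of the same name.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the library-wide configuration.
_global_config = {"transform_output": "default"}


def set_config(transform_output):
    """Set the global default output container."""
    _global_config["transform_output"] = transform_output


@contextmanager
def config_context(transform_output):
    """Temporarily override the global setting, restoring it on exit."""
    previous = _global_config["transform_output"]
    _global_config["transform_output"] = transform_output
    try:
        yield
    finally:
        _global_config["transform_output"] = previous


class ToyTransformer:
    """Hypothetical transformer that only tracks its output setting."""

    def __init__(self):
        self._local_output = None  # None means "not configured locally"

    def set_output(self, transform=None):
        self._local_output = transform
        return self

    def resolved_output(self):
        # A local set_output call wins. Otherwise use the active
        # configuration, which a config_context may have overridden.
        if self._local_output is not None:
            return self._local_output
        return _global_config["transform_output"]
```

In this sketch the context manager outranks the global setting simply because
it overwrites the value for the duration of the ``with`` block.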

Implementation
--------------

A possible implementation of this SLEP is worked out in :pr:`23734`.

Backward compatibility
----------------------

There are no backward compatibility concerns, because the ``set_output`` method
is a new API. Third-party transformers can opt in to the API by defining

Review comment (Member):

Do they have to opt-in to respect the global flag?

Review comment (Member):

There is no backward compatibility concern as long as we don't change the default of sklearn.set_config(transform_output=XXX), correct?

Review comment (Member Author, @thomasjpfan, Jul 5, 2022):

There isn't a backward compatibility concern, but there is a question about whether a third-party transformer should respect the global flag. Concretely:

    sklearn.set_config(transform_output="pandas")

    # Should we require this to be a dataframe?
    third_party_transformer.transform(X_df)

I'm leaning toward letting the library decide if it wants to respect the global configuration.

Review comment (Member):

Exactly that was the case that seemed underspecified to me. I'm ok to leave it up to the library, which means we won't add it to the common tests that this errors out if it's not supported. Not sure if it's worth adding that to the doc as a sentence/half-sentence? Otherwise the doc doesn't really tell the third party estimator authors what they should be doing.

``set_output``.
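
As a rough illustration of opting in, a third-party transformer could define
``set_output`` itself. The class below is a hypothetical toy, not a
scikit-learn API: ``ThirdPartyDoubler``, its ``_transform_output`` attribute
and the plain-list default output are assumptions made for the sketch. It
handles only the local configuration; whether to honor the global flag is left
to the library, as discussed in the review comments above.

```python
class ThirdPartyDoubler:
    """Hypothetical third-party transformer that opts in to set_output."""

    _SUPPORTED = ("default", "pandas")

    def __init__(self):
        self._transform_output = "default"

    def set_output(self, transform=None):
        # None leaves the current configuration unchanged.
        if transform is None:
            return self
        if transform not in self._SUPPORTED:
            raise ValueError(f"transform={transform!r} is not supported.")
        self._transform_output = transform
        return self

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Accept either a DataFrame or a plain list of rows.
        is_frame = hasattr(X, "values") and hasattr(X, "columns")
        rows = X.values.tolist() if is_frame else [list(row) for row in X]
        doubled = [[2 * value for value in row] for row in rows]
        if self._transform_output == "default":
            return doubled
        import pandas as pd  # imported lazily so pandas stays optional

        # Reuse the input index so the output index matches the input.
        return pd.DataFrame(
            doubled,
            index=X.index if is_frame else None,
            columns=X.columns if is_frame else None,
        )
```

Returning ``self`` from ``set_output`` keeps the method chainable, as in the
``StandardScaler().set_output(transform="pandas")`` example earlier.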

Alternatives
------------

Alternatives to this SLEP include:

1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__
   proposes that if the input is a DataFrame, then the output is a DataFrame.
2. Prototype `#20100
   <https://github.com/scikit-learn/scikit-learn/pull/20100>`__ showcases
   ``array_out="pandas"`` in ``transform``. This API is limited because it does
   not directly support fitting on a pipeline where the steps require DataFrame
   input.

Discussion
----------

Issues and pull requests discussing pandas output include `#14315
<https://github.com/scikit-learn/scikit-learn/pull/14315>`__, `#20100
<https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and `#23001
<https://github.com/scikit-learn/scikit-learn/issues/23001>`__. This SLEP
proposes configuring the output to be pandas because it is the DataFrame library
that is most widely used and requested by users. The ``set_output`` API can be
extended to support additional DataFrame libraries in the future.

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open Publication
   License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_