SLEP018 Pandas output for transformers with set_output #68
Merged (15 commits), Jul 17, 2022
1 change: 1 addition & 0 deletions index.rst
@@ -21,6 +21,7 @@

slep012/proposal
slep013/proposal
slep018/proposal

.. toctree::
:maxdepth: 1
138 changes: 138 additions & 0 deletions slep018/proposal.rst
@@ -0,0 +1,138 @@
.. _slep_018:

=======================================================
SLEP018: Pandas Output for Transformers with set_output
=======================================================

:Author: Thomas J. Fan
:Status: Draft
:Type: Standards Track
:Created: 2022-06-22

Abstract
--------

This SLEP proposes a ``set_output`` method to configure the output container of
scikit-learn transformers.

Detailed description
--------------------

Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse
matrices. This SLEP proposes adding a ``set_output`` method to configure a
transformer to output pandas DataFrames::

scaler = StandardScaler().set_output(transform="pandas")
scaler.fit(X_df)

# X_trans_df is a pandas DataFrame
X_trans_df = scaler.transform(X_df)

The index of the output DataFrame must match the index of the input. If the
transformer does not support ``transform="pandas"``, then it must raise a
``ValueError`` stating that it does not support the feature.

For this SLEP, ``set_output`` will only configure the output for dense data. If
the transformer returns sparse data, then ``transform`` will raise a
``ValueError`` when ``set_output(transform="pandas")`` is configured.

Inline review discussion:

Member: How is sparse data treated? If I do set_output(transform="pandas") on a
OneHotEncoder(sparse=True), will it error or will it produce a dense pandas
array? How about estimators that don't have explicit dense support, like
CountVectorizer? For CountVectorizer, erroring seems natural; for OHE it seems
strange to require the user to set the output intention in two places.

Member Author (thomasjpfan): Error in both cases. I think it's strange not to
error. If sparse=True and a dense pandas array is returned, then it is not
consistent with sparse=True. As long as pandas + sparse is not supported, I
prefer the explicitness of configuring it in two places. From a workflow point
of view, the OHE will be in a pipeline, so it will end up being one extra
configuration (sparse=True). This is because set_output will end up being
called on the pipeline, which configures everything:

preprocessor = ColumnTransformer([("cat", OneHotEncoder(sparse=True), ...)])
pipeline = make_pipeline(preprocessor, ...)
pipeline.set_output(transform="pandas")

Member: You're right, in a pipeline it's not so bad. It might be worth calling
this out in the OHE documentation or in the error message? I agree it's better
to be explicit, but it might also be surprising to users who don't know that
the default is sparse=True, in particular because it might be invisible if used
in a ColumnTransformer.
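
To make the dense versus sparse behavior above concrete, here is a minimal
sketch of the intended usage; the input column and the ``sparse`` flag values
are illustrative, not part of the SLEP::

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_df = pd.DataFrame({"color": ["red", "blue", "red"]})

# Dense output: pandas output works as described above.
ohe_dense = OneHotEncoder(sparse=False).set_output(transform="pandas")
X_trans_df = ohe_dense.fit_transform(X_df)  # a pandas DataFrame

# Sparse output: transform is expected to raise a ValueError under this SLEP.
ohe_sparse = OneHotEncoder(sparse=True).set_output(transform="pandas")
ohe_sparse.fit(X_df)
ohe_sparse.transform(X_df)  # ValueError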

For a pipeline, calling ``set_output`` on the pipeline will configure all steps
in the pipeline::

num_preprocessor = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
num_preprocessor.set_output(transform="pandas")

# X_trans_df is a pandas DataFrame
X_trans_df = num_preprocessor.fit_transform(X_df)

Meta-estimators that support ``set_output`` are required to configure all inner
transformers by calling ``set_output``. If an inner transformer does not define
``set_output``, then an error is raised.
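
For example, a sketch of how this is expected to work with ``ColumnTransformer``
(the column names and transformer choices are illustrative)::

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X_df = pd.DataFrame(
    {"age": [20, 35, 61], "fare": [7.5, 80.0, 30.0], "sex": ["f", "m", "f"]}
)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "fare"]),
    ("cat", OneHotEncoder(sparse=False), ["sex"]),
])

# set_output is forwarded to the inner StandardScaler and OneHotEncoder
preprocessor.set_output(transform="pandas")
X_trans_df = preprocessor.fit_transform(X_df)  # a pandas DataFrame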

Global Configuration
....................

This SLEP proposes a global configuration flag that sets the output for all
transformers::

import sklearn
sklearn.set_config(transform_output="pandas")

The global default configuration is ``"default"`` where the transformer
determines the output container.
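
A short sketch of the expected effect, assuming the global flag behaves like
the per-estimator ``set_output`` described above (the example data is
illustrative)::

import pandas as pd
import sklearn
from sklearn.preprocessing import StandardScaler

sklearn.set_config(transform_output="pandas")

X_df = pd.DataFrame({"height": [1.7, 1.6, 1.8], "weight": [60.0, 55.0, 80.0]})

# With the global flag set, transform returns a DataFrame without calling
# set_output on the individual estimator.
scaler = StandardScaler().fit(X_df)
X_trans_df = scaler.transform(X_df)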

Implementation
--------------

The implementation of this SLEP is in :pr:`23734`.

Backward compatibility
----------------------

There are no backward compatibility concerns, because the ``set_output`` method
is a new API. Third party transformers can opt in to the API by defining
``set_output``.

Inline review discussion:

Member: Do they have to opt in to respect the global flag?

Member: There is no backward compatibility concern as long as we don't change
the default of sklearn.set_config(transform_output=XXX), correct?

Member Author (thomasjpfan, Jul 5, 2022): There isn't a backward compatibility
concern, but there is an issue around whether a third party transformer should
respect the global flag. Concretely:

sklearn.set_config(transform_output="pandas")

# Should we require this to be a dataframe?
third_party_transformer.transform(X_df)

I'm leaning toward letting the library decide if it wants to respect the global
configuration.

Member: Exactly, that was the case that seemed underspecified to me. I'm OK
with leaving it up to the library, which means we won't add a common test that
errors out if it's not supported. Not sure if it's worth adding that to the doc
as a sentence or half-sentence? Otherwise the doc doesn't really tell third
party estimator authors what they should be doing.
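
As an illustration only, the class below sketches one way a third party
transformer could opt in; all names are hypothetical, the SLEP does not
prescribe an implementation, and (per the discussion above) whether to respect
the global flag is left to the library. DataFrame input is assumed::

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ThirdPartyCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical third party transformer that opts in to set_output."""

    def set_output(self, *, transform=None):
        # Remember the requested container; "default" keeps the ndarray output.
        self._output_container = transform
        return self

    def fit(self, X, y=None):
        # Assumes X is a pandas DataFrame.
        self.feature_names_in_ = list(X.columns)
        self.mean_ = X.to_numpy().mean(axis=0)
        return self

    def transform(self, X):
        Xt = X.to_numpy() - self.mean_
        if getattr(self, "_output_container", "default") == "pandas":
            return pd.DataFrame(Xt, columns=self.feature_names_in_, index=X.index)
        return Xt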

Alternatives
------------

Alternatives to this SLEP include:

1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__
proposes that if the input is a DataFrame then the output is a DataFrame.
2. :ref:`SLEP012 <slep_012>` proposes a custom scikit-learn container for dense
and sparse data that contains feature names. This SLEP also proposes a custom
container for sparse data, but pandas for dense data.
3. Prototype `#20100
<https://github.com/scikit-learn/scikit-learn/pull/20100>`__ showcases
``array_out="pandas"`` in ``transform``. This API is limited because it does not
directly support fitting a pipeline whose steps require DataFrame input.

Member (on alternative 2): The sparse data is now in future work, right?

Discussion
----------

A list of issues and pull requests discussing pandas output: `#14315
<https://github.com/scikit-learn/scikit-learn/pull/14315>`__, `#20100
<https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and `#23001
<https://github.com/scikit-learn/scikit-learn/issues/23001>`__.

Future Extensions
-----------------

Sparse Data
...........

The Pandas DataFrame is not suitable to provide column names for sparse data
because it has performance issues as shown in `#16772
<https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097>`__.
A future extension to this SLEP is to have a ``"pandas_or_namedsparse"`` option.
This option will use a scikit-learn specific sparse container that subclasses
SciPy's sparse matrices. This sparse container includes the sparse data, feature
names and index. This enables pipelines with Vectorizers without performance
issues::

pipe = make_pipeline(
CountVectorizer(),
TfidfTransformer(),
LogisticRegression(solver="liblinear")
)
pipe.set_output(transform="pandas_or_namedsparse")

# feature names for logistic regression
pipe[-1].feature_names_in_
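
A purely hypothetical sketch of what such a named sparse container could look
like; none of these names exist in scikit-learn, and the actual design is left
to future work::

import numpy as np
from scipy.sparse import csr_matrix

class NamedSparse(csr_matrix):
    """Hypothetical sparse container carrying feature names and an index."""

    def __init__(self, arg1, feature_names=None, index=None, **kwargs):
        super().__init__(arg1, **kwargs)
        self.feature_names = feature_names
        self.index = index

X_sparse = NamedSparse(
    np.array([[1, 0, 2], [0, 3, 0]]),
    feature_names=["and", "cat", "dog"],
    index=[0, 1],
)
X_sparse.feature_names  # ['and', 'cat', 'dog']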

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
domain (see this SLEP as an example) or licensed under the `Open Publication
License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_