
SLEP018 Pandas output for transformers with set_output #68

Merged · 15 commits · Jul 17, 2022

Conversation

@thomasjpfan (Member) commented May 26, 2022

This SLEP proposes a set_output API used to configure the output of transform. The overall idea is to use set_output(transform="pandas_or_sparse"), which outputs a pandas dataframe for dense data and a scipy sparse matrix for sparse data.

Use cases

I put together a functional prototype of this API that you can explore in this colab notebook. Here is a rendered version of the demo. The demo includes the following use cases:

  • DataFrame output from a Single transformer
  • Column Transformer with DataFrame output
  • Feature selection based on column names with cross validation
  • Using HistGradientBoosting to select categories based on dtype
  • Text preprocessing with sparse data

Future Extensions

The Pandas DataFrame is not suitable to provide column names for sparse data because it has performance issues, as shown in #16772. A future extension to this SLEP is a "pandas_or_namedsparse" option. This option would use a scikit-learn specific sparse container that subclasses SciPy's sparse matrices and carries the sparse data, feature names, and index. This enables pipelines with vectorizers without performance issues:

pipe = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    LogisticRegression(solver="liblinear"),
)
pipe.set_output(transform="pandas_or_namedsparse")

# feature names for logistic regression
pipe[-1].feature_names_in_

CC @amueller @glemaitre @lorentzenchr

@ogrisel (Member) commented May 30, 2022

If we introduce new (internal) containers to add named columns to sparse matrices, we could even have a container holding several column blocks of different types (e.g. pandas dataframes and sparse matrices, possibly with different dtypes), for instance to store the output of a column transformer without any a priori data conversion. This would allow the downstream estimator in the pipeline to materialize this input in an optimal way.

Some column-wise estimators, such as coordinate-descent linear models and tree models, could even accept data with a mixed column-block representation.

@thomasjpfan (Member, Author) commented May 30, 2022

I think it's possible to have a new custom container that holds several blocks. It could also adopt the dataframe interchange protocol, which allows the custom container to be converted to dataframes from other libraries.

Dataframe libraries have adopted the protocol, for example: https://github.com/rapidsai/cudf/pull/90710, pandas-dev/pandas#46141, vaexio/vaex#1509, and modin-project/modin#4269.

There are two underlying issues with the protocol:

  1. Sparse columns, as you noted in data-apis/dataframe-api#55
  2. Custom dataframe containers backed by a C-ordered ndarray: according to the specification, the column buffers must be contiguous (XREF: data-apis/dataframe-api#39, "How to consume a single buffer & connection to array interchange")

In both cases, we can still store the blocks as CSR and C-ordered ndarrays, but a copy must be made when __dataframe__ is called. This is exactly what pandas does in its dataframe protocol implementation.
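A consumer-side sketch of the interchange protocol (assuming pandas >= 1.5, which implements both the producer __dataframe__ hook and the consumer from_dataframe):

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

# Producer side: any compliant container exposes __dataframe__.
exchange_obj = df.__dataframe__()
print(exchange_obj.num_columns())  # 2

# Consumer side: materialize any protocol-compliant object as pandas.
# This is the point where a producer holding CSR or C-ordered blocks
# may have to copy, as described above.
roundtripped = from_dataframe(df)
print(roundtripped.equals(df))  # True
```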

This SLEP

As noted in the dev meeting, this SLEP will focus on set_output and pandas output. If we were to create a custom container later, we could use the same set_output API to configure the transformer.

@thomasjpfan (Member, Author) commented Jun 12, 2022

During the monthly developer meeting we decided to remove the "namedtensor" from this SLEP and stick with transform="pandas" for now. I am debating between two APIs for sparse data:

  1. transform="pandas", which will use pandas' Sparse extension arrays for sparse data; this is inefficient. The pro is that feature names are passed through.
  2. transform="pandas_or_sparse", which will output a SciPy sparse matrix for sparse data; this is more efficient. The con is that feature names are not passed through.

The final solution is to have transform="pandas_or_namedsparse" where we have a custom sparse container with column names, which was the initial version of this SLEP. But since we want to reduce the scope, I am okay with option 1, 2 or both.
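Option 1 can be sketched with pandas' sparse accessor (an illustration of the trade-off, not the SLEP's implementation):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# A 3x3 identity stored as CSR, with feature names we want to keep.
X_sparse = sparse.csr_matrix(np.eye(3))
feature_names = ["f0", "f1", "f2"]

# Option 1: wrap in pandas Sparse extension arrays -- the names survive,
# but this path is less efficient than native SciPy sparse (see #16772).
X_df = pd.DataFrame.sparse.from_spmatrix(X_sparse, columns=feature_names)
print(list(X_df.columns))             # ['f0', 'f1', 'f2']
print(round(X_df.sparse.density, 3))  # 0.333

# Option 2 would simply return X_sparse unchanged: efficient, but nameless.
```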

@adrinjalali (Member) commented:

An alternative we talked about during the meeting was not to support sparse in this SLEP/implementation, and only add it as an improvement later with a separate discussion thread.

@thomasjpfan (Member, Author) commented Jun 14, 2022

An alternative we talked about during the meeting was not to support sparse in this SLEP/implementation

I interpreted "not support sparse" to mean "not support sparse with feature names". Transformers will still be returning SciPy sparse data.

From an API point of view, setting set_output="pandas" but returning a SciPy sparse matrix feels a little counterintuitive. This led me down the two paths in #68 (comment). set_output="pandas_or_sparse" means dense data becomes a DataFrame while SciPy sparse output stays unchanged.

To move this forward, I updated the SLEP to say that set_output="pandas" only configures the output for dense data. In the future, we can extend it to configure sparse data using set_output="pandas_or_namedsparse", which is backward compatible.

The API will look a little strange if a user actually wants the pandas sparse DataFrame. In that case, set_output="pandas_or_pandas" would configure both dense and sparse data to return a pandas DataFrame.

@thomasjpfan (Member, Author) commented Jun 14, 2022

Thinking it over, the remaining option is to error when set_output="pandas" and the output is sparse. When it comes to expanding the API, we do not have to deal with the behavior described in #68 (comment)

I updated the SLEP to error if set_output="pandas" and the output is sparse.

@jnothman (Member) left a comment:

Thank you

slep018/proposal.rst:
is to "pandas". If a transformer always returns sparse data, then calling
`set_output="pandas"` may raise an error.

For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the

Reviewer comment (Member):

Are there meaningful cases with an intermediate sparse representation (which would not really call for sparse data with column names)?

@thomasjpfan (Member, Author) replied:

In the context of Pipeline, the only use case I can think of is when one has no use for feature names after the step that returns a sparse matrix without feature names.

For a ColumnTransformer with a OneHotEncoder that outputs sparse data, the OneHotEncoder is not required to output a "named sparse matrix", because ColumnTransformer can figure the names out by calling get_feature_names_out.

If we want to allow intermediate sparse representations without names, there is a future extension of this SLEP with set_output="pandas_or_sparse". This configures the dense container to be a DataFrame and the sparse container to be SciPy sparse.

# X_trans_df is a pandas DataFrame
X_trans_df = num_preprocessor.fit_transform(X_df)

Meta-estimators that support ``set_output`` are required to configure all estimators

Reviewer comment (Member):

All estimators? All transformers?

@thomasjpfan (Member, Author) replied:

In the context of this SLEP, it does make sense to reduce the scope to "all transformers". I was thinking about a future where we had set_output(predict="pandas") for non-transformers.

X_trans_df = num_preprocessor.fit_transform(X_df)

Meta-estimators that support ``set_output`` are required to configure all estimators
by calling ``set_output``.

Reviewer comment (Member):

What is the procedure when this attribute is unavailable on a child estimator? Presumably the child should raise a ValueError if the value "pandas" is not supported. What should the parent do then?

@thomasjpfan (Member, Author) replied:

If set_output is not defined for the child estimator, then the parent errors. If the child does not support "pandas", then the child should error.

In both cases the parent is not properly configured, because one of the children failed the set_output call. From a user's point of view, they cannot use set_output on their meta-estimator.
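A hypothetical sketch of that delegation rule (MetaEstimator, PandasCapable, and Legacy are illustrative names, not scikit-learn API):

```python
# Hypothetical illustration of the rule: the parent forwards set_output to
# every child and surfaces any failure to the user.
class MetaEstimator:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def set_output(self, *, transform=None):
        for name, est in self.steps:
            if not hasattr(est, "set_output"):
                # The parent errors when a child does not define set_output.
                raise ValueError(f"{name!r} does not support set_output")
            # A child that cannot produce the requested container is
            # expected to raise here itself.
            est.set_output(transform=transform)
        return self


class PandasCapable:
    def set_output(self, *, transform=None):
        self.transform_output = transform
        return self


class Legacy:  # defines no set_output at all
    pass


meta = MetaEstimator([("scale", PandasCapable())])
meta.set_output(transform="pandas")  # configures every child

try:
    MetaEstimator([("old", Legacy())]).set_output(transform="pandas")
except ValueError as exc:
    print(exc)  # 'old' does not support set_output
```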

Reviewer comment (Member):

Should this case be in the SLEP?

@thomasjpfan (Member, Author) replied Jun 21, 2022:

Co-authored-by: Joel Nothman <[email protected]>

@ogrisel (Member) left a comment:

A few comments:

  • Since this is the first time we introduce new public methods to set state on estimators (beyond set_params), and since we anticipate that we will need other output containers to support sparse outputs (and maybe also reuse this API to control GPU-allocated output containers), I would like to go through an experimental period where we let the user know that this API is subject to change without going through a deprecation cycle. I am not sure if we need to make this part of the SLEP or just the implementation, though.

  • I think the SLEP should specify that it is also possible to configure the transformers using the with config_context(output_transform="pandas") context manager, as an alternative to the global configuration and local set_output calls. We should also specify the precedence: an explicit set_output call overrides any configuration, while with config_context locally overrides the global configuration previously defined with sklearn.set_config.

@lorentzenchr (Member) commented:

I would like to go through an experimental period

+1. For me, it's fine to implement it this way without mentioning it in the SLEP. However, if it helps to get agreement and a faster vote, we can add it.
In general, I find it a good idea to introduce big new features as experimental: we get the chance to get them in, gain experience with them, learn, and improve.

@jnothman (Member) commented:

Since this is the first time we introduce new public methods to set state on estimators (beyond set_params)

What about SLEP006?

@adrinjalali (Member) commented:

Came here to say what @jnothman said.

@thomasjpfan (Member, Author) commented:

Yes, SLEP006's set_*_requests is the first accepted proposal that configures estimator state beyond set_params and __init__. Still, I think the comment holds: set_output can be marked as experimental. It could be worth doing the same for set_*_requests, so we can make changes to it without deprecation.

As for me, I am happy with marking set_output as experimental.

@adrinjalali (Member) commented:

Marking set_*_requests as experimental would really complicate things, since it would mean the whole metadata routing mechanism is experimental. That means we would have an experimental phase on top of the deprecation phase.

@adrinjalali adrinjalali reopened this Jun 29, 2022
@ogrisel (Member) commented Jun 29, 2022

Marking set_*_requests as experimental would really complicate things, since it would mean the whole metadata routing mechanism is experimental. That means we would have an experimental phase on top of the deprecation phase.

We can just document it as experimental (in the docstring and the changelog) without adding a complex explicit acknowledgement mechanism to enable the feature in the code itself.

@adrinjalali (Member) commented:

We can just document it as experimental (in the docstring and the changelog) without adding a complex explicit acknowledgement mechanism to enable the feature in the code itself.

If we tell users it's experimental, we should give them an option to use the library without the experimental features and without getting deprecation warnings. That is not really possible here, because even if they don't use the experimental feature, they'll still get the deprecation warning.

If we want to tell them it's experimental, then we'd have to let them pass things the old way without warning, and only start the deprecation cycle after the experimental phase is over. I really don't want to go down that route.

@lorentzenchr (Member) commented:

Let's discuss only SLEP018 here.
If there are no intricacies like those in the props/metadata case, starting as experimental makes sense to me.

@amueller (Member) commented:

@ogrisel I'd also love to see a global flag, but then the erroring if it's not supported is a bit more tricky. Like would you require third party estimators to check the global flag and error if they don't support it?
Maybe that's ok, and if users complain and/or it's unworkable we can figure out something else. But I think that for notebooks in particular, people always want dataframes, and they'll just put the global config flag at the top of their notebooks with their imports; we should enable that.

----------------------

There are no backward compatibility concerns, because the ``set_output`` method
is a new API. Third party transformers can opt-in to the API by defining

Reviewer comment (Member):

Do they have to opt in to respect the global flag?

Reviewer comment (Member):

There is no backward compatibility concern as long as we don't change the default of sklearn.set_config(transform_output=XXX), correct?

@thomasjpfan (Member, Author) replied Jul 5, 2022:

There isn't a backward compatibility concern, but there is a question of whether a third party transformer should respect the global flag. Concretely:

sklearn.set_config(transform_output="pandas")

# Should we require this to be a dataframe?
third_party_transformer.transform(X_df)

I'm leaning toward letting each library decide whether it wants to respect the global configuration.

Reviewer comment (Member):

Exactly, that was the case that seemed underspecified to me. I'm ok with leaving it up to the library, which means we won't add a common test asserting that this errors out when unsupported. Not sure if it's worth adding that to the doc as a sentence or half-sentence? Otherwise the doc doesn't really tell third party estimator authors what they should be doing.

1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__
proposes that if the input is a DataFrame then the output is a DataFrame.
2. :ref:`SLEP012 <slep_012>` proposes a custom scikit-learn container for dense
and sparse data that contains feature names. This SLEP also proposes a custom

Reviewer comment (Member):

The sparse data is now in future work, right?

Sparse Data
...........

The Pandas DataFrame is not suitable to provide column names because it has

Suggested change
The Pandas DataFrame is not suitable to provide column names because it has
The Pandas DataFrame is not suitable to provide column names for sparse data because it has

# X_trans_df is a pandas DataFrame
X_trans_df = scaler.transform(X_df)

The index of the output DataFrame must match the index of the input. If the

Reviewer comment (Member):

How is sparse data treated? If I do set_output(transform="pandas") on a OneHotEncoder(sparse=True), will it error or will it produce a dense pandas array? How about estimators that don't have an explicit dense support, like CountVectorizer?
For CountVectorizer, erroring seems natural; for OHE it seems strange to require the user to state the output intention in two places.

@thomasjpfan (Member, Author) replied:

If I do set_output(transform="pandas") on a OneHotEncoder(sparse=True), will it error or will it produce a dense pandas array? How about estimators that don't have an explicit dense support, like CountVectorizer?

Error in both cases.

for OHE it seems strange to require the user to set the output intention in two places.

I think it's strange not to error. If sparse=True and a dense pandas DataFrame is returned, then the output is not consistent with sparse=True. As long as pandas + sparse is not supported, I prefer the explicitness of configuring it in two places.

From a workflow point of view, the OHE will be in a pipeline, so it ends up being one extra configuration (sparse=True). This is because set_output will be called on the pipeline, which configures everything:

preprocessor = ColumnTransformer([("cat", OneHotEncoder(sparse=True), ...)])
pipeline = make_pipeline(preprocessor, ...)
pipeline.set_output(transform="pandas")

Reviewer comment (Member):

You're right, in a pipeline it's not so bad. It might be worth calling this out in the OHE documentation or in the error message? I agree it's better to be explicit, but it might surprise users who don't know that the default is sparse=True, in particular because it can be invisible when used inside a ColumnTransformer.

@thomasjpfan (Member, Author) commented:

I'd also love to see a global flag, but then the erroring if it's not supported is a bit more tricky. Like would you require third party estimators to check the global flag and error if they don't support it?

Having a global configuration does introduce this inconsistency with third party estimators. I prefer not to require it from third party estimators, but it would be a better UX if they chose to honor the global configuration.

Maybe that's ok, and if users complain and/or it's unworkable we can figure out something else.

If users complain, then we direct them to the third party's repo? We already do this sometimes when a library does not define a scikit-learn compatible estimator. I think third party libraries will be motivated to use the global configuration anyway because of the better UX.

@lorentzenchr (Member) commented:

I think this SLEP should motivate why it considers pandas, and not any other dataframe library (such as pyarrow, polars, …), as the data container.

@thomasjpfan (Member, Author) commented Jul 8, 2022

I updated the PR:

  • 1d57415 (#68): Adds details about meta-estimators and fitted and non-fitted inner transformers.
  • 68ada33: Adds details on why pandas, and on the possibility of supporting other DataFrame libraries. (Technically I see a few options to extend set_output so users can configure transform to return another DataFrame implementation, but this is outside the scope of this SLEP.)

@amueller amueller merged commit 5bd1c9c into scikit-learn:master Jul 17, 2022