
SLEP018 Pandas output for transformers with set_output #68

Merged · 15 commits · Jul 17, 2022

Conversation

@thomasjpfan (Member) commented May 26, 2022

This SLEP proposes a set_output API used to configure the output of transform. The overall idea is to use set_output(transform="pandas_or_sparse"), which outputs a pandas dataframe for dense data and a scipy sparse matrix for sparse data.

Use cases

I put together a functional prototype of this API that you can explore in this colab notebook. Here is a rendered version of the demo. The demo includes the following use cases:

  • DataFrame output from a Single transformer
  • Column Transformer with DataFrame output
  • Feature selection based on column names with cross validation
  • Using HistGradientBoosting to select categories based on dtype
  • Text preprocessing with sparse data

Future Extensions

The Pandas DataFrame is not suitable to provide column names for sparse data because it has performance issues, as shown in #16772. A future extension to this SLEP is a "pandas_or_namedsparse" option. This option would use a scikit-learn specific sparse container that subclasses SciPy's sparse matrices and carries the sparse data, feature names, and index. This enables pipelines with vectorizers without performance issues:

pipe = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    LogisticRegression(solver="liblinear"),
)
pipe.set_output(transform="pandas_or_namedsparse")

# feature names for logistic regression
pipe[-1].feature_names_in_

CC @amueller @glemaitre @lorentzenchr

@ogrisel (Member) commented May 30, 2022

If we introduce new (internal) containers to add named columns to sparse matrices, we could even have a container holding several column blocks of different types (e.g. pandas dataframes and sparse matrices, possibly with different dtypes), for instance to store the output of a column transformer without any a priori data conversion. This would allow the downstream estimator in the pipeline to materialize this input in an optimal way.

Some column-wise estimators, such as coordinate-descent linear models and tree models, could even accept data with a mixed column-block representation.

@thomasjpfan (Member, Author) commented May 30, 2022

I think it's possible to have a new custom container that holds several blocks. It could also adopt the dataframe interchange protocol, which allows the custom container to be converted to dataframes from other libraries.

Dataframe libraries have adopted the protocol, for example: https://github.com/rapidsai/cudf/pull/90710, pandas-dev/pandas#46141, vaexio/vaex#1509, and modin-project/modin#4269.

There are two underlying issues with the protocol:

  1. Sparse columns, as you noted in data-apis/dataframe-api#55
  2. Custom dataframe containers backed by a C-ordered ndarray: according to the specification, the column buffers must be contiguous (XREF: data-apis/dataframe-api#39, "How to consume a single buffer & connection to array interchange")

In both cases, we can still store the blocks as CSR and C-ordered ndarrays, but a copy must be made when __dataframe__ is called. This is exactly what pandas does in its dataframe protocol implementation.
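A consumer-side sketch of the interchange protocol (assuming pandas >= 1.5, which implements both the producer __dataframe__ hook and the consumer from_dataframe):

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

# Producer side: any compliant container exposes __dataframe__.
exchange_obj = df.__dataframe__()
print(exchange_obj.num_columns())  # 2

# Consumer side: materialize any protocol-compliant object as pandas.
# This is the point where a producer holding CSR or C-ordered blocks
# may have to copy, as described above.
roundtripped = from_dataframe(df)
print(roundtripped.equals(df))  # True
```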

This SLEP

As noted in the dev meeting, this SLEP will focus on set_output and pandas output. If we were to create a custom container later, we could use the same set_output API to configure the transformer.

@thomasjpfan (Member, Author) commented Jun 12, 2022

During the monthly developer meeting we decided to remove the "namedtensor" from this SLEP and stick with transform="pandas" for now. I am debating between two APIs for sparse data:

  1. transform="pandas", which will use pandas' Sparse extension arrays for sparse data; this is inefficient. The pro is that feature names are passed through.
  2. transform="pandas_or_sparse", which will output a SciPy sparse matrix for sparse data; this is more efficient. The con is that feature names are not passed through.

The final solution is to have transform="pandas_or_namedsparse" where we have a custom sparse container with column names, which was the initial version of this SLEP. But since we want to reduce the scope, I am okay with option 1, 2 or both.
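Option 1 can be sketched with pandas' sparse accessor (an illustration of the trade-off, not the SLEP's implementation):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# A 3x3 identity stored as CSR, with feature names we want to keep.
X_sparse = sparse.csr_matrix(np.eye(3))
feature_names = ["f0", "f1", "f2"]

# Option 1: wrap in pandas Sparse extension arrays -- the names survive,
# but this path is less efficient than native SciPy sparse (see #16772).
X_df = pd.DataFrame.sparse.from_spmatrix(X_sparse, columns=feature_names)
print(list(X_df.columns))             # ['f0', 'f1', 'f2']
print(round(X_df.sparse.density, 3))  # 0.333

# Option 2 would simply return X_sparse unchanged: efficient, but nameless.
```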

@adrinjalali (Member) commented:

An alternative we talked about during the meeting was not to support sparse in this SLEP/implementation, and only add it as an improvement later with a separate discussion thread.

@thomasjpfan (Member, Author) commented Jun 14, 2022

An alternative we talked about during the meeting was not to support sparse in this SLEP/implementation

I interpreted "not support sparse" to mean "not support sparse with feature names". Transformers will still be returning SciPy sparse data.

From an API point of view, setting set_output="pandas" but returning a SciPy sparse matrix feels a little counterintuitive. This led me down the two paths in #68 (comment). set_output="pandas_or_sparse" means dense data becomes a DataFrame while SciPy sparse output stays unchanged.

To move this forward, I updated the SLEP to say that set_output="pandas" only configures the output for dense data. In the future, we can extend it to configure sparse data using set_output="pandas_or_namedsparse", which is backward compatible.

The API will look a little strange if a user actually wants the pandas sparse DataFrame. In that case, set_output="pandas_or_pandas" would configure both dense and sparse data to return a pandas DataFrame.

@thomasjpfan (Member, Author) commented Jun 14, 2022

Thinking it over, the remaining option is to error when set_output="pandas" and the output is sparse. When it comes to expanding the API, we do not have to deal with the behavior described in #68 (comment)

I updated the SLEP to error if set_output="pandas" and the output is sparse.

@jnothman (Member) left a comment:

Thank you

slep018/proposal.rst:
is to "pandas". If a transformer always returns sparse data, then calling
`set_output="pandas"` may raise an error.

For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the

Reviewer comment (Member):

Are there meaningful cases with an intermediate sparse representation (which would not really call for sparse data with column names)?

@thomasjpfan (Member, Author) replied:

In the context of Pipeline, the only use case I can think of is when one has no use for feature names after the step that returns a sparse matrix without feature names.

For a ColumnTransformer with a OneHotEncoder that outputs sparse data, the OneHotEncoder is not required to output a "named sparse matrix", because ColumnTransformer can figure the names out by calling get_feature_names_out.

If we want to allow intermediate sparse representations without names, there is a future extension of this SLEP with set_output="pandas_or_sparse". This configures the dense container to be a DataFrame and the sparse container to be SciPy sparse.

# X_trans_df is a pandas DataFrame
X_trans_df = num_preprocessor.fit_transform(X_df)

Meta-estimators that support ``set_output`` are required to configure all estimators

Reviewer comment (Member):

All estimators? All transformers?

@thomasjpfan (Member, Author) replied:

In the context of this SLEP, it does make sense to reduce the scope to "all transformers". I was thinking about a future where we had set_output(predict="pandas") for non-transformers.

X_trans_df = num_preprocessor.fit_transform(X_df)

Meta-estimators that support ``set_output`` are required to configure all estimators
by calling ``set_output``.

Reviewer comment (Member):

What is the procedure when this attribute is unavailable on a child estimator? Presumably the child should raise a ValueError if the value "pandas" is not supported. What should the parent do then?

@thomasjpfan (Member, Author) replied:

If set_output is not defined for the child estimator, then the parent errors. If the child does not support "pandas", then the child should error.

In both cases the parent is not properly configured, because one of the children failed the set_output call. From a user's point of view, they cannot use set_output on their meta-estimator.
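A hypothetical sketch of that delegation rule (MetaEstimator, PandasCapable, and Legacy are illustrative names, not scikit-learn API):

```python
# Hypothetical illustration of the rule: the parent forwards set_output to
# every child and surfaces any failure to the user.
class MetaEstimator:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def set_output(self, *, transform=None):
        for name, est in self.steps:
            if not hasattr(est, "set_output"):
                # The parent errors when a child does not define set_output.
                raise ValueError(f"{name!r} does not support set_output")
            # A child that cannot produce the requested container is
            # expected to raise here itself.
            est.set_output(transform=transform)
        return self


class PandasCapable:
    def set_output(self, *, transform=None):
        self.transform_output = transform
        return self


class Legacy:  # defines no set_output at all
    pass


meta = MetaEstimator([("scale", PandasCapable())])
meta.set_output(transform="pandas")  # configures every child

try:
    MetaEstimator([("old", Legacy())]).set_output(transform="pandas")
except ValueError as exc:
    print(exc)  # 'old' does not support set_output
```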

Reviewer comment (Member):

Should this case be in the SLEP?

@thomasjpfan (Member, Author) replied Jun 21, 2022:

Co-authored-by: Joel Nothman <[email protected]>

@ogrisel (Member) left a comment:

A few comments:

  • Since this is the first time we introduce new public methods to set state on estimators (beyond set_params), and since we anticipate that we will need other output containers to support sparse outputs (and maybe also reuse this API to control GPU-allocated output containers), I would like to go through an experimental period where we let the user know that this API is subject to change without going through a deprecation cycle. I am not sure if we need to make this part of the SLEP or just the implementation, though.

  • I think the SLEP should specify that it is also possible to configure the transformers using the with config_context(output_transform="pandas") context manager, as an alternative to the global configuration and local set_output calls. We should also specify the precedence: an explicit set_output call overrides any configuration, while with config_context locally overrides the global configuration previously defined with sklearn.set_config.

@lorentzenchr (Member) commented:

I would like to go through an experimental period

+1. For me, it's fine to implement it this way without mentioning it in the SLEP. However, if it helps to get agreement and a faster vote, we can add it.
In general, I find it a good idea to introduce big new features as experimental: we get the chance to get them in, gain experience with them, learn, and improve.

@jnothman (Member) commented:

Since this is the first time we introduce new public methods to set state on estimators (beyond set_params)

What about SLEP006?

@adrinjalali (Member) commented:

Came here to say what @jnothman said.

@thomasjpfan (Member, Author) commented:

Yes, SLEP006's set_*_requests is the first accepted proposal that configures estimator state beyond set_params and __init__. Still, I think the comment holds: set_output can be marked as experimental. It could be worth doing the same for set_*_requests, so we can make changes to it without deprecation.

As for me, I am happy with marking set_output as experimental.

@adrinjalali (Member) commented:

Marking set_*_requests as experimental would really complicate things, since it would mean the whole metadata routing mechanism is experimental. That means we would have an experimental phase on top of the deprecation phase.

@adrinjalali adrinjalali reopened this Jun 29, 2022
@ogrisel (Member) commented Jun 29, 2022

Marking set_*_requests as experimental would really complicate things, since it would mean the whole metadata routing mechanism is experimental. That means we would have an experimental phase on top of the deprecation phase.

We can just document it as experimental (in the docstring and the changelog) without adding a complex explicit acknowledgement mechanism to enable the feature in the code itself.

@adrinjalali (Member) commented:

We can just document it as experimental (in the docstring and the changelog) without adding a complex explicit acknowledgement mechanism to enable the feature in the code itself.

If we tell users it's experimental, we should give them an option to use the library without the experimental features and without getting deprecation warnings. That is not really possible here, because even if they don't use the experimental feature, they'll still get the deprecation warning.

If we want to tell them it's experimental, then we'd have to let them pass things the old way without warning, and only start the deprecation cycle after the experimental phase is over. I really don't want to go down that route.

@lorentzenchr (Member) commented:

Let's discuss only SLEP018 here.
If there are no intricacies like those in the props/metadata case, starting as experimental makes sense to me.

@amueller (Member) commented:

@ogrisel I'd also love to see a global flag, but then the erroring if it's not supported is a bit more tricky. Like would you require third party estimators to check the global flag and error if they don't support it?
Maybe that's ok, and if users complain and/or it's unworkable we can figure out something else. But I think that for notebooks in particular, people always want dataframes, and they'll just put the global config flag at the top of their notebooks with their imports; we should enable that.

----------------------

There are no backward compatibility concerns, because the ``set_output`` method
is a new API. Third party transformers can opt-in to the API by defining

Reviewer comment (Member):

Do they have to opt in to respect the global flag?

Reviewer comment (Member):

There is no backward compatibility concern as long as we don't change the default of sklearn.set_config(transform_output=XXX), correct?

@thomasjpfan (Member, Author) replied Jul 5, 2022:

There isn't a backward compatibility concern, but there is a question of whether a third party transformer should respect the global flag. Concretely:

sklearn.set_config(transform_output="pandas")

# Should we require this to be a dataframe?
third_party_transformer.transform(X_df)

I'm leaning toward letting each library decide whether it wants to respect the global configuration.

Reviewer comment (Member):

Exactly, that was the case that seemed underspecified to me. I'm ok with leaving it up to the library, which means we won't add a common test asserting that this errors out when unsupported. Not sure if it's worth adding that to the doc as a sentence or half-sentence? Otherwise the doc doesn't really tell third party estimator authors what they should be doing.

1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__
proposes that if the input is a DataFrame then the output is a DataFrame.
2. :ref:`SLEP012 <slep_012>` proposes a custom scikit-learn container for dense
and sparse data that contains feature names. This SLEP also proposes a custom

Reviewer comment (Member):

The sparse data is now in future work, right?

Sparse Data
...........

The Pandas DataFrame is not suitable to provide column names because it has

Suggested change
The Pandas DataFrame is not suitable to provide column names because it has
The Pandas DataFrame is not suitable to provide column names for sparse data because it has

# X_trans_df is a pandas DataFrame
X_trans_df = scaler.transform(X_df)

The index of the output DataFrame must match the index of the input. If the

Reviewer comment (Member):

How is sparse data treated? If I do set_output(transform="pandas") on a OneHotEncoder(sparse=True), will it error or will it produce a dense pandas array? How about estimators that don't have an explicit dense support, like CountVectorizer?
For CountVectorizer, erroring seems natural; for OHE it seems strange to require the user to state the output intention in two places.

@thomasjpfan (Member, Author) replied:

If I do set_output(transform="pandas") on a OneHotEncoder(sparse=True), will it error or will it produce a dense pandas array? How about estimators that don't have an explicit dense support, like CountVectorizer?

Error in both cases.

for OHE it seems strange to require the user to set the output intention in two places.

I think it's strange not to error. If sparse=True and a dense pandas DataFrame is returned, then the output is not consistent with sparse=True. As long as pandas + sparse is not supported, I prefer the explicitness of configuring it in two places.

From a workflow point of view, the OHE will be in a pipeline, so it ends up being one extra configuration (sparse=True). This is because set_output will be called on the pipeline, which configures everything:

preprocessor = ColumnTransformer([("cat", OneHotEncoder(sparse=True), ...)])
pipeline = make_pipeline(preprocessor, ...)
pipeline.set_output(transform="pandas")

Reviewer comment (Member):

You're right, in a pipeline it's not so bad. It might be worth calling this out in the OHE documentation or in the error message? I agree it's better to be explicit, but it might surprise users who don't know that the default is sparse=True, in particular because it can be invisible when used inside a ColumnTransformer.

@thomasjpfan (Member, Author) commented:

I'd also love to see a global flag, but then the erroring if it's not supported is a bit more tricky. Like would you require third party estimators to check the global flag and error if they don't support it?

Having a global configuration does introduce this inconsistency with third party estimators. I prefer not to require it from third party estimators, but it would be a better UX if they chose to honor the global configuration.

Maybe that's ok, and if users complain and/or it's unworkable we can figure out something else.

If users complain, then we direct them to the third party's repo? We already do this sometimes when a library does not define a scikit-learn compatible estimator. I think third party libraries will be motivated to use the global configuration anyway because of the better UX.

@lorentzenchr (Member) commented:

I think this SLEP should motivate why it considers pandas, and not any other dataframe library (such as pyarrow, polars, …), as the data container.

@thomasjpfan (Member, Author) commented Jul 8, 2022

I updated the PR:

  • 1d57415 (#68): Adds details about meta-estimators and fitted and non-fitted inner transformers.
  • 68ada33: Adds details on why pandas, and on the possibility of supporting other DataFrame libraries. (Technically I see a few options to extend set_output so users can configure transform to return another DataFrame implementation, but this is outside the scope of this SLEP.)

@amueller amueller merged commit 5bd1c9c into scikit-learn:master Jul 17, 2022