SLEP018 Pandas output for transformers with set_output #68
Conversation
If we introduce new (internal) containers to add named columns to sparse matrices, we could even have a container that has several column blocks of different types (e.g. pandas dataframe and sparse matrices, possibly with different dtypes), for instance to store the output of a column transformer without any a priori data conversion. This would allow the downstream estimator in the pipeline to materialize this input in an optimal way. Some column-wise estimators such as coordinate-descent linear models and tree models could even accept data with a mixed column-block representation.
I think it's possible to have a new custom container that contains several blocks. It can also adopt the dataframe interchange protocol, which allows the custom container to be converted to dataframes in other libraries. Dataframe libraries such as https://github.com/rapidsai/cudf/pull/90710, pandas-dev/pandas#46141, vaexio/vaex#1509, and modin-project/modin#4269 have adopted the protocol. There are two underlying issues with the protocol:
In both cases, we can still store the blocks as CSR and C-ordered ndarrays. As noted in the dev meeting, this SLEP will focus on pandas output.
During the monthly developer meeting we decided to remove the "namedtensor" option from this SLEP and stick with pandas output.
The final solution is to have `set_output(transform="pandas")`.
An alternative we talked about during the meeting was not to support sparse in this SLEP/implementation, and only add it as an improvement later with a separate discussion thread.
I interpreted "not support sparse" to mean "not support sparse with feature names"; transformers will still return SciPy sparse data. To move this forward, I updated the SLEP to say so. The API will look a little strange if a user actually wants a pandas sparse DataFrame.
Thinking it over, the remaining option is to error when a transformer produces sparse output while pandas output is requested. I updated the SLEP to error in that case.
Thank you
slep018/proposal.rst (Outdated)

    is to "pandas". If a transformer always returns sparse data, then calling
    `set_output="pandas"` may raise an error.

    For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the
Are there meaningful cases with an intermediate sparse representation (which would not really call for sparse data with column names)?
In the context of `Pipeline`, the only use case I can think of is if one has no use for feature names after the step that returns the sparse matrix without feature names. For `ColumnTransformer` with a `OneHotEncoder` that outputs sparse data, `OneHotEncoder` is not required to output a "named sparse matrix" because `ColumnTransformer` can figure it out by calling `get_feature_names_out`.

If we want to allow for intermediate sparse representations without names, there is a future extension to this SLEP with `set_output="pandas_or_sparse"`. This configures the dense container to be a dataframe and the sparse container to be SciPy sparse.
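The dispatch rule for such a future option can be sketched in a few lines. This is purely illustrative: the option name `"pandas_or_sparse"` comes from the comment above, and the `wrap_output` helper is hypothetical, not a scikit-learn API.

```python
import pandas as pd
import scipy.sparse as sp

def wrap_output(Xt, feature_names, *, transform="pandas_or_sparse"):
    """Hypothetical dispatch for the proposed "pandas_or_sparse" option:
    dense results become a DataFrame, sparse results stay SciPy sparse."""
    if transform == "pandas_or_sparse":
        if sp.issparse(Xt):
            # Intermediate sparse representation, no names attached.
            return Xt
        return pd.DataFrame(Xt, columns=feature_names)
    # "default": return the container unchanged.
    return Xt
```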
slep018/proposal.rst (Outdated)

    # X_trans_df is a pandas DataFrame
    X_trans_df = num_preprocessor.fit_transform(X_df)

    Meta-estimators that support ``set_output`` are required to configure all estimators
All estimators? All transformers?
In the context of this SLEP, it does make sense to reduce the scope to "all transformers". I was thinking about a future where we had `set_output(predict="pandas")` for non-transformers.
slep018/proposal.rst (Outdated)

    X_trans_df = num_preprocessor.fit_transform(X_df)

    Meta-estimators that support ``set_output`` are required to configure all estimators
    by calling ``set_output``.
What is the procedure when this attribute is unavailable on a child estimator? Presumably the child should raise a `ValueError` if the value "pandas" is not supported. What should the parent do then?
If `set_output` is not defined for the child estimator, then the parent errors. If the child does not support "pandas", then the child should error. In both cases, the parent is not properly configured because one of the children failed the `set_output` call. From a user's point of view, they cannot use `set_output` for their meta-estimator.
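The propagation rule described here can be sketched in plain Python. All class names are illustrative, not the actual scikit-learn implementation: the point is only that the parent errors for a missing `set_output`, and the child's own error surfaces for an unsupported container.

```python
class Child:
    """A transformer that supports set_output for "pandas"."""
    def set_output(self, *, transform=None):
        if transform not in (None, "default", "pandas"):
            # Unsupported container: the child itself errors.
            raise ValueError(f"Unsupported output container: {transform!r}")
        self._output = transform
        return self

class ChildWithoutSetOutput:
    """A transformer that predates the set_output API."""

class Meta:
    """A meta-estimator propagating set_output to its steps."""
    def __init__(self, steps):
        self.steps = steps

    def set_output(self, *, transform=None):
        for step in self.steps:
            # If a child does not define set_output, the parent errors.
            if not hasattr(step, "set_output"):
                raise AttributeError(
                    f"{type(step).__name__} does not implement set_output"
                )
            # If a child does not support the requested container,
            # the child's own ValueError propagates up.
            step.set_output(transform=transform)
        return self
```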
Should this case be in the slep?
Part of this is in the SLEP here: https://github.com/thomasjpfan/enhancement_proposals/blob/218b76ad5612663ab9408dd4d824df357dda4d05/slep018/proposal.rst?plain=1#L76-L78 but it makes sense to move it up to this section.
Co-authored-by: Joel Nothman <[email protected]>
A few comments:

- Since this is the first time we introduce new public methods to set state on estimators (beyond `set_params`), and since we anticipate that we will need other output containers to support sparse outputs (and maybe also reuse this API to control GPU-allocated output containers), I would like to go through an experimental period where we let the user know that this API is subject to change without going through a deprecation cycle. Not sure if we need to make this part of the SLEP or just the implementation though.
- I think the SLEP should specify that it is also possible to configure the transformers using the `with config_context(output_transform="pandas")` context manager as an alternative to global configuration and local `set_output` calls. We should also specify the precedence: calling `set_output` explicitly overrides any configuration, while `with config_context` locally overrides the global configuration previously defined with `sklearn.set_config`.
+1. For me, it's fine to implement it this way without having it mentioned in the SLEP. However, if it helps to get agreement and voting faster, we can add it.
What about SLEP006?
Came here to say what @jnothman said.
Yes. As for me, I am happy with marking `set_output` as experimental.
Making
We can just document it as experimental (in the docstring and the changelog) without adding a complex explicit acknowledgement mechanism to enable the feature in the code itself.
If we tell users it's experimental, we should give them an option to use the library without the experimental features and without getting deprecation warnings, which is not really possible, because if they don't use the experimental feature, they'll get the deprecation warning. If we want to tell them it's experimental, then we'd have to allow them to pass things the old way without warning them, and only after the experimental phase is over would the deprecation cycle begin. I really don't want to go down that route.
Let's discuss only SLEP018 here.
@ogrisel I'd also love to see a global flag, but then erroring when it's not supported is a bit more tricky. Would you require third party estimators to check the global flag and error if they don't support it?
slep018/proposal.rst

    ----------------------

    There are no backward compatibility concerns, because the ``set_output`` method
    is a new API. Third party transformers can opt-in to the API by defining
Do they have to opt-in to respect the global flag?
There is no backward compatibility concern as long as we don't change the default of `sklearn.set_config(transform_output=XXX)`, correct?
There isn't a backward compatibility concern, but there is a question of whether a third party transformer should respect the global flag. Concretely:

    sklearn.set_config(transform_output="pandas")

    # Should we require this to be a dataframe?
    third_party_transformer.transform(X_df)

I'm leaning toward letting the library decide if it wants to respect the global configuration.
Exactly, that was the case that seemed underspecified to me. I'm OK to leave it up to the library, which means we won't add a common test that errors out when it's not supported. Not sure if it's worth adding that to the doc as a sentence or half-sentence? Otherwise the doc doesn't really tell third party estimator authors what they should be doing.
slep018/proposal.rst (Outdated)

    1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__
       proposes that if the input is a DataFrame than the output is a DataFrame.
    2. :ref:`SLEP012 <slep_012>` proposes a custom scikit-learn container for dense
       and sparse data that contains feature names. This SLEP also proposes a custom
The sparse data is now in future work, right?
slep018/proposal.rst (Outdated)

    Sparse Data
    ...........

    The Pandas DataFrame is not suitable to provide column names because it has
Suggested change:

    - The Pandas DataFrame is not suitable to provide column names because it has
    + The Pandas DataFrame is not suitable to provide column names for sparse data because it has
slep018/proposal.rst

    # X_trans_df is a pandas DataFrame
    X_trans_df = scalar.transform(X_df)

    The index of the output DataFrame must match the index of the input. If the
How is sparse data treated? If I do `set_output(transform="pandas")` on a `OneHotEncoder(sparse=True)`, will it error or will it produce a dense pandas array? How about estimators that don't have explicit dense support, like `CountVectorizer`?

For `CountVectorizer`, erroring seems natural; for OHE it seems strange to require the user to set the output intention in two places.
> If I do set_output(transform="pandas") on a OneHotEncoder(sparse=True), will it error or will it produce a dense pandas array? How about estimators that don't have explicit dense support, like CountVectorizer?

Error in both cases.

> for OHE it seems strange to require the user to set the output intention in two places.

I think it's strange not to error. If `sparse=True` and a dense pandas array is returned, then it is not consistent with `sparse=True`. As long as pandas + sparse is not supported, I prefer the explicitness of configuring in two places.

From a workflow point of view, the OHE will be in a pipeline, so it ends up being one extra configuration (`sparse=True`). This is because `set_output` will be called on the pipeline, which configures everything:

    preprocessor = ColumnTransformer([("cat", OneHotEncoder(sparse=True), ...)])
    pipeline = make_pipeline(preprocessor, ...)
    pipeline.set_output(transform="pandas")
You're right, in a pipeline it's not so bad. It might be worth calling it out in the OHE documentation or in the error message? I agree it's better to be explicit, but it might also be surprising to users who don't know that the default is `sparse=True`, in particular because it might be invisible if used in a column transformer.
Co-authored-by: Christian Lorentzen <[email protected]>
Having a global configuration does introduce this inconsistency with third party estimators. I prefer not to require it from third party estimators, but it would be a better UX if they chose to use the global configuration.

If users complain, then we direct them to the third party's repo? We already do this sometimes, when a library does not define a scikit-learn compatible estimator. I think third party libraries will be motivated to use the global configuration anyway because of the better UX.
I think this SLEP should motivate why it considers pandas, and not any other dataframe library (such as pyarrow, polars, …), as the data container.
I updated the PR:
This SLEP proposes a `set_output` API used to configure the output of `transform`. The overall idea is to use `set_output(transform="pandas_or_sparse")`, which outputs a pandas dataframe for dense data and a SciPy sparse matrix for sparse data.

Use cases
I put together a functional prototype of this API that you can explore in this colab notebook. Here is a rendered version of the demo. The demo includes the following use cases:
Future Extensions
The Pandas DataFrame is not suitable to provide column names for sparse data because it has performance issues, as shown in #16772. A future extension to this SLEP is to have a `"pandas_or_namedsparse"` option. This option will use a scikit-learn specific sparse container that subclasses SciPy's sparse matrices. This sparse container includes the sparse data, feature names, and index. This enables pipelines with Vectorizers without performance issues.

CC @amueller @glemaitre @lorentzenchr
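A minimal sketch of such a named sparse container. The class name is hypothetical, and note the caveat it would have to solve: operations on SciPy matrices return plain `csr_matrix`, so a real implementation would need to propagate the metadata through operations.

```python
import scipy.sparse as sp

class NamedSparse(sp.csr_matrix):
    """Sketch of the proposed "namedsparse" container: a CSR matrix
    that also carries feature names and a row index. Arithmetic on it
    returns plain csr_matrix, dropping the metadata; a real container
    would have to propagate it."""

    def __init__(self, arg1, feature_names=None, index=None, **kwargs):
        super().__init__(arg1, **kwargs)
        self.feature_names = list(feature_names) if feature_names else None
        self.index = list(index) if index is not None else None

# Wrap an existing sparse matrix with names and an index.
X = NamedSparse(
    sp.eye(3, 2, format="csr"),
    feature_names=["f0", "f1"],
    index=[0, 1, 2],
)
```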