Feature Request: Add PandasCategoricalEncoder to encode categorical features as pandas categorical #828

Open
ClaudioSalvatoreArcidiacono opened this issue Dec 3, 2024 · 2 comments · May be fixed by #829

Comments

@ClaudioSalvatoreArcidiacono
Contributor

Some libraries, such as LightGBM, integrate well with pandas categorical types.
I could not find an implementation that encodes categorical features as pandas
categorical columns while preserving the categories across different datasets. I would like to
propose adding a PandasCategoricalEncoder to the feature_engine library to
address this.

Is your feature request related to a problem? Please describe.
Yes. I often run into problems when working with categorical data in pandas. When each
dataset is converted to the categorical dtype on its own, the categories are inferred per
dataset, so the encoding is not guaranteed to be consistent across datasets, which can
lead to errors.

Describe the solution you'd like
I would like to implement the PandasCategoricalEncoder class, which will transform
categorical features into pandas categorical types. This encoder will ensure that
categories are encoded consistently between training and testing datasets, and it will
handle unseen categories gracefully based on specified parameters.
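
To make the proposal concrete, here is a minimal sketch of the idea using only pandas. The class name `PandasCategoricalSketch`, the `unseen` parameter and its two values are assumptions made for illustration; they are not the API of feature_engine or of the linked pull request.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


class PandasCategoricalSketch:
    """Hypothetical sketch, not the feature_engine implementation."""

    def __init__(self, variables, unseen="ignore"):
        if unseen not in ("ignore", "raise"):
            raise ValueError("unseen must be 'ignore' or 'raise'")
        self.variables = variables
        self.unseen = unseen

    def fit(self, X):
        # Learn one fixed categorical dtype per variable from the training data.
        self.dtypes_ = {
            var: CategoricalDtype(categories=sorted(X[var].dropna().unique()))
            for var in self.variables
        }
        return self

    def transform(self, X):
        X = X.copy()
        for var, dtype in self.dtypes_.items():
            is_known = X[var].isin(dtype.categories)
            if self.unseen == "raise":
                unseen = sorted(X[var][~is_known].dropna().unique())
                if unseen:
                    raise ValueError(f"Unseen categories in {var!r}: {unseen}")
            # Mask values outside the learned categories, then apply the
            # dtype learned in fit so every dataset shares the same categories.
            X[var] = X[var].where(is_known).astype(dtype)
        return X


train = pd.DataFrame({"colour": ["red", "blue", "green", "blue"]})
test = pd.DataFrame({"colour": ["blue", "purple"]})

encoder = PandasCategoricalSketch(variables=["colour"]).fit(train)
train_t = encoder.transform(train)
test_t = encoder.transform(test)  # "purple" was not seen in train, so it becomes NaN
```

Because both frames are cast to the dtype learned in `fit`, the integer codes behind each category are the same in the training and testing data.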

Describe alternatives you've considered
I have considered using existing categorical encoding libraries, but they do not provide
such a feature.

Additional context
The PandasCategoricalEncoder will include features such as handling missing values,
allowing for flexible unseen category management, and providing methods for inverse
transformation to retrieve original values. This will enhance the usability and
reliability of categorical data processing in pandas.
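
As a rough illustration of the missing-value and inverse-transformation behaviour, the following pandas-only sketch casts a column to a fixed categorical dtype and back. The variable names and the category list are made up for the example, and the proposed encoder may handle these cases differently.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# A fixed dtype, as it might have been learned from a training set (assumed here).
dtype = CategoricalDtype(categories=["blue", "green", "red"])

original = pd.Series(["red", None, "blue"], dtype=object)

# Missing values stay missing; pandas stores them with the code -1.
encoded = original.astype(dtype)

# Casting back to object recovers the original labels.
restored = encoded.astype(object)
```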

@ClaudioSalvatoreArcidiacono ClaudioSalvatoreArcidiacono linked a pull request Dec 3, 2024 that will close this issue
@solegalli
Collaborator

Hi @ClaudioSalvatoreArcidiacono

Our encoders try to handle pandas categorical variables within their functionality. They should be able to take variables that are of type object and of type categorical simultaneously, and for those that are of type categorical, we do have some functionality to make it work (i.e., add categories from the train set to the test set to ensure compatibility).
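
For readers unfamiliar with the compatibility problem mentioned above, this is a small pandas-only sketch of aligning the categories of a test column with those of a train column. It shows the general technique, not feature_engine's internal code.

```python
import pandas as pd

train = pd.Series(["a", "b", "c"], dtype="category")
test = pd.Series(["a", "b"], dtype="category")

# Inferred independently, the two columns end up with different dtypes.
dtypes_differ = train.dtype != test.dtype

# Reusing the train categories on the test column makes the dtypes match.
aligned = test.cat.set_categories(train.cat.categories)
```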

Did you test our encoders with a dataset where these did not work?

@ClaudioSalvatoreArcidiacono
Contributor Author

ClaudioSalvatoreArcidiacono commented Dec 5, 2024

Hey @solegalli,

This encoder is indeed very similar to OrdinalEncoder. However, from what I understand, it is not possible to get pandas categorical columns as the output of OrdinalEncoder, right? The other way around is definitely possible with OrdinalEncoder.

I tried to add this feature as an extra option for OrdinalEncoder, but since it shares the actual encoding logic with many other encoders in the library, I felt it would be easier to create a new class.
