Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TargetMeanDiscretiser: sorts variables in bins and replaces bins by target mean value #419

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

Morgan-Sell
Copy link
Collaborator

Closes #394.

The transformer accepts a dictionary that defines how numeric variables will be discretized/organized into bins. The transformer calculates and returns the average for the respective bins.

@Morgan-Sell
Copy link
Collaborator Author

@solegalli,

I was initially thinking that this class would be similar to ArbitraryDiscretiser where the user passes the desired bins. However, I'm now realizing we may want to incorporate the EqualWidthDiscretiser and EqualFrequencyDiscretiser. What are your thoughts?

@Morgan-Sell
Copy link
Collaborator Author

hi @solegalli,

With regards to implementation, are you envisioning the following order of operations or something of the sort?

  • var_A is a numeric variable in dataframe X.
  • fit() copies the original values of var_A.
  • A discretizer, e.g. EqualFrequencyDiscretiser, discretizes and assigns a bin number to the respective values. Let's call this variable var_A_disc.
  • MeanEncoder.fit() accepts var_A_disc as X and var_A as y.

In the case of multiple variables, the code would iterate through the numeric variables while performing the abovementioned operations.

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Morgan-Sell

thank you for getting started with this class.

The implementation looks good to me. I think you kind of answered your own questions, right? or is there anything else I can add?

Main thing, for the first implementation I would not include the arbitrary discretizer. The rest looks good.

Thank you!

variables: Union[None, int, str, List[Union[str, int]]] = None,
bins: int = 5,
strategy: str = "equal_frequency",
binning_dict: Dict[Union[str, int], List[Union[str, int]]] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not need this parameter

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to binning_dict? Yes, makes sense, given that we're removing the ArbitraryDiscretiser.

binning_dict: Dict[Union[str, int], List[Union[str, int]]] = None,
errors: str = "ignore",
) -> None:
# TODO: do we include ArbitraryDiscretiser?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to just use the equal bin or equal frequency only.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming equal bin is analogous to equal width.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sorry

@Morgan-Sell Morgan-Sell force-pushed the target_mean_discretiser branch from 7173bbe to b6fac50 Compare April 18, 2022 00:14
@Morgan-Sell
Copy link
Collaborator Author

hi @solegalli,

I'm developing the tests for TargetMeanDiscretiser(). I see that the team has made substantial edits to the unit tests. I'm using the EqualFrequencyDiscretiser and EqualWidthDiscretiser tests as guides.

Both classes have the following test:

X_t = [x for x in range(0, 10)]
assert all(x for x in X["var"].unique() if x not in X_t)

This seems to check whether the transformed variable X["var"] possesses an integer in the range of 0 to 9. What's the test's rationale?

Gracias!

@Morgan-Sell
Copy link
Collaborator Author

@solegalli,

I'm creating the TargetMeanDiscretiser unit tests. I mentioned in an earlier post that I was using the EqualWidthDiscretiser and EqualFrequencyDiscretiser unit tests as guides. However, I think those unit tests do not reflect our most-recent practices, e.g. @pytest.mark.parameterize. At the same time, I see that the team has made significant edits to the tests directory. Which unit tests should I use as a guide?

Gracias!

feature_engine/discretisation/target_mean.py Show resolved Hide resolved
feature_engine/discretisation/target_mean.py Outdated Show resolved Hide resolved
feature_engine/discretisation/target_mean.py Outdated Show resolved Hide resolved
feature_engine/discretisation/target_mean.py Outdated Show resolved Hide resolved
feature_engine/discretisation/target_mean.py Outdated Show resolved Hide resolved
binning_dict: Dict[Union[str, int], List[Union[str, int]]] = None,
errors: str = "ignore",
) -> None:
# TODO: do we include ArbitraryDiscretiser?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sorry

@solegalli
Copy link
Collaborator

which line of code are you referring to?

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Morgan-Sell

Sorry, late response. I added a few comments here and there, minor. We do have the target mean functionality already, no need to code it again.

Basically, we need a pipeline with the discretizer returning the bins as object, followed by the TargetEncoder.

The discretisers tests in general are a bit poor, I agree with you. This is some legacy code. They are some of our first transformers.

Having said this, most tests would be done out-of-the box already by adding the new transformer to the test_check_estimators.py which you have done already.

So for the new tests, I would just ensure that the transformer outputs the expected result.

@solegalli
Copy link
Collaborator

@solegalli,

I'm creating the TargetMeanDiscretiser unit tests. I mentioned in an earlier post that I was using the EqualWidthDiscretiser and EqualFrequencyDiscretiser unit tests as guides. However, I think those unit tests do not reflect our most-recent practices, e.g. @pytest.mark.parameterize. At the same time, I see that the team has made significant edits to the tests directory. Which unit tests should I use as a guide?

Gracias!

I would say, have a look at the tests for the TargetMeanRegressor and TargetMeanClassifier.

@Morgan-Sell
Copy link
Collaborator Author

@solegalli,

Soy un boludo! I misinterpreted the instructions, which are in the "TargetMeanDiscretiser". Consequently, I created the _encode_X() and the rest of the boludese (castellano cientifico).

Let me know about adding the ArbitraryDiscretiser(). I'll update the test pronto!

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Morgan-Sell

This is looking really good thank you.

The tests are failing, any idea which ones and how to fix it?

Also, do you run isort and flake8 on the new files? that would order the imports at the top and tell you which ones are unused, we need to remove to pass the code style tests.

Next steps, documentation! hurray!

We need to expand the docstrings here, and then add the transformer in docs/index and also within the api and user_guide folders, in the last one with a nice demo.

I guess you can take some inspiration from the MeanEncoder's documentation?

("discretiser", self._make_discretiser()),
("encoder", MeanEncoder(
variables=self.variables_numerical_,
ignore_format=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this param should be False.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought ignore_format should be True to encode the integers that represent the bins created by the discretiser. Otherwise, the transformer throws an error because the variables that are being encoded are not strings.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That also works. I think it is safest to tell the discretiser to return object variables (return_object=True) and then use the default parameters of the encoder.

tests/test_discretisation/test_target_mean_discretiser.py Outdated Show resolved Hide resolved
@Morgan-Sell
Copy link
Collaborator Author

hola @solegalli,

I worked on the documentation (still need to review). I've added the class to multiple index.rst files; however, it seems that I'm still missing the table of contents, i.e., toctree. Where is it?

Also, the new testing methodology is really cool! I'm still unclear exactly how it operates though. Maybe we can discuss it sometime?

Abrazo!

@solegalli
Copy link
Collaborator

rked on the documentation (still nee

You need to add the transformer to this file: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/discretisation/index.rst

I am off on hols since Friday, maybe we can do it on my return?

Cheers

X, y = check_X_y(X, y)

# identify numerical variables
self.variables_numerical_ = _find_or_check_numerical_variables(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the attribute should be self.variables_

that should resolve the test failing

@solegalli
Copy link
Collaborator

One of the common tests is failing with this error:

E AttributeError: 'TargetMeanDiscretiser' object has no attribute 'variables_'

@Morgan-Sell
Copy link
Collaborator Author

Gracias, @solegalli!

Enjoy your vacay! And, yes, it would be great to review the tests when you return ;)

@solegalli
Copy link
Collaborator

@Morgan-Sell

Did I completely forget about this PR?

Looks like it is more or less good to go?

@solegalli
Copy link
Collaborator

Ok, I went through this quickly, the docstrings still need to be added.

The idea of this transformer was to perform discretization, and then replace the values by the mean of the target

Now this functionality could be achieved already by combining the equal-frequency or equal-width discretizer with the MeanEncoder in a pipeline.

So I am having second thoughts as to whether we should create this class, when it is possible to obtain the same result with the transformers that we already have.

Would you mind if we keep this PR on hold?

Apologies :/

@solegalli solegalli added the wontfix This will not be worked on label Jul 5, 2022
@Morgan-Sell
Copy link
Collaborator Author

hi @solegalli,

I'll pause on editing the docstrings.

How did you envision this class to be different than combining the equal-frequency or equal-width discretizer with the MeanEncoder in a pipeline? I thought algorithmically this was the purpose.

I guess one way to think of this class as a tool in the feature-engine toolbox that a user may not be aware of until reading the feature-engine docs, assuming people read the documentation in depth ;)

@solegalli
Copy link
Collaborator

Hi @Morgan-Sell

There is a paper in the KDD 2009 data science competition, where the authors discretize all numerical variables, and then replace their values by the target mean. And they use this as predictions to basically select features.

For categorical variables, they replace the categories by the target mean. And they proceed to select features in the same way.

So the MeanEncoder and the TargetMeanDiscretizer aim to do the same thing: replace the categories by the target mean and sort the variables into intervals and then replace the intevals by the target mean.

This is where my thinking came from.

Most users would not be aware of this possibility, because well, they don't know all the literature and probably don't have time to catch up. So having the class ready, is helpful, in the sense that it makes the transformation "visible".

On the other hand, it is more work for us, given that the encoding could be done already with the classes that already exist. So yeah, I am in 2 minds.

What do you think?

@solegalli
Copy link
Collaborator

I made a PR with some changes here:

Morgan-Sell#11

only 2 things remain doing and then we are good to go!

@solegalli solegalli removed the wontfix This will not be worked on label Aug 17, 2022
@solegalli
Copy link
Collaborator

@Morgan-Sell this in reply to whether or not we should move forward. I am still in 2 minds. What's your view?

@Morgan-Sell
Copy link
Collaborator Author

I'm leaning towards creating the transformer.

I see Python packages as toolboxes and like to consider the novice craftsman - e.g. me - when designing the tools. Like you said, people are going to open the toolbox and be amzed by the collection of our wonderful tools. It's unlikely for a novice to make the connection between the numerical discretizer and mean encoder.

Also, the use may be unaware of sklearn's Pipeline class.

I think we've already done most of the work.

Once created, how much maintenance is required for each class? I'm assuming it's not much but could depend on the class/transformer's complexity. I guess this class will need to be updated whenever changes are made to the numerical discretizers or mean encoder.

I'm still leaning towards creating the class for the newbies as I'm one of them ;)

@solegalli
Copy link
Collaborator

Sounds good, Let's finish the IV selector, and then we come back to this one. In the meantime, we sleep on the idea a bit longer :p

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

new discretiser: sorts variables in bins, and replaces bins by target mean value
2 participants