Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/binarizer without column transformer #791

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

d-lazenby
Copy link

@d-lazenby d-lazenby commented Aug 16, 2024

Issue raised here

Notes on Code
The BinaryDiscretiser class is implemented in binariser.py, located with the other discretisers, and takes a parameter threshold to determine where to split the interval.

  • After standard checks and type checks for threshold, there's a check to see if the threshold is in min(x) < threshold < max(x) for each feature x (L167). If not, then x isn't transformed and the user is notified of this. The remaining features are passed to a list for transformation.
  • Because of the above, the transform method from the BaseDiscretiser is repeated here, only iterating through the new list of features that passed the threshold check rather than the list in self.variables_. I'm not sure if there's a cleaner way of doing this. We could also modify the self.variables_ attribute directly in the fit method instead, which might make sense since then it would contain only features that were actually transformed, and there would be no need to re-implement the transform method.

Other notes

  • I updated the docs apart from the user_guide since this might change depending on further changes to the implementation
  • I've tested on an sklearn Pipeline and it seems to work fine but haven't included explicit tests for that as they were missing for the other discretisers. Let me know if that's something you'd want.
  • It might be nice to have functionality where the user can pass a set of different thresholds for each feature passed to the class (could be corresponding lists for threshold and variables parameters, or a dictionary of pairs).
  • The threshold check output is written to stdout at the moment, but this should perhaps be given as a warning instead.

Finally
This is my first time contributing to open source – all feedback is very welcome!

@d-lazenby
Copy link
Author

Hi @solegalli, I forgot to write in the PR, could someone review this please?

Some of the style checks don't pass. I noticed that when I run isort on binariser.py some changes are made to the imports section but then running black on the file changes them back, so that might be the cause of the issue.

]

if failed_threshold_check:
print(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a logger not a print I think

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a warning instead of print. Could we change?

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @d-lazenby

Thank you so much for the PR and sorry for the delayed response. You put a lot of work already well done and thanks!

This class at the moment works like the Binarizer from sklearn, it takes a threshold and then splits the variables in 2 intervals around the threshold. So far so good. The only additional thing that I could consider is whether we allow the user to pass a dictionary with different thresholds per variable for example 0 for vars a and b, and 2 for vars c and d. (See arbitrary discretiser for params at init).

Since this class returns only 2 intervals, I am inclined to suggest that we only allow the output of integers, 0 if it is smaller than threshold and 1 otherwise. I don't see that having boundaries from -inf to threshold and then threshold to inf could be useful for the user. And that would simplify the class a lot.

What do you think? Should we change that?

I left a few comments across the files also. Would you be able to take a look?

Thanks a lot!

----------
{variables}

threshold: int, float, default=None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to default threshold to a value that can actually be used for splitting. Sklearn binarizer defaults to 0, so we could take the same approach and make the threshold default to zero. Could you pls change that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @solegalli, I’m back working on this now and wanted to check something before proceeding.

The functionality that the user passes a dictionary where the features are the keys and the thresholds are the values (float or int) is good. I was wondering if we wanted just this, or this with an additional possibility of passing a list to variables that all default to zero threshold. My feeling is the latter might be a bit unnecessary, so I'd be inclined to insist on a user-inputted dictionary (as with the ArbitraryDiscretiser), remove variables as a user input and circumvent the need for default values for the thresholds but let me know if you disagree. If we take just the user-inputted dict route, we could set self.variables_ to be the keys of the threshold dictionary so the user can easily access them in a way that’s consistent with the other transformers.

n_features_in_=_n_features_in_docstring,
fit_transform=_fit_transform_docstring,
)
class BinaryDiscretiser(BaseDiscretiser):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we should use Binarizer as name, like sklearn's class. Thoughts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started with Binarizer, funnily enough :)

I changed it to be consistent with all the other discretiser class names in feature engine (EqualFrequencyDiscretiser, EqualWidthDiscretiser, etc.) since it looked strange to me to have a single -ize spelling in there and omit the 'Discretiser' part but I'm happy to go with your preference.


{return_object}

{return_boundaries}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to return boundaries? It's just 2 intervals, either it is greater or smaller than threshold. A priori, I think we should drop this paramter.


{return_boundaries}

{precision}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think we don't need precision, because we don't have interval boundaries. The boundaries are -inf to threshold and then threshold to inf. We can drop this one as well.


Attributes
----------
{binner_dict_}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we need a binner dict for this class. We don't have a sequence of boundaries that vary variable per variable.

pandas.cut
sklearn.preprocessing.KBinsDiscretizer

References
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to drop these references.

precision: int = 3,
) -> None:

if threshold is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't allow the default value to be a non permitted value. Default to 0 instead.

# Omit these features from transformation step
failed_threshold_check.append(var)
else:
self.binner_dict_[var] = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the binner dict will be the same for all variables. I am not sure it is useful for the user. We should not create this param

]

if failed_threshold_check:
print(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a warning instead of print. Could we change?

)

# A list of features that satisfy threshold check and will be transformed
self.variables_trans_ = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we normally store the variables that will be transformed in variables_ But I am not convinced that we need to exclude the variables that will not be transformed. This is something for the user to take care of, not feature-engine.

We can raise a warning, but that is as far as I would go.

@d-lazenby
Copy link
Author

d-lazenby commented Aug 25, 2024 via email

@solegalli
Copy link
Collaborator

Sounds good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants