Feat/binarizer without column transformer #791
base: main
Conversation
… for consistency with other discretisers
Hi @solegalli, I forgot to write in the PR, could someone review this please? Some of the style checks don't pass. I noticed that when I run isort on …
if failed_threshold_check:
    print(
        "threshold outside of range for one or more variables."
        f" Features {failed_threshold_check} have not been transformed."
    )
This should be a logger, not a print, I think.
This should be a warning instead of print. Could we change?
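A minimal sketch of that change, reusing the message from this PR (failed_threshold_check is the list built in fit above):

import warnings

if failed_threshold_check:
    # Emit a UserWarning instead of printing to stdout.
    warnings.warn(
        "threshold outside of range for one or more variables."
        f" Features {failed_threshold_check} have not been transformed."
    )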
Hey @d-lazenby
Thank you so much for the PR and sorry for the delayed response. You already put in a lot of work; well done and thanks!
This class at the moment works like the Binarizer from sklearn: it takes a threshold and then splits the variables into 2 intervals around the threshold. So far so good. The only additional thing that I could consider is whether we allow the user to pass a dictionary with different thresholds per variable, for example 0 for vars a and b, and 2 for vars c and d. (See the arbitrary discretiser for the params at init.)
Since this class returns only 2 intervals, I am inclined to suggest that we only allow the output of integers: 0 if the value is smaller than the threshold and 1 otherwise. I don't see how having boundaries from -inf to threshold and then threshold to inf could be useful for the user. And that would simplify the class a lot.
What do you think? Should we change that?
I left a few comments across the files also. Would you be able to take a look?
Thanks a lot!
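For illustration, the per-variable thresholds suggested above might be used like this. This is a sketch only: the dict-valued threshold and the 0/1 integer output are proposals, not the current behaviour of the PR, and it assumes the class is exported from feature_engine.discretisation like the other discretisers.

import pandas as pd
from feature_engine.discretisation import BinaryDiscretiser

X = pd.DataFrame({
    "a": [-1.0, 0.5, 3.0],
    "b": [1.0, -2.0, 5.0],
    "c": [0.0, 4.0, 1.0],
    "d": [2.0, 3.0, 9.0],
})

# Hypothetical dict-valued threshold: 0 for vars a and b, 2 for vars c and d.
transformer = BinaryDiscretiser(threshold={"a": 0, "b": 0, "c": 2, "d": 2})
X_t = transformer.fit_transform(X)
# Each column would then hold 0 where the value is smaller than its
# threshold and 1 otherwise.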
Parameters
----------
{variables}

threshold: int, float, default=None
We need to default threshold to a value that can actually be used for splitting. Sklearn's Binarizer defaults to 0, so we could take the same approach and make the threshold default to zero. Could you please change that?
Hi @solegalli, I’m back working on this now and wanted to check something before proceeding.
The functionality where the user passes a dictionary whose keys are the features and whose values are the thresholds (float or int) is good. I was wondering if we wanted just this, or this plus the additional possibility of passing a list to variables that all default to a zero threshold. My feeling is the latter might be a bit unnecessary, so I'd be inclined to insist on a user-inputted dictionary (as with the ArbitraryDiscretiser), remove variables as a user input, and circumvent the need for default threshold values, but let me know if you disagree. If we take just the user-inputted dict route, we could set self.variables_ to be the keys of the threshold dictionary so the user can easily access them in a way that’s consistent with the other transformers.
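A sketch of that idea inside fit, assuming threshold has already been validated as a dict at init:

# Derive the variables to transform from the threshold dictionary, so
# users can inspect them consistently with the other transformers.
self.variables_ = list(self.threshold.keys())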
    n_features_in_=_n_features_in_docstring,
    fit_transform=_fit_transform_docstring,
)
class BinaryDiscretiser(BaseDiscretiser):
I wonder whether we should use Binarizer as the name, like sklearn's class. Thoughts?
I started with Binarizer, funnily enough :)
I changed it to be consistent with all the other discretiser class names in feature engine (EqualFrequencyDiscretiser, EqualWidthDiscretiser, etc.), since it looked strange to me to have a single -ize spelling in there and to omit the 'Discretiser' part, but I'm happy to go with your preference.
{return_object}

{return_boundaries}
Do we need to return boundaries? It's just 2 intervals: either it is greater or smaller than the threshold. A priori, I think we should drop this parameter.
{return_boundaries}

{precision}
I also think we don't need precision, because we don't have interval boundaries. The boundaries are -inf to threshold and then threshold to inf. We can drop this one as well.
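Concretely, with the 0/1 integer output suggested above, the transform step would reduce to a single comparison per variable; a sketch:

import numpy as np

# 0 where the value is smaller than the threshold, 1 otherwise;
# no interval boundaries, so no precision parameter is needed.
X[var] = np.where(X[var] < self.threshold, 0, 1)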
Attributes
----------
{binner_dict_}
I am not sure we need a binner dict for this class. We don't have a sequence of boundaries that varies from variable to variable.
pandas.cut
sklearn.preprocessing.KBinsDiscretizer

References
We need to drop these references.
    precision: int = 3,
) -> None:

    if threshold is None:
We can't allow the default value to be a non-permitted value. Default to 0 instead.
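A sketch of the init with the suggested default, shown out of its class context and with the parameters trimmed per the comments above:

from typing import List, Union

def __init__(
    self,
    threshold: Union[int, float] = 0,
    variables: Union[None, int, str, List[Union[str, int]]] = None,
    return_object: bool = False,
) -> None:
    # Default to 0, as sklearn's Binarizer does, and reject anything
    # that is not a number.
    if not isinstance(threshold, (int, float)):
        raise ValueError(
            f"threshold must be an int or a float. Got {threshold} instead."
        )
    self.threshold = threshold
    self.variables = variables
    self.return_object = return_object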
        # Omit these features from transformation step
        failed_threshold_check.append(var)
    else:
        self.binner_dict_[var] = [
            float("-inf"),
            np.float64(self.threshold),
            float("inf"),
        ]
The binner dict will be the same for all variables. I am not sure it is useful for the user. We should not create this parameter.
if failed_threshold_check:
    print(
        "threshold outside of range for one or more variables."
        f" Features {failed_threshold_check} have not been transformed."
    )
This should be a warning instead of print. Could we change?
)

# A list of features that satisfy threshold check and will be transformed
self.variables_trans_ = [
We normally store the variables that will be transformed in variables_, but I am not convinced that we need to exclude the variables that will not be transformed. This is something for the user to take care of, not feature-engine.
We can raise a warning, but that is as far as I would go.
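Putting those points together, the check in fit might shrink to a warning with no exclusion and no binner_dict_; a method sketch, shown outside its class:

import warnings

def fit(self, X, y=None):
    # Check the input dataframe, as in the other discretisers.
    X = super().fit(X)

    # Warn about, but do not exclude, variables whose whole range sits on
    # one side of the threshold; handling them is left to the user.
    out_of_range = [
        var
        for var in self.variables_
        if self.threshold < X[var].min() or self.threshold > X[var].max()
    ]
    if out_of_range:
        warnings.warn(
            f"threshold is outside the range of variables {out_of_range}; "
            "they will be constant after transformation."
        )
    return self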
Hi @solegalli, thanks a lot for looking at this, all your suggestions look good. I’m out of action for the next couple of weeks but will be in touch with the code amendments when I’m back. All the best.
Sounds good!
Issue raised here
Notes on Code
The BinaryDiscretiser class is implemented in binariser.py, located with the other discretisers, and takes a parameter threshold to determine where to split the interval. The transform method of BaseDiscretiser is repeated here, only iterating through the new list of features that passed the threshold check rather than the list in self.variables_. I'm not sure if there's a cleaner way of doing this. We could also modify the self.variables_ attribute directly in the fit method instead, which might make sense since it would then contain only the features that were actually transformed, and there would be no need to re-implement the transform method.
Other notes
Finally
This is my first time contributing to open source – all feedback is very welcome!