Feat/binarizer without column transformer #791
base: main
Conversation
… for consistency with other discretisers
Hi @solegalli, I forgot to write in the PR, could someone review this please? Some of the style checks don't pass. I noticed that when I run isort on …
if failed_threshold_check:
    print(
        "threshold outside of range for one or more variables."
        f" Features {failed_threshold_check} have not been transformed."
    )
This should be a logger, not a print, I think.
This should be a warning instead of print. Could we change?
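A minimal sketch of that change, reusing the message from this PR (failed_threshold_check is the list built in fit above):

import warnings

if failed_threshold_check:
    # Emit a UserWarning instead of printing to stdout.
    warnings.warn(
        "threshold outside of range for one or more variables."
        f" Features {failed_threshold_check} have not been transformed."
    )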
Hey @d-lazenby
Thank you so much for the PR and sorry for the delayed response. You already put in a lot of work; well done and thanks!
This class at the moment works like the Binarizer from sklearn: it takes a threshold and then splits the variables into 2 intervals around the threshold. So far so good. The only additional thing that I could consider is whether we allow the user to pass a dictionary with different thresholds per variable, for example 0 for vars a and b, and 2 for vars c and d. (See the arbitrary discretiser for the params at init.)
Since this class returns only 2 intervals, I am inclined to suggest that we only allow the output of integers: 0 if the value is smaller than the threshold and 1 otherwise. I don't see how having boundaries from -inf to threshold and then threshold to inf could be useful for the user. And that would simplify the class a lot.
What do you think? Should we change that?
I left a few comments across the files also. Would you be able to take a look?
Thanks a lot!
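For illustration, the per-variable thresholds suggested above might be used like this. This is a sketch only: the dict-valued threshold and the 0/1 integer output are proposals, not the current behaviour of the PR, and it assumes the class is exported from feature_engine.discretisation like the other discretisers.

import pandas as pd
from feature_engine.discretisation import BinaryDiscretiser

X = pd.DataFrame({
    "a": [-1.0, 0.5, 3.0],
    "b": [1.0, -2.0, 5.0],
    "c": [0.0, 4.0, 1.0],
    "d": [2.0, 3.0, 9.0],
})

# Hypothetical dict-valued threshold: 0 for vars a and b, 2 for vars c and d.
transformer = BinaryDiscretiser(threshold={"a": 0, "b": 0, "c": 2, "d": 2})
X_t = transformer.fit_transform(X)
# Each column would then hold 0 where the value is smaller than its
# threshold and 1 otherwise.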
Parameters
----------
{variables}

threshold: int, float, default=None
We need to default threshold to a value that can actually be used for splitting. Sklearn's Binarizer defaults to 0, so we could take the same approach and make the threshold default to zero. Could you please change that?
Hi @solegalli, I’m back working on this now and wanted to check something before proceeding.
The functionality where the user passes a dictionary whose keys are the features and whose values are the thresholds (float or int) is good. I was wondering if we wanted just this, or this plus the additional possibility of passing a list to variables that all default to a zero threshold. My feeling is the latter might be a bit unnecessary, so I'd be inclined to insist on a user-inputted dictionary (as with the ArbitraryDiscretiser), remove variables as a user input, and circumvent the need for default threshold values, but let me know if you disagree. If we take just the user-inputted dict route, we could set self.variables_ to be the keys of the threshold dictionary so the user can easily access them in a way that’s consistent with the other transformers.
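A sketch of that idea inside fit, assuming threshold has already been validated as a dict at init:

# Derive the variables to transform from the threshold dictionary, so
# users can inspect them consistently with the other transformers.
self.variables_ = list(self.threshold.keys())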
    n_features_in_=_n_features_in_docstring,
    fit_transform=_fit_transform_docstring,
)
class BinaryDiscretiser(BaseDiscretiser):
I wonder whether we should use Binarizer as the name, like sklearn's class. Thoughts?
I started with Binarizer, funnily enough :)
I changed it to be consistent with all the other discretiser class names in feature engine (EqualFrequencyDiscretiser, EqualWidthDiscretiser, etc.), since it looked strange to me to have a single -ize spelling in there and to omit the 'Discretiser' part, but I'm happy to go with your preference.
{return_object}

{return_boundaries}
Do we need to return boundaries? It's just 2 intervals: either it is greater or smaller than the threshold. A priori, I think we should drop this parameter.
{return_boundaries}

{precision}
I also think we don't need precision, because we don't have interval boundaries. The boundaries are -inf to threshold and then threshold to inf. We can drop this one as well.
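Concretely, with the 0/1 integer output suggested above, the transform step would reduce to a single comparison per variable; a sketch:

import numpy as np

# 0 where the value is smaller than the threshold, 1 otherwise;
# no interval boundaries, so no precision parameter is needed.
X[var] = np.where(X[var] < self.threshold, 0, 1)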
Attributes
----------
{binner_dict_}
I am not sure we need a binner dict for this class. We don't have a sequence of boundaries that varies from variable to variable.
pandas.cut
sklearn.preprocessing.KBinsDiscretizer

References
We need to drop these references.
    precision: int = 3,
) -> None:

    if threshold is None:
We can't allow the default value to be a non-permitted value. Default to 0 instead.
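A sketch of the init with the suggested default, shown out of its class context and with the parameters trimmed per the comments above:

from typing import List, Union

def __init__(
    self,
    threshold: Union[int, float] = 0,
    variables: Union[None, int, str, List[Union[str, int]]] = None,
    return_object: bool = False,
) -> None:
    # Default to 0, as sklearn's Binarizer does, and reject anything
    # that is not a number.
    if not isinstance(threshold, (int, float)):
        raise ValueError(
            f"threshold must be an int or a float. Got {threshold} instead."
        )
    self.threshold = threshold
    self.variables = variables
    self.return_object = return_object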
        # Omit these features from transformation step
        failed_threshold_check.append(var)
    else:
        self.binner_dict_[var] = [
            float("-inf"),
            np.float64(self.threshold),
            float("inf"),
        ]
The binner dict will be the same for all variables. I am not sure it is useful for the user. We should not create this parameter.
if failed_threshold_check:
    print(
        "threshold outside of range for one or more variables."
        f" Features {failed_threshold_check} have not been transformed."
    )
This should be a warning instead of print. Could we change?
)

# A list of features that satisfy threshold check and will be transformed
self.variables_trans_ = [
We normally store the variables that will be transformed in variables_, but I am not convinced that we need to exclude the variables that will not be transformed. This is something for the user to take care of, not feature-engine.
We can raise a warning, but that is as far as I would go.
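Putting those points together, the check in fit might shrink to a warning with no exclusion and no binner_dict_; a method sketch, shown outside its class:

import warnings

def fit(self, X, y=None):
    # Check the input dataframe, as in the other discretisers.
    X = super().fit(X)

    # Warn about, but do not exclude, variables whose whole range sits on
    # one side of the threshold; handling them is left to the user.
    out_of_range = [
        var
        for var in self.variables_
        if self.threshold < X[var].min() or self.threshold > X[var].max()
    ]
    if out_of_range:
        warnings.warn(
            f"threshold is outside the range of variables {out_of_range}; "
            "they will be constant after transformation."
        )
    return self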
Hi @solegalli, thanks a lot for looking at this, all your suggestions look good. I’m out of action for the next couple of weeks but will be in touch with the code amendments when I’m back. All the best.
Sounds good!
Issue raised here
Notes on Code
The BinaryDiscretiser class is implemented in binariser.py, located with the other discretisers, and takes a parameter threshold to determine where to split the interval. The transform method of BaseDiscretiser is repeated here, only iterating through the new list of features that passed the threshold check rather than the list in self.variables_. I'm not sure if there's a cleaner way of doing this. We could also modify the self.variables_ attribute directly in the fit method instead, which might make sense since it would then contain only the features that were actually transformed, and there would be no need to re-implement the transform method.
Other notes
Finally
This is my first time contributing to open source – all feedback is very welcome!